A variant of tokenizer-monad that supports streaming.


tokenizer-streaming

Motivation: You might have stumbled upon the package tokenizer-monad. It is another project of mine for writing tokenizers that operate on pure, in-memory text/strings. However, there are situations in which you cannot keep all the text in memory: you might want to tokenize text from network streams or from large corpus files.

Main idea: A monad transformer called TokenizerT implements exactly the same methods as Tokenizer from tokenizer-monad, so that all existing tokenizers can be ported without code changes (provided you used MonadTokenizer in your type signatures).
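
To illustrate what such a portable tokenizer looks like, here is a minimal sketch of a whitespace tokenizer written only against the MonadTokenizer class. The method and runner names (pop, discard, walkWhile, emit, untilEOT, runTokenizer) follow the tokenizer-monad documentation as recalled here; treat the exact names and signatures as assumptions and check the Haddocks of the version you use.

  {-# LANGUAGE OverloadedStrings #-}
  import Control.Monad.Tokenizer        -- from tokenizer-monad
  import qualified Data.Text as T
  import Data.Char (isSpace)

  -- A whitespace tokenizer written only against the MonadTokenizer class,
  -- so the same definition can be run purely or over a stream.
  -- (The class-method names are assumed from the tokenizer-monad docs.)
  wordTokenizer :: MonadTokenizer m => m ()
  wordTokenizer = untilEOT $ do
    c <- pop                        -- consume the next character
    if isSpace c
      then discard                  -- throw away whitespace
      else do
        walkWhile (not . isSpace)   -- extend the token up to the next space
        emit                        -- emit the accumulated token

  -- Pure usage with tokenizer-monad's runner:
  main :: IO ()
  main = print (runTokenizer wordTokenizer ("hello streaming world" :: T.Text))

Because wordTokenizer is polymorphic in MonadTokenizer, the very same definition can be handed to the run function of TokenizerT from this package instead of runTokenizer; only the runner call changes (see the Haddock documentation for its exact name and signature).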

Supported text types

  • streams of Char lists can be tokenized into streams of Char lists
  • streams of strict Text can be tokenized into streams of strict Text (see the sketch after this list)
  • streams of lazy Text can be tokenized into streams of lazy Text
  • streams of strict ASCII ByteStrings can be tokenized into streams of strict ASCII ByteStrings
  • streams of lazy ASCII ByteStrings can be tokenized into streams of lazy ASCII ByteStrings
  • bytestring streams (from streaming-bytestring) with Unicode encodings (UTF-8, UTF-16 LE & BE, UTF-32 LE & BE) can be tokenized into streams of strict Text
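
As a sketch of the stream-in/stream-out shape in the strict Text case above, the following stand-in uses the streaming package directly. The naiveTokenize helper is hypothetical and deliberately simplistic: it splits every chunk independently and therefore breaks tokens that span chunk boundaries, which is precisely the problem TokenizerT solves.

  {-# LANGUAGE OverloadedStrings #-}
  import           Streaming (Stream, Of)
  import qualified Streaming.Prelude as S
  import qualified Data.Text as T

  -- Stand-in with the same shape as a streaming tokenizer: a stream of
  -- strict Text chunks goes in, a stream of tokens comes out. Unlike
  -- TokenizerT, it splits each chunk on its own, so a token crossing a
  -- chunk boundary (e.g. "wo" ++ "rld") would come out in pieces.
  naiveTokenize :: Monad m => Stream (Of T.Text) m r -> Stream (Of T.Text) m r
  naiveTokenize = S.concat . S.map T.words

  main :: IO ()
  main = S.print (naiveTokenize (S.each ["hello streaming ", "world"]))
  -- prints "hello", "streaming", "world"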