A variant of tokenizer-monad that supports streaming.
Motivation: You might have stumbled upon the package tokenizer-monad, another project of mine for writing tokenizers that act on pure text/strings. However, there are situations where you cannot keep all the text in memory, for example when tokenizing text from network streams or from large corpus files.
Main idea: A monad transformer called TokenizerT implements exactly the same methods as Tokenizer from tokenizer-monad, such that all tokenizers can be ported without code changes (if you used MonadTokenizer in the type signatures).
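To illustrate why a class-constrained type signature makes porting free, here is a toy model that uses only base. It is a sketch of the principle, not the API of tokenizer-monad or of this package: the class `MonadTok` and the types `PureTok`, `StreamTok` and `Env` are hypothetical stand-ins for MonadTokenizer, Tokenizer and TokenizerT. The tokenizer `wordsTok` is written once against the class and then run both on an in-memory string and on input that arrives in chunks.

```haskell
import Data.Char (isSpace)
import Data.IORef

-- Hypothetical stand-in for MonadTokenizer: all a tokenizer may rely on.
class Monad m => MonadTok m where
  nextChar :: m (Maybe Char)  -- consume one character, Nothing at end of input
  emitTok  :: String -> m ()  -- hand a finished token to the consumer

-- Written once against the class: splits the input on whitespace.
wordsTok :: MonadTok m => m ()
wordsTok = go ""
  where
    go acc = do
      mc <- nextChar
      case mc of
        Nothing -> flush acc
        Just c
          | isSpace c -> flush acc >> go ""
          | otherwise -> go (c : acc)
    flush acc = if null acc then pure () else emitTok (reverse acc)

-- Pure runner: a state monad over (remaining input, tokens in reverse order).
newtype PureTok a = PureTok
  { runPureTok :: (String, [String]) -> (a, (String, [String])) }

instance Functor PureTok where
  fmap f (PureTok g) = PureTok $ \s -> let (a, s') = g s in (f a, s')

instance Applicative PureTok where
  pure a = PureTok $ \s -> (a, s)
  PureTok f <*> PureTok g = PureTok $ \s ->
    let (h, s')  = f s
        (a, s'') = g s'
    in (h a, s'')

instance Monad PureTok where
  PureTok g >>= k = PureTok $ \s -> let (a, s') = g s in runPureTok (k a) s'

instance MonadTok PureTok where
  nextChar = PureTok $ \(inp, out) -> case inp of
    []     -> (Nothing, (inp, out))
    c : cs -> (Just c, (cs, out))
  emitTok t = PureTok $ \(inp, out) -> ((), (inp, t : out))

tokenizePure :: String -> [String]
tokenizePure inp = reverse (snd (snd (runPureTok wordsTok (inp, []))))

-- Streaming runner: input is pulled chunk by chunk, tokens go to a sink.
data Env = Env
  { pull   :: IO (Maybe String)  -- next chunk, Nothing when exhausted
  , buffer :: IORef String       -- unread rest of the current chunk
  , sink   :: String -> IO ()    -- receives each emitted token
  }

newtype StreamTok a = StreamTok { runStreamTok :: Env -> IO a }

instance Functor StreamTok where
  fmap f (StreamTok g) = StreamTok (fmap f . g)

instance Applicative StreamTok where
  pure a = StreamTok $ \_ -> pure a
  StreamTok f <*> StreamTok g = StreamTok $ \e -> f e <*> g e

instance Monad StreamTok where
  StreamTok g >>= k = StreamTok $ \e -> g e >>= \a -> runStreamTok (k a) e

instance MonadTok StreamTok where
  nextChar = StreamTok loop
    where
      loop env = do
        buf <- readIORef (buffer env)
        case buf of
          c : cs -> writeIORef (buffer env) cs >> pure (Just c)
          [] -> do
            chunk <- pull env
            case chunk of
              Nothing -> pure Nothing
              Just s  -> writeIORef (buffer env) s >> loop env
  emitTok t = StreamTok $ \env -> sink env t

-- Demo driver: feeds a list of chunks and collects the emitted tokens.
tokenizeChunks :: [String] -> IO [String]
tokenizeChunks chunks = do
  src <- newIORef chunks
  buf <- newIORef ""
  out <- newIORef []
  let pullChunk = atomicModifyIORef' src $ \cs -> case cs of
        []     -> ([], Nothing)
        x : xs -> (xs, Just x)
  runStreamTok wordsTok (Env pullChunk buf (\t -> modifyIORef' out (t :)))
  reverse <$> readIORef out
```

In this model, `tokenizePure "hello world foo"` and `tokenizeChunks ["hello wo", "rld foo"]` both yield `["hello","world","foo"]`, even though the second input splits a word across a chunk boundary. Had `wordsTok` been given the concrete type `PureTok ()`, it would have needed a new signature to run in `StreamTok`, which is the situation the MonadTokenizer remark above is about.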
Supported text types
- streams of Char lists can be tokenized into streams of Char lists
- streams of strict Text can be tokenized into streams of strict Text
- streams of lazy Text can be tokenized into streams of lazy Text
- streams of strict ASCII ByteStrings can be tokenized into streams of strict ASCII ByteStrings
- streams of lazy ASCII ByteStrings can be tokenized into streams of lazy ASCII ByteStrings
- bytestring streams (from streaming-bytestring) with Unicode encodings (UTF-8, UTF-16 LE & BE, UTF-32 LE & BE) can be tokenized into streams of strict Text