Downloader and sentence alignment assistant for parallel Karelian and Finnish texts from Yleisradio.



This toolset lets you download Karelian news articles from Yleisradio together with their Finnish twin articles. It also provides a tool for manual sentence alignment.


To download a specific article from Yle, run the executable downloader with the command dump1 and add the URI as a command-line argument:

> cabal new-run downloader -- dump1 <URI>

There is also the command dump2, which downloads a Karelian article and its Finnish twin article:

> cabal new-run downloader -- dump2

The command list prints the URLs of all articles that belong to a given subject to standard output:

> cabal new-run downloader -- list

The command dumpall1 downloads all articles belonging to a given subject:

> cabal new-run downloader -- dumpall1

The command dumpall2 does the same, but also downloads the Finnish twin articles:

> cabal new-run downloader -- dumpall2

The commands dump1 and dumpall1 save their results to the folder target/dumps, whilst dump2 and dumpall2 save their results to the folder target/parallel.


For sentence alignment, use the align-assistent executable:

> cabal new-run align-assistent

It traverses target/parallel and lets you sentence-align the articles one by one. The first article is loaded, split into sentences, and written to two text files, target/tmp/karelian and target/tmp/finnish. You can open them with any text editor. Each sentence is on its own line. Edit the files so that the nth line of the Karelian article corresponds to the nth line of the Finnish article. When you're done, save both files and type next into the assistant. The aligned article is stored in target/aligned in a JSON format, and the next article is loaded in the same way as the first one.
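
Because the alignment is line-based, it can help to confirm that both files have the same number of lines before typing next. The following is a minimal sketch of such a check. The function check_alignment is a hypothetical helper and not part of this toolset. It only compares line counts and cannot tell whether the lines actually correspond.

```shell
# Hypothetical helper (not part of the toolset): compare the line counts
# of the two temporary files that the alignment step asks you to edit.
# The default paths are the ones named in this README.
check_alignment() {
  ca_karelian="${1:-target/tmp/karelian}"
  ca_finnish="${2:-target/tmp/finnish}"

  # Fail early if either file is missing or unreadable.
  for ca_file in "$ca_karelian" "$ca_finnish"; do
    if [ ! -r "$ca_file" ]; then
      echo "Cannot read $ca_file" >&2
      return 2
    fi
  done

  # grep -c '' counts every line, including a last line that has no
  # trailing newline. It exits non-zero on an empty file, hence || true.
  ca_k_lines=$(grep -c '' "$ca_karelian" || true)
  ca_f_lines=$(grep -c '' "$ca_finnish" || true)

  if [ "$ca_k_lines" -ne "$ca_f_lines" ]; then
    echo "Line counts differ: karelian=$ca_k_lines finnish=$ca_f_lines" >&2
    return 1
  fi
  echo "Both files have $ca_k_lines lines."
}
```

Called without arguments from the project root, check_alignment inspects target/tmp/karelian and target/tmp/finnish. To read the sentence pairs side by side, the standard paste utility can print both files in two tab-separated columns: paste target/tmp/karelian target/tmp/finnish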

If you want to take a break, you can stop the assistant with the quit command and restart it later. The assistant will resume with the same article.


To see how many aligned sentences you already have, try the stats executable:

> cabal new-run stats