Downloader and sentence alignment assistant for parallel Karelian and Finnish texts from Yleisradio.
This toolset lets you download Karelian news articles from Yleisradio together with their Finnish twin articles. It also provides a tool for manual sentence alignment.
To download a specific article from Yle, run the executable downloader with the command dump1 and add the URI as a command-line argument:
> cabal new-run downloader -- dump1 https://yle.fi/the-article-path
There is also the command dump2, which downloads a Karelian article and its Finnish twin article:
> cabal new-run downloader -- dump2 https://yle.fi/the-karelian-article
list prints the URLs of all articles that belong to a subject to standard output:
> cabal new-run downloader -- list https://yle.fi/the-subject-listing
dumpall1 downloads all articles belonging to a given subject:
> cabal new-run downloader -- dumpall1 https://yle.fi/the-subject-listing
dumpall2 does the same, but also downloads the Finnish twin articles:
> cabal new-run downloader -- dumpall2 https://yle.fi/the-subject-listing
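If only part of a subject is needed, list and dump2 can be combined in a small shell loop. The following is a minimal sketch, not part of the toolset: dump_first_n and the downloader wrapper are hypothetical helpers, and it assumes that list prints one article URL per line starting with https://.

```shell
# Thin wrapper around the invocation documented above.
downloader() {
  cabal new-run downloader -- "$@"
}

# Hypothetical helper: download the first N Karelian articles of a
# subject listing together with their Finnish twin articles.
dump_first_n() {
  listing_url=$1
  count=$2
  # Assumption: URLs appear one per line. The grep drops any other lines
  # (for example build messages) that may reach standard output.
  downloader list "$listing_url" | grep '^https://' | head -n "$count" |
    while IFS= read -r url; do
      # Redirect stdin so the download cannot consume the remaining URLs.
      downloader dump2 "$url" </dev/null
    done
}
```

For example, dump_first_n https://yle.fi/the-subject-listing 5 would fetch at most five article pairs.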
dumpall1 save their results to the folder
dumpall2 save their results to the folder
For sentence alignment, it is practical to use the align-assistent executable:
> cabal new-run align-assistent
It traverses target/parallel and lets you sentence-align the articles one by one. The first article is loaded, split into sentences, and stored in the two text files
target/tmp/finnish. You can open them with an ordinary text editor. Each sentence is on its own line. Edit the files so that the nth line of the Karelian article corresponds to the nth line of the Finnish article. When you're done, save both files and type next into the assistant. The aligned article is stored in target/aligned in a JSON format, and the next article is loaded in the same way as the first one.
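Before typing next, it can help to confirm that both edited files have the same number of lines, since the nth lines are paired. The following is a minimal sketch, not part of the toolset: check_alignment is a hypothetical helper, and the caller passes in the paths of the two text files.

```shell
# Hypothetical helper: verify that two sentence files have the same
# number of lines, then print the sentence pairs side by side.
check_alignment() {
  karelian_file=$1
  finnish_file=$2
  # awk counts a final line even if it lacks a trailing newline.
  k=$(awk 'END { print NR }' "$karelian_file")
  f=$(awk 'END { print NR }' "$finnish_file")
  if [ "$k" -ne "$f" ]; then
    echo "line counts differ: $k Karelian vs $f Finnish" >&2
    return 1
  fi
  # paste joins the nth lines of both files, separated by a tab.
  paste "$karelian_file" "$finnish_file"
}
```

Called with the Karelian file as the first argument and the Finnish file as the second, it either prints the paired sentences or reports the mismatch and returns a non-zero status.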
If you want to take a break, you can stop the assistant with the quit command and restart it later. The assistant will resume with the same article.
To see how many aligned sentences you already have, try the stats executable:
> cabal new-run stats