Downloader and sentence alignment assistant for parallel Karelian and Finnish texts from Yleisradio.
root
- align-assistent
- commons
- downloader
- stats
- target
- CHANGELOG.md
- LICENSE
- README.md
- Setup.hs
- yle-uudizet-karjalakse.cabal
yle-uudizet-karjalakse
This toolset lets you download Karelian news articles from Yleisradio together with their Finnish twin articles. It also provides a tool for manual sentence alignment.
Downloader
To download a specific article from Yle, run the executable downloader with the command dump1 and add the URI as a command-line argument:
> cabal new-run downloader -- dump1 https://yle.fi/the-article-path
There is also the command dump2, which downloads a Karelian article and its Finnish twin article:
> cabal new-run downloader -- dump2 https://yle.fi/the-karelian-article
The command list prints the URLs of all articles that belong to a given subject to standard output:
> cabal new-run downloader -- list https://yle.fi/the-subject-listing
The command dumpall1 downloads all articles belonging to a given subject:
> cabal new-run downloader -- dumpall1 https://yle.fi/the-subject-listing
The command dumpall2 does the same, but also downloads the Finnish twin articles:
> cabal new-run downloader -- dumpall2 https://yle.fi/the-subject-listing
The commands dump1 and dumpall1 save their results to the folder target/dumps, whilst dump2 and dumpall2 save their results to the folder target/parallel.
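The relationship between commands and output folders can be summarised in a small Haskell sketch. The type Command and the functions parseCommand and outputDir are hypothetical names chosen for illustration; they are not taken from the downloader's source, which may be organised differently.

```haskell
-- Hypothetical model of the downloader commands described above.
-- These names are illustrative and not taken from the real source code.
data Command = Dump1 | Dump2 | List | DumpAll1 | DumpAll2
  deriving (Show, Eq)

-- Parse a command name as it is typed on the command line.
parseCommand :: String -> Maybe Command
parseCommand "dump1"    = Just Dump1
parseCommand "dump2"    = Just Dump2
parseCommand "list"     = Just List
parseCommand "dumpall1" = Just DumpAll1
parseCommand "dumpall2" = Just DumpAll2
parseCommand _          = Nothing

-- Folder that a command saves its results to. The list command only
-- prints to standard output, so it has no output folder.
outputDir :: Command -> Maybe FilePath
outputDir Dump1    = Just "target/dumps"
outputDir DumpAll1 = Just "target/dumps"
outputDir Dump2    = Just "target/parallel"
outputDir DumpAll2 = Just "target/parallel"
outputDir List     = Nothing
```

In GHCi, evaluating parseCommand "dumpall2" >>= outputDir yields the folder for that command, and an unknown command name yields Nothing.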
Alignment
For sentence alignment it's practical to use the align-assistent:
> cabal new-run align-assistent
It traverses target/parallel and lets you sentence-align the articles one by one. The first article is loaded, split into sentences and written to the two text files target/tmp/karelian and target/tmp/finnish. You can open them with any text editor. Each sentence is on its own line. Edit the files so that the nth line of the Karelian article corresponds to the nth line of the Finnish article. When you're done, save both files and type next into the assistant. The aligned article is stored in target/aligned in a JSON format, and the next article is loaded in the same way as the first one.
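The line-by-line correspondence can be checked with a few lines of Haskell before typing next. This is only a sketch: alignLines and checkTmpFiles are hypothetical helpers, not part of the toolset, and this README does not say whether the assistant performs a similar check itself.

```haskell
-- Pair the nth Karelian line with the nth Finnish line.
-- Returns Left with both line counts if the texts differ in length,
-- which suggests that the manual alignment is not finished yet.
alignLines :: String -> String -> Either (Int, Int) [(String, String)]
alignLines karelian finnish
  | nK == nF  = Right (zip ks fs)
  | otherwise = Left (nK, nF)
  where
    ks = lines karelian
    fs = lines finnish
    nK = length ks
    nF = length fs

-- Hypothetical helper: read the two temporary files and report whether
-- their line counts match. Assumes the locale encoding is UTF-8.
checkTmpFiles :: IO ()
checkTmpFiles = do
  k <- readFile "target/tmp/karelian"
  f <- readFile "target/tmp/finnish"
  case alignLines k f of
    Right pairs ->
      putStrLn (show (length pairs) ++ " aligned sentence pairs")
    Left (nK, nF) ->
      putStrLn ("Line counts differ: " ++ show nK ++ " Karelian vs "
                ++ show nF ++ " Finnish")
```

Loading this into GHCi from the project root and running checkTmpFiles reports either the number of sentence pairs or the mismatching line counts.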
If you want to take a break, you can stop the assistant with the quit command and restart it later. The assistant will resume with the same article.
Statistics
To see how many aligned sentences you already have, try the stats executable:
> cabal new-run stats