An internet wayback machine. Run your own instance! (https://doomanddarkness.eu/wiki/article/Garble)
Garble web time machine
Garble is an internet wayback machine ready for your local setup! It is written in Haskell and backed by a Postgres database.
Dependencies: persistent, conduit, yesod, warp, http-conduit, tagstream-conduit, among others.
Get started
Database setup
Install and start the PostgreSQL server. Create a user "garble" with login capability and a database "garble" owned by that user. The schema will be created automatically on first start.
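One way to do this is via psql as the PostgreSQL superuser; the role and database names come from above, while the password is a placeholder you should change:

```shell
# Create the role and database; run as a PostgreSQL superuser.
# The password is a placeholder -- pick your own.
sudo -u postgres psql <<'SQL'
CREATE ROLE garble LOGIN PASSWORD 'changeme';
CREATE DATABASE garble OWNER garble;
SQL
```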
Garble
Download and compile Garble using cabal. Example:
cabal sandbox init
cabal install --dependencies-only
cabal build
Use the admin tool to set up the database schema and your preferences. Example:
cabal run admin -- shell
[... migrations ...]
0/0> set directory "/var/garble"
Okay, set.
0/0> set admin "me@myself.com"
Okay, set.
0/0> set recent for 96 hours
Okay, set.
If you like, you can already add a download job:
0/0> enqueue "https://example.com/"
New: "https://example.com/"
Job id: 1
In the default configuration, Garble will recurse three levels on the same host, and one level into outgoing links. TODO: document how to change this.
The daemons
To actually execute the queued download jobs, start the garbled daemon:
cabal run garbled
Garbled will now download the queued URIs and store the results on disk. HTML documents are scanned for hyperlinks and included resources (such as style sheets, images, and scripts), which are added to the download queue.
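The real extraction uses a proper HTML parser (tagstream-conduit). As a rough illustration of the idea only, a crude grep over href/src attributes looks like this (the function name and file path are made up):

```shell
# Crude illustration of link extraction; Garble itself parses HTML
# properly via tagstream-conduit. Pulls out href/src attribute values.
extract_links() {
  grep -oE '(href|src)="[^"]*"' "$1" | sed -E 's/^(href|src)="(.*)"$/\2/'
}

# Example document:
cat > /tmp/garble-demo.html <<'HTML'
<a href="http://example.net/">link</a>
<img src="/static/tree.png">
HTML
extract_links /tmp/garble-demo.html
```

Both absolute links and relative resource paths come out, matching the two kinds of targets the downloader has to queue.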
For the web interface we need yet another daemon:
cabal run delivery
The delivery daemon will listen on localhost:3020 and accept the following routes:
/c/${CID} -- get the document content with content id ${CID}
/d/${DID} -- get the content of the document with document id ${DID}
/h/${HASH} -- get the document content with store hash ${HASH}
/t/${DATETIME}?uri=${URI} -- get the document content for URI ${URI} closest to ${DATETIME}
/l?uri=${URI} -- get the last known document content for URI ${URI}
Human users will most commonly use one of the latter two routes. The content id and document id are mainly useful for debugging. The hash route is used for included style sheets and images.
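As an illustration, requests for the two human-facing routes can be built like this, assuming the default localhost:3020 from above; the timestamp and URI are made-up examples:

```shell
# Build request URLs for the delivery daemon (default localhost:3020).
# Timestamp and URI below are hypothetical examples.
BASE="http://localhost:3020"
TS="2018-03-01T20:00:00"
URI="http://example.com/"

# Document content closest to the given point in time:
echo "$BASE/t/$TS?uri=$URI"
# Last known document content for the URI:
echo "$BASE/l?uri=$URI"
# With the daemon running, fetch e.g.: curl "$BASE/l?uri=$URI"
```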
If the delivered content is an HTML document, all contained hyperlinks and resource references are adapted to point to the closest matching known content. If the target is not yet known to Garble, an absolute URI to the original location is inserted instead.
Example: You request "/t/2018-03-01T20:00:00?uri=http://example.com/". The page originally contains a hyperlink to http://example.net/, which is known to Garble. Hence the link is replaced by "/t/2018-03-01T20:00:00?uri=http://example.net/".
It might also contain a hyperlink to the relative path "/some/strange/things", which is not tracked by Garble. Hence the link is replaced by "http://example.com/some/strange/things".
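The rule behind these two examples can be sketched as follows; this is an illustration only, not Garble's actual code, and the rewrite function and its arguments are made up:

```shell
# Illustrative sketch of the link-rewriting rule; NOT Garble's code.
# $1 = link target as found in the document, $2 = "yes" if Garble
# knows the target. Timestamp and origin match the example above.
ts="2018-03-01T20:00:00"
origin="http://example.com"

rewrite() {
  local target="$1" known="$2" abs
  abs="$target"
  case "$target" in /*) abs="$origin$target" ;; esac  # resolve root-relative paths
  if [ "$known" = yes ]; then
    echo "/t/$ts?uri=$abs"      # point at the archived snapshot
  else
    echo "$abs"                 # fall back to the original location
  fi
}

rewrite "http://example.net/" yes
rewrite "/some/strange/things" no
```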
Admin stuff
The 'admin' tool provides an easy-to-use interface for common administration tasks.
Add a new download job
The 'enqueue' command adds a new job to the queue:
42/255> enqueue "http://example.com/"
New: "http://example.com/"
Job id: 256
Change the store location
The store location may be changed using the 'set directory' command:
42/256> set directory "/the/new/location"
However, this only affects new files. To move already downloaded files to the new location, use the 'move' command in the admin tool:
42/256> move
Both actions may be combined:
42/256> move "/the/new/location"
Remove duplicate content
The downloader automatically tries to avoid downloading duplicate content by observing the Last-Modified header. However, some sites don't provide that header, and the same content sometimes appears at different URIs, so some duplicate content will pile up over time. As a countermeasure, Garble provides a deduplication command:
42/256> dedup
Removed duplicate /z/garble//2018-03-25/2018032504df9b0f9c578733239a891bbfbd98518cda16a1670b655921ed5f3928d10ef88f633e0b90e61d939f5da4232d6b1803cbae62d59a51c0a4297ff34b8bd4d760alldagif.gz
Removed duplicate /z/garble//2018-03-25/20180325a993084c7423d75dbf6648f4e5a375acf2830fa56d5cd1903043f6b4d03d0b57e7a4df600ae631b4804ebb138d8728109aceb7a009b9eb3228550b9f592d6c9653322gif.gz
[...]
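Conceptually, duplicates are files with identical content. A toy stand-in for the idea (not Garble's implementation, which works on its own store) is to hash each file and list every file whose content hash was already seen:

```shell
# Toy illustration of content-based deduplication; NOT Garble's code.
# Prints every argument whose content duplicates an earlier argument.
list_duplicates() {
  sha256sum "$@" | awk 'seen[$1]++ { print $2 }'
}

# Example: /tmp/b duplicates /tmp/a, /tmp/c is unique.
printf 'same' > /tmp/a; printf 'same' > /tmp/b; printf 'other' > /tmp/c
list_duplicates /tmp/a /tmp/b /tmp/c
```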
List the current queue
Table columns are, in order: job id, date and time of job creation, date and time the job was queued for immediate execution, permitted recursion levels (same host/outgoing), and URI.
42/256> list queue
43 2018-03-24T15:38:00 2018-03-25T03:05:19 2/1 https://example.com/home
44 2018-03-24T15:38:00 2018-03-25T03:05:20 1/1 https://example.net/static/tree.png
[...]
List recently finished jobs
Table columns are, in order: job id, date and time of job creation, date and time the job finished, measured file size, transmitted MIME type, and URI.
42/256> list recent
42 2018-03-24T15:38:34 2018-03-25T03:09:02 137 KiB text/html; charset=utf-8 https://example.com/articles/2/
41 2018-03-24T15:38:34 2018-03-25T03:09:01 146 KiB image/png https://example.net/static/car.png
Continuous progress updates
Use "follow" as a command-line argument to get a progress update every 30 seconds:
cabal run admin -- follow
Finished 117270 jobs out of 151218, that's 77 %
Finished 117338 jobs out of 151218, that's 77 %
Finished 117402 jobs out of 151218, that's 77 %
Finished 117461 jobs out of 151218, that's 77 %
[...]