Library for parsing epub document metadata (Haskell)
#1 Encoding issue when reading package xml from archive
Hi,
first of all: thank you very much for the library!
I'm encountering an encoding issue when reading a package XML from a zip archive or a ByteString via Codec.Epub.IO.getPkgPathXmlFromZip or Codec.Epub.IO.getPkgPathXmlFromBS. I'm trying to read EPUBs with special characters in the title or author. For example:
<dc:creator opf:file-as="Dicker, Joël" opf:role="aut">Dicker, Joël</dc:creator>
is scrambled to:
<dc:creator opf:file-as="Dicker, JoÃ«l" opf:role="aut">Dicker, JoÃ«l</dc:creator>
I traced the problem to the conversion from Lazy.ByteString to String in fileFromArchive. I have a fix for it and will try to submit a pull request once I've figured out how to do that with darcs :)
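To illustrate the kind of problem described here, a minimal sketch (not the actual patch): it assumes the bytestring and text packages, and the names naiveToString, utf8ToString and sample are made up for this example. Unpacking a ByteString byte by byte turns every byte into its own Char, so the two UTF-8 bytes of "ë" come out as two separate characters; decoding as UTF-8 first avoids that.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import Data.Text.Encoding.Error (lenientDecode)
import Data.Text.Lazy.Encoding (decodeUtf8With)

-- Hypothetical helper: byte-wise conversion. Every byte becomes one Char,
-- so multi-byte UTF-8 sequences are split into several characters.
naiveToString :: BL.ByteString -> String
naiveToString = BLC.unpack

-- Hypothetical helper: decode the bytes as UTF-8 before converting to String.
-- Invalid byte sequences are replaced with U+FFFD instead of throwing.
utf8ToString :: BL.ByteString -> String
utf8ToString = TL.unpack . decodeUtf8With lenientDecode

-- The UTF-8 bytes of "Joël" (ë is encoded as 0xC3 0xAB).
sample :: BL.ByteString
sample = BL.pack [0x4A, 0x6F, 0xC3, 0xAB, 0x6C]
```

With these definitions, naiveToString sample has five characters (the scrambled form), while utf8ToString sample has four.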
Update: I'm currently getting
user error (Text.Regex.Posix.String died: (ReturnCode 17,"illegal byte sequence"))
with and without my changes when running the test cases.
Okay, this seems to be related: http://bugs.darcs.net/issue2357
Switching the dependency regex-compat to regex-compat-tdfa solves it for me. I'm not sure why I didn't have this problem before when installing this package directly from Stackage. I will update my branch with a new patch.
See the patch for regex-compat: http://hub.darcs.net/MaxDaten/epub-metadata/patch/ddc5835ad8efcd81348792813f7b8f47a0d7e6cf
I think there is one problem with the current solution: it assumes every package XML is UTF-8 encoded, but afaik this is not enforced by the EPUB (3) standard. For current EPUBs I think it is fair to assume UTF-8 encoded content, but I think older EPUBs aren't always UTF-8 encoded. A proper solution would be to detect the file encoding from the encoding declaration at the top of the XML file and select the matching decoder. So please consider these patches more a proof of concept than an ideal solution. But maybe this will be enough for now.
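A rough sketch of what such a detection could look like, again assuming the bytestring and text packages; declaredEncoding and decodePackageXml are hypothetical names and not part of the library. It only distinguishes ISO-8859-1 from UTF-8, does not handle UTF-16 or byte order marks, and expects the declaration to be written without spaces around the "=".

```haskell
import Data.Char (toLower)
import Data.List (isPrefixOf, stripPrefix, tails)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import Data.Text.Encoding.Error (lenientDecode)
import Data.Text.Lazy.Encoding (decodeLatin1, decodeUtf8With)

-- Hypothetical helper: the encoding name from the XML declaration,
-- lower-cased, or Nothing if the file has no (recognisable) declaration.
declaredEncoding :: BL.ByteString -> Maybe String
declaredEncoding bytes
  | not ("<?xml" `isPrefixOf` header) = Nothing
  | otherwise =
      case [r | Just r <- map (stripPrefix "encoding=") (tails header)] of
        (q : rest) : _
          | q `elem` "\"'" -> Just (map toLower (takeWhile (/= q) rest))
        _ -> Nothing
  where
    -- The declaration is ASCII in the encodings handled here, so a
    -- byte-wise view of the first bytes is enough to read it.
    header = takeWhile (/= '>') (BLC.unpack (BL.take 200 bytes))

-- Hypothetical helper: pick a decoder based on the declared encoding and
-- fall back to lenient UTF-8 when nothing else is declared or recognised.
decodePackageXml :: BL.ByteString -> String
decodePackageXml bytes = TL.unpack (decoder bytes)
  where
    decoder = case declaredEncoding bytes of
      Just "iso-8859-1" -> decodeLatin1
      _                 -> decodeUtf8With lenientDecode
```

Falling back to UTF-8 keeps the behaviour of the current patches for files without a declaration.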
- status set to closed
Max! So sorry that it took me nearly a year to get around to this.
First, I merged your code, very cool, thank you so much for taking the time to work on it. Second, I added a unit test with your test case above.
I'd also like to say that I'm somewhat embarrassed by all of the String in this code. I think I really need to overhaul it to use Text in the future.
Also, this package now builds with stack and is in Stackage.
Thank you again
Thx for merging it, I will use 4.5 now :) Maybe I will find some time to support your rewrite.