Text Decoding – Progressing

  • 13 June 2010

The new document class is taking shape. I now have a prototype that can extract meta information from the filesystem (not that hard), detect or guess the encoding of plain text files (with the help of the ICU library), strip markup (by replacing all tags), and detect and convert PDF documents (with the help of the PoDoFo PDF library). Converting HTML to text is a bit of a hack in C++. Maybe it is easier to use XSLT and let an XML library do the transformation as a whole. In theory, HTML was built for this. However, I still need to strip the tags of unknown XML documents in case someone wants to index XML stuff.
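To illustrate the tag stripping, here is a minimal sketch of the "replace all tags" idea. It is not the prototype's actual code: the function name `strip_tags` is hypothetical, and I assume each tag is replaced by a single space so that words on either side of a tag do not run together. It knows nothing about comments, CDATA sections, entities, or a `>` inside an attribute value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: replaces every <...> run with one space.
// Text outside of tags is copied unchanged.
std::string strip_tags(const std::string& input) {
    std::string out;
    out.reserve(input.size());
    bool in_tag = false;
    for (char c : input) {
        if (c == '<') {
            // A tag starts: emit the replacement and skip until '>'.
            in_tag = true;
            out.push_back(' ');
        } else if (c == '>' && in_tag) {
            // The tag ends: resume copying text.
            in_tag = false;
        } else if (!in_tag) {
            out.push_back(c);
        }
    }
    return out;
}
```

Because the scanner only looks at angle brackets, it works the same way on HTML and on XML with unknown element names, which is the fallback case mentioned above. Collapsing the leftover runs of whitespace would be a separate step.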

I forgot to extract the meta information from the HTML header. RegEx++; Dublin Core or non-standard tags, that is the question.
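A rough sketch of what the regular expression approach could look like, using `std::regex` as a stand-in for whichever regex library ends up in the prototype. The function name `extract_meta` is hypothetical, and the pattern assumes double-quoted attributes with `name` written before `content`, so it will miss other attribute orders and single quotes. Dublin Core entries such as `DC.title` and non-standard names both land in the same map, which leaves the decision between them to the caller.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: collects name/content pairs from <meta> tags.
// Assumes the form <meta name="..." content="..."> (case-insensitive).
std::map<std::string, std::string> extract_meta(const std::string& html) {
    static const std::regex meta_re(
        R"re(<meta\s+name\s*=\s*"([^"]*)"\s+content\s*=\s*"([^"]*)")re",
        std::regex::icase);

    std::map<std::string, std::string> result;
    for (std::sregex_iterator it(html.begin(), html.end(), meta_re), end;
         it != end; ++it) {
        // Group 1 is the name attribute, group 2 the content attribute.
        result[(*it)[1].str()] = (*it)[2].str();
    }
    return result;
}
```

If the same name appears more than once, the last occurrence wins in this sketch. A `std::multimap` would keep all of them.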

