Seriously Debugging the Text Indexer Code

  • 28 February 2010

After feeling like I was wading through honey for the past few weeks, I finally got around to squashing some bugs in my text indexer code. The first one was the obligatory use of a null pointer in rare cases. I know, this should never happen. Found it, squashed it. Won’t happen again (I am pretty confident about this).

The next problem was a wrong string comparison when dealing with file extensions. Ignoring the “.” means that the extension “ps” also matches a file name ending in “props”. The latter is not a PostScript® file and cannot be indexed (well, it can be, but it shouldn’t). From now on the “.” is never ignored.
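To illustrate the extension bug, here is a minimal Python sketch. It is not the indexer's actual code; the function names naive_has_extension and has_extension are made up for this example, and the case-insensitive comparison is an assumption on my part.

```python
import os


def naive_has_extension(filename, ext):
    """Buggy variant: compares only the trailing characters, ignoring the dot."""
    return filename.lower().endswith(ext.lower())


def has_extension(filename, ext):
    """Fixed variant: the dot is part of the comparison.

    os.path.splitext keeps the leading dot, e.g. "build.props" -> ".props".
    """
    _, actual = os.path.splitext(filename)
    wanted = "." + ext.lower().lstrip(".")
    return actual.lower() == wanted


if __name__ == "__main__":
    for name in ("paper.ps", "build.props"):
        print(name, naive_has_extension(name, "ps"), has_extension(name, "ps"))
```

The naive variant accepts both file names, while the fixed one accepts only paper.ps.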

The test data consists of 3755 files. After filtering, 648 documents remain (file extensions .doc, .htm, .html, .odp, .ods, .odt, .ps, .pdf, .php, .rtf, .txt, .xml, .xls). The results are indexed by means of the PostgreSQL text index function. The resulting database has a table size of 488 kiB (23 MiB documents, 19 MiB text index). Indexing works fairly well so far. The database should be more than sufficient for testing the front end. I’ll probably have a go at the content of the two DVDs I ordered a couple of weeks ago. The DVDs contain 42914 files in 1106 directories, with a total size of over 8 GiB. Maybe I will publish the front end URL to the indexed Cryptome data. Let’s see.
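The filtering step can be sketched roughly like this in Python. Again, this is only an illustration and not the indexer itself: the names INDEXED_EXTENSIONS and select_documents are hypothetical, and lower-casing the extension before the lookup is an assumption.

```python
import os

# Extensions taken from the list above; the leading dot is part of each entry.
INDEXED_EXTENSIONS = {
    ".doc", ".htm", ".html", ".odp", ".ods", ".odt", ".ps",
    ".pdf", ".php", ".rtf", ".txt", ".xml", ".xls",
}


def select_documents(root):
    """Walk `root` and return (total file count, list of paths to index)."""
    total = 0
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            # splitext returns the extension including the dot, or "" if none.
            ext = os.path.splitext(name)[1].lower()
            if ext in INDEXED_EXTENSIONS:
                selected.append(os.path.join(dirpath, name))
    return total, selected


if __name__ == "__main__":
    total, selected = select_documents(".")
    print(f"{len(selected)} of {total} files selected for indexing")
```

Looking up the whole extension in a set keeps the check exact, so a file such as build.props is counted but never selected.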

