Code

Code, Apps and Design Principles

  • Posted on November 17, 2012 at 4:13 pm

You probably know the term eye candy or dancing pigs. It applies to applications („apps“) running on computing platforms. It especially applies to apps running on devices that can be thrown easily. Widgets, nice colours and dancing pigs beat a sound security design every time. Since this posting is not about bashing Whatsapp (it’s really about sarcasm), here’s a list of advice for app „developers“.

  • If you are in need of unique identifiers (UIDs), please always use information that can never be guessed and is very hard to obtain. Telephone numbers, names, e-mail addresses and device identifiers are closely guarded secrets which will never be disclosed, thus they are a very good choice.
  • If you are in the position of having to use easily guessable information for unique identifiers, make sure you scramble the information appropriately. For example you can use the MAC address of network devices; you just have to use an irreversible function on it. MD5(MAC) yields a result that can never be mistaken for a MAC address and cannot be reversed, so it is totally safe (see the sketch after this list).
  • Everything you cannot understand is very safe. This is why you should never take a closer look at algorithms. Once you understand them, the attacker will do so, too. Then your advantage is lost. Make sure you never know why you are selecting a specific algorithm regardless of its purpose.
  • Always try to invent your own protocols. Since every protocol in existence was invented by amateurs with too much time on their hands, you can always do better.
  • Never reuse code. Libraries are for lazy bastards who cannot code. Rewrite every function and framework from scratch and include it into your software.
  • Put the most work into the user interface. If it looks good, the rest of the code automatically becomes very secure and professional. This goes for encryption as well. Most encryption algorithms can easily be replaced by the right choice of colours.
  • Getting reverse engineered is never a problem if you explicitly forbid it in the terms of usage. Everyone reads and accepts this.
  • Aim for a very high number of users. Once you hit 100,000 or 1,000,000 users, your software will be so popular that no one will ever attack it, because, well, it’s too popular. Accept it as a fact; it’s too complicated to explain at this point.
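
(For the one reader who takes the second item seriously: below is a sketch of how long „totally safe“ MD5(MAC) survives. The MAC and vendor prefix are made up, and a real attacker additionally has to guess the exact string format that was hashed, but the search space remains a joke.)

```cpp
// Why MD5(MAC) is anything but irreversible: the vendor prefix (OUI) of a
// MAC is usually known or guessable, which leaves only 2^24 device suffixes
// to enumerate. The OUI and MAC below are made-up examples.
// Build with: g++ md5mac.cpp -lcrypto
#include <openssl/md5.h>   // deprecated in OpenSSL 3.0, but still available
#include <cstdio>
#include <cstring>

int main()
{
    // Pretend this is the app's "anonymised" identifier we captured.
    const char *secret_mac = "00:1a:2b:3c:4d:5e";
    unsigned char target[MD5_DIGEST_LENGTH];
    MD5(reinterpret_cast<const unsigned char *>(secret_mac),
        strlen(secret_mac), target);

    // Enumerate all 2^24 suffixes for the assumed OUI 00:1a:2b.
    for (unsigned long s = 0; s < (1UL << 24); ++s) {
        char candidate[18];
        snprintf(candidate, sizeof(candidate), "00:1a:2b:%02lx:%02lx:%02lx",
                 (s >> 16) & 0xff, (s >> 8) & 0xff, s & 0xff);
        unsigned char digest[MD5_DIGEST_LENGTH];
        MD5(reinterpret_cast<const unsigned char *>(candidate),
            strlen(candidate), digest);
        if (memcmp(digest, target, MD5_DIGEST_LENGTH) == 0) {
            printf("recovered MAC: %s\n", candidate);
            return 0;
        }
    }
    return 1;
}
```

On a single commodity core this loop finishes within seconds. Salting and a deliberately slow hash would raise the cost, but an identifier derived from guessable input stays guessable.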

Go and spread the word. I can’t wait to see more apps following these simple principles appear in app stores all over the world.

 

We Come in Peace – streaming now!

  • Posted on December 27, 2010 at 11:45 am

For the next four days we will spend our time in the office, alongside work, with the videos from 27C3. We have activated the guest WLAN, and a projector is running (hopefully showing the live streams from the Congress). The programme looks very promising. We are curious how well the streaming works this time. Last year it was stable enough for most of the talks.

Stream on!

Rain!

  • Posted on July 15, 2010 at 9:51 pm

Two pieces of good news: a) It’s raining. b) I dumped Xalan/Xerces in favour of libxslt from the GNOME project.

XML Parsers considered harmful

  • Posted on July 6, 2010 at 8:08 pm

I am fighting with Xalan and Xerces (in C++). After looking for decent tutorials (there are none) I found this little gem among the Google results. It clearly emphasises what I already know: XML parsers are from Hell. Xalan & Xerces are especially tricky since they’ve been ported from Java. The API is a bit weird, and some things contradict intuition. For example, if you initialise the transformation engines more than once per process run, the destructor of the XSLTInputSource crashes with SIGSEGV. You get no clue: you return() and, just as the objects go out of scope, your program crashes. The secret is hidden in the API documentation. And you cannot easily stop the XSLT transformer from downloading/accessing the document’s DTD. If you wish to ignore the DTD, you have to provide your own EntityResolver class that resolves all entities without it. Charming. Bureaucratic. Have I mentioned Java already?
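
For reference, the no-DTD trick looks roughly like this; a sketch assuming the Xerces-C++ SAX EntityResolver interface (handed to the parser via setEntityResolver(); the exact wiring depends on the Xalan/Xerces version):

```cpp
// Answer every external entity request (i.e. the DTD) with an empty
// in-memory buffer, so nothing is downloaded or even opened.
#include <xercesc/framework/MemBufInputSource.hpp>
#include <xercesc/sax/EntityResolver.hpp>
#include <xercesc/sax/InputSource.hpp>

class NoDTDResolver : public xercesc::EntityResolver {
public:
    xercesc::InputSource *resolveEntity(const XMLCh *const /*publicId*/,
                                        const XMLCh *const /*systemId*/)
    {
        // Hand back an empty document; the parser adopts and deletes it.
        static const XMLByte empty[] = { 0 };
        return new xercesc::MemBufInputSource(empty, 0, "empty");
    }
};
```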

[Image: Google search result for XML parser software, flagged with a malware warning. Caption: „XML considered harmful.“]

If you know a decent and light-weight XSLT transformation library, let me know. I just need to delete tags from HTML, XHTML and XML documents (which worked well with regular expressions before). The XSLT template is quite short, and the task isn’t very complicated.
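
In the meantime, here is what the whole task boils down to with libxml2/libxslt (the library the newer posting above switched to). A sketch, with the embedded stylesheet being my guess at the „quite short“ template: output method text plus the built-in template rules copy the text nodes and drop every tag.

```cpp
// Minimal sketch: strip all tags from an XML/XHTML document via libxslt.
// Build with: g++ strip.cpp $(pkg-config --cflags --libs libxslt)
#include <libxml/parser.h>
#include <libxslt/transform.h>
#include <libxslt/xsltutils.h>
#include <cstdio>
#include <cstring>

static const char *strip_xsl =
    "<?xml version='1.0'?>"
    "<xsl:stylesheet version='1.0' "
    "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
    "<xsl:output method='text'/>"
    "</xsl:stylesheet>";

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file.xml\n", argv[0]); return 1; }
    xmlInitParser();

    // Parse the stylesheet from the in-memory string above.
    xmlDocPtr xsl = xmlReadMemory(strip_xsl, (int)strlen(strip_xsl),
                                  "strip.xsl", NULL, 0);
    xsltStylesheetPtr style = xsltParseStylesheetDoc(xsl); // adopts xsl

    // Parse the input and apply the transformation. For tag-soup HTML you
    // would parse with libxml2's htmlReadFile() instead.
    xmlDocPtr doc = xmlReadFile(argv[1], NULL, 0);
    xmlDocPtr res = xsltApplyStylesheet(style, doc, NULL);
    xsltSaveResultToFile(stdout, res, style); // text-only result

    xmlFreeDoc(res);
    xmlFreeDoc(doc);
    xsltFreeStylesheet(style); // also frees the stylesheet doc
    xsltCleanupGlobals();
    xmlCleanupParser();
    return 0;
}
```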

O_NOATIME + NFS = EPERM

  • Posted on June 28, 2010 at 12:26 am

I met a surprise today. I am writing code that accesses a lot of files via NFSv4, stat()s them, possibly extracts content and writes stuff into a couple of databases. At some point in the debug/development cycle a stat() call returned Resource temporarily unavailable (a.k.a. EAGAIN and EWOULDBLOCK). I tried replacing stat() with lstat() and finally with fstat() in order to assert more control over the flags provided to open(). The combination O_RDONLY | O_SYNC | O_NOATIME changed EAGAIN into EPERM (Operation not permitted). Why is that? Well, here’s a hint.

The O_NOATIME flag was specified, but the effective user ID of the caller did not match the owner of the file and the caller was not privileged (CAP_FOWNER).

Correct. I changed the machine the test ran on. This turned the effective UID into something different (the NFS share showed the numerical UID 4294967294, i.e. (uint32_t)-2, which is apparently the NFS nobody mapping and not the UID of the development account). I’d never have expected this behaviour from the description in the man pages… despite the quoted sentence above… which is really part of man 2 open.
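
The pragmatic way out, sketched as a hypothetical helper (not the actual indexer code): request O_NOATIME opportunistically and retry without it when the kernel answers EPERM.

```cpp
// Open read-only with O_NOATIME if permitted, and degrade gracefully when
// the file belongs to someone else (EPERM, as documented in man 2 open).
// g++ defines _GNU_SOURCE by default, which O_NOATIME needs on glibc.
#include <fcntl.h>
#include <cerrno>

int open_for_indexing(const char *path)
{
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd == -1 && errno == EPERM)
        fd = open(path, O_RDONLY); // not our file: drop O_NOATIME only
    return fd;                     // -1 with errno set on real errors
}
```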

RTFM. Again.

Text Decoding – Progressing

  • Posted on June 13, 2010 at 9:27 pm

The new document class is taking shape. I now have a prototype that can extract meta information from the filesystem (not that hard), detect/guess the encoding of pure text files (with the help of the ICU library), strip markup (by replacing all tags) and detect/convert PDF documents (with the help of the PoDoFo PDF library). Converting HTML to text is a bit of a hack in C++. Maybe it is easier to use XSLT and let an XML library do the transformation as a whole; in theory, HTML was built for this. However, I still need to strip the tags of unknown XML documents in case someone wants to index XML stuff.
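
The detect/guess part is essentially ICU’s charset detector; a minimal sketch (the stock ucsdet API, not my actual document class):

```cpp
// Guess the encoding of a raw byte buffer with ICU's charset detector.
// The returned name (e.g. "UTF-8", "ISO-8859-1") feeds the converter later.
#include <unicode/ucsdet.h>
#include <string>

std::string guess_encoding(const char *buf, int32_t len)
{
    UErrorCode status = U_ZERO_ERROR;
    UCharsetDetector *det = ucsdet_open(&status);
    ucsdet_setText(det, buf, len, &status);
    const UCharsetMatch *match = ucsdet_detect(det, &status); // best match
    std::string name;
    if (U_SUCCESS(status) && match != NULL)
        name = ucsdet_getName(match, &status); // copy before closing
    ucsdet_close(det); // invalidates match
    return name;       // empty on failure
}
```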

I forgot to extract the meta information from the HTML header. RegEx++; Dublin Core or non-standard tags, that is the question.

Text Heaps, Code and Rain

  • Posted on May 30, 2010 at 3:40 pm

May went by faster than planned, so I saw less of the rain than others did. Apart from that, it was the most pleasant May in a long time. Anyone who knows the coolness of the Hessian forests is in for a very unpleasant surprise with Vienna’s desert climate. I prefer a solid 17°C to anything above 25°C, any time. Unfortunately 30°C is forecast for next week…

The code for indexing text heaps has grown and has been tested against corpora of 60,000+ documents. The findings led to the removal of several bugs. Anyone who still believes the debris on file servers consists of well-defined and accessible documents should urgently reconsider. File formats such as PDF, ODT, ODP or ODS are very accessible and can usually be converted into an indexable form. XLS and PostScript® follow close behind. With DOC it may well happen that the file is actually RTF text, only the file extension does not say so. Then there are DOC files that were filled via cut & paste with text in a strange encoding; after normalisation you end up with text that cannot be converted to UTF-8. Encoding is a big problem in general, since TXT and HTM(L) files rarely, if ever, state the encoding they use. This is exactly why web browsers carry code that guesses encodings.

File formats are the next problem. The indexer converts all interesting documents into plain text, since only plain text gets indexed. Command-line converters do not exist for every format. OOXML comes to mind immediately, closely followed by the proprietary e-book formats. Such formats currently fall through the cracks.

Hear ye, folks, and let me say: text formats make me despair. So far PDF, PostScript® and the OpenOffice formats are my favourites.

Text Adventures

  • Posted on May 18, 2010 at 1:31 pm

May is more densely packed than I expected. The week before last I spent at the Vienna Linuxwochen in the Old City Hall. GNU/Linux paired with baroque architecture is not something you see every day. For reasons of historical preservation there was only a wireless network. Now I am back to other problems and a lot of text – in the form of code and of actual text. I am testing CLucene and the PostgreSQL text indexer on documents from „real” life. The questions this raises are harder to answer than the documentation suggests.

First you have to get at the actual text. There is a heap of document formats – .doc, .pdf, .odt, .html, .xml, .rtf, .odp, .xls, .txt, .ps, … . These have to be normalised before they can be indexed. You need the pure text, the individual words, and nothing else. On top of that, the character encoding should be uniform. UTF-8 is the obvious choice, so you need a sufficient set of converters. Since some of the document formats are proprietary or simply badly designed, this is not a trivial task. I have found enough converters, but some are better than others. The tests on the document collections will show how good they really are.
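
The conversion step itself is the easy part once the source encoding is known; a sketch with ICU’s converter API (a plain ucnv_convert() call, not my eventual class):

```cpp
// Convert a raw buffer from a detected source encoding into UTF-8 with ICU.
#include <unicode/ucnv.h>
#include <string>
#include <vector>

std::string to_utf8(const std::string &raw, const char *src_encoding)
{
    UErrorCode status = U_ZERO_ERROR;
    // Worst case: every input byte expands to a multi-byte UTF-8 sequence.
    std::vector<char> out(raw.size() * 4 + 4);
    int32_t n = ucnv_convert("UTF-8", src_encoding,
                             &out[0], (int32_t)out.size(),
                             raw.data(), (int32_t)raw.size(), &status);
    if (U_FAILURE(status))
        return std::string(); // e.g. the cut & paste DOC case mentioned above
    return std::string(&out[0], n);
}
```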

Then there is language. Indexing reduces the words in a text to their stem form and removes stop words; both depend on the document’s language. Unfortunately the language is not supplied as meta information in all formats, so it has to be determined. For that one can turn to the publication N-Gram-Based Text Categorization, or enlist one of its implementations. What happens with mixed-language texts?
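
For the record, a toy sketch of the idea from that paper: rank-ordered character trigram profiles compared by an out-of-place distance. Real implementations pad tokens, use several n-gram lengths and handle UTF-8 properly; this only shows the principle.

```cpp
#include <algorithm>
#include <cstdlib>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Build the rank-ordered trigram profile of a text (most frequent first).
static std::vector<std::string> profile(const std::string &text,
                                        size_t top = 300)
{
    std::map<std::string, long> freq;
    for (size_t i = 0; i + 3 <= text.size(); ++i)
        ++freq[text.substr(i, 3)];                  // count trigrams
    std::vector<std::pair<long, std::string> > by_freq;
    for (std::map<std::string, long>::const_iterator it = freq.begin();
         it != freq.end(); ++it)
        by_freq.push_back(std::make_pair(it->second, it->first));
    std::sort(by_freq.rbegin(), by_freq.rend());    // descending frequency
    std::vector<std::string> ranked;
    for (size_t i = 0; i < by_freq.size() && i < top; ++i)
        ranked.push_back(by_freq[i].second);
    return ranked;
}

// Out-of-place distance: sum of rank differences, maximum penalty for misses.
static long distance(const std::vector<std::string> &doc,
                     const std::vector<std::string> &lang)
{
    long d = 0;
    for (size_t i = 0; i < doc.size(); ++i) {
        std::vector<std::string>::const_iterator it =
            std::find(lang.begin(), lang.end(), doc[i]);
        if (it == lang.end())
            d += (long)lang.size();
        else
            d += std::labs((long)(it - lang.begin()) - (long)i);
    }
    return d; // the language profile with the smallest distance wins
}
```

Mixed-language texts are exactly where this breaks down: the document profile sits between two language profiles, so you either split the text into chunks and classify per chunk, or accept the dominant language.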

The list is long. The code is C++, and what I am missing is a nice, extensible class that reads documents, normalises them to UTF-8 and pure text, and extracts some meta information. So far I have found nothing I would want to use, so I will have a go at it myself. HTML and XML I can already normalise. For PDF I recommend the excellent PoDoFo library (which will be getting donations from me). For the rest I am still searching.

Speaking of words: does anyone know the language Yup’ik? It is spoken by very few Inuit in Alaska and thereabouts. It has words for which other languages would need whole sentences. For example, Kaipiallrulliniuk means roughly: „The two of them were apparently really hungry.” Fascinating.

Numerical Cat

  • Posted on April 2, 2010 at 3:45 pm
[Image: Our cat Lucky sitting on the third edition of „Numerical Recipes“. Caption: „Numerical cat.“]

I am currently coding a little client/server tool that reports log incidents via encrypted e-mail to a master server. After a night of wading through code and implementing the RSA/AES encryption and decryption, I woke up, saw our cat on the bed and wondered whether cats also master the art of encryption. I was too tired to shake off the thought, so I asked myself for minutes: „Are cats capable of handling RSA, and if so, what’s the maximum key size they can manage?” Lucky wasn’t impressed. She wanted food.
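
For anyone curious about the scheme itself (the cat was not): this follows the usual hybrid pattern. What follows is only a sketch with OpenSSL’s envelope API, not the tool’s actual code; a random AES session key encrypts the report, and only that key is RSA-encrypted for the recipient, who later undoes it with EVP_OpenInit() and the private key.

```cpp
#include <openssl/evp.h>
#include <vector>

// Hybrid-encrypt a report for one RSA recipient: EVP_SealInit() generates a
// random AES-256 session key, encrypts it with the recipient's public key
// (enc_key) and produces a fresh IV. Returns false on OpenSSL errors.
bool seal_report(EVP_PKEY *pubkey,
                 const std::vector<unsigned char> &plain,
                 std::vector<unsigned char> &enc_key,
                 std::vector<unsigned char> &iv,
                 std::vector<unsigned char> &cipher)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    enc_key.resize(EVP_PKEY_size(pubkey));
    iv.resize(EVP_MAX_IV_LENGTH);
    unsigned char *ek = &enc_key[0];
    int ekl = 0;

    if (EVP_SealInit(ctx, EVP_aes_256_cbc(), &ek, &ekl, &iv[0],
                     &pubkey, 1) != 1) {
        EVP_CIPHER_CTX_free(ctx);
        return false;
    }
    enc_key.resize(ekl); // the RSA-encrypted session key

    cipher.resize(plain.size() + EVP_MAX_BLOCK_LENGTH);
    int len = 0, total = 0;
    EVP_SealUpdate(ctx, &cipher[0], &len, &plain[0], (int)plain.size());
    total += len;
    EVP_SealFinal(ctx, &cipher[0] + total, &len);
    total += len;
    cipher.resize(total);

    EVP_CIPHER_CTX_free(ctx);
    return true;
}
```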

Today she took advantage of the Numerical Recipes book lying on the table. Usually she lies on the floor, taking her daily sunbath. Not this time: as you can see from the image, she is pondering a very difficult numerical problem. Feel free to add the standard LOLCAT text. I have the picture in bigger resolutions, too.

Thoughts about fsync() and caching

  • Posted on March 19, 2010 at 10:54 pm

I am currently going through material from a talk about caching and how developers (or sysadmins) reliably get data from memory to disk. I found this gem that I want to share with you.

fsync on Mac OS X: Since on Mac OS X the fsync command does not make the guarantee that bytes are written, SQLite sends a F_FULLFSYNC request to the kernel to ensures that the bytes are actually written through to the drive platter. This causes the kernel to flush all buffers to the drives and causes the drives to flush their track caches. Without this, there is a significantly large window of time within which data will reside in volatile memory — and in the event of system failure you risk data corruption.

It’s from the old Firefox-hangs-for-30-seconds-on-some-systems problem, described in the fsyncers and curveballs posting. Did you catch the first sentence? „Since on Mac OS X the fsync command does not make the guarantee that bytes are written”. This is a nice one, especially if programmers think that fsync() really flushes the buffers all the way to disk. It doesn’t always do that. And in case you want to be deprived of sleep, go and read the wonderful presentation titled Eat My Data. It’s worth it.
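
If you need durability on Mac OS X, the SQLite-style workaround boils down to a few lines; a sketch:

```cpp
// Ask for F_FULLFSYNC where it exists (Mac OS X) and fall back to plain
// fsync() when it fails (e.g. unsupported filesystem); everywhere else
// fsync() is the best a portable program can do.
#include <fcntl.h>
#include <unistd.h>

int flush_to_disk(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    // F_FULLFSYNC failed: degrade to fsync() below.
#endif
    return fsync(fd);
}
```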

Seriously Debugging the Text Indexer Code

  • Posted on February 28, 2010 at 4:55 pm

After feeling like wading through honey during the past weeks I finally got around to squashing some bugs in my text indexer code. The first one was the obligatory use of a null pointer in rare cases. I know, this should never happen. Found it, squashed it. Won’t happen again (I am pretty confident about this).

The next problem was a wrong string comparison when dealing with file extensions. Ignoring the “.” leads to a match between “ps” and “props”. The latter is no PostScript® file and cannot be indexed (well, it can be, but it shouldn’t be). The “.” is from now on never ignored.
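
The fixed comparison is essentially an anchored suffix match including the dot; roughly:

```cpp
// Match the extension including the dot, anchored at the end of the file
// name, so "props" no longer passes as "ps".
#include <string>

bool has_extension(const std::string &name, const std::string &ext /* ".ps" */)
{
    return name.size() > ext.size() &&
           name.compare(name.size() - ext.size(), ext.size(), ext) == 0;
}
```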

The test data consists of 3755 files. After filtering, 648 documents remain (file extensions .doc, .htm, .html, .odp, .ods, .odt, .ps, .pdf, .php, .rtf, .txt, .xml, .xls). The results are indexed by means of PostgreSQL’s text indexing functions. The resulting database has a table size of 488 kiB (23 MiB documents, 19 MiB text index). Indexing works fairly well so far, and the database should be more than sufficient for testing the front end. I’ll probably have a go at the content of the two Cryptome.org DVDs I ordered a couple of weeks ago. Together the DVDs contain 42914 files in 1106 directories, more than 8 GiB in total. Maybe I’ll publish the front end URL for the indexed Cryptome data. Let’s see.

The Joy of High Level Languages

  • Posted on February 16, 2010 at 6:06 pm

When programming you should use a high-level programming language. This is important since you never again have to deal with the intricacies of the platform you are working on. Coding becomes paradise. And the Earth is flat, and pigs can fly. I’ve spent over ten days tracking down why Awstats stopped updating the web statistics. The configuration was copied from the old server, as were all the logs, the previous configurations, everything. Yet Awstats would not generate new statistics.

Finally I found an inconspicuous line in the logs: „Warning: Error while retrieving hashfile: Byte order is not compatible at ../../lib/Storable.pm”. It’s just a warning, so it’s nothing to worry about, right? And since we use a high-level language, surely the change from 32-bit to 64-bit Debian cannot make a difference, right? We code at a high level; we do not deal with byte orders and other worldly stuff anymore. We are enlightened. And obviously we are fucked. Thanks to a hint on a blog somewhere, the web statistics are working again.

I will continue my text indexer project today. It’s written in C++.

We have Dragons in the office!

  • Posted on December 28, 2009 at 9:21 pm

We’re sitting in the office watching the streams from 26C3. Now that’s what I call cool! The streams are quite stable (except during the rush hours).

Speaking of dragons, I just upgraded the main virtualisation server to Linux kernel 2.6.32.2 and qemu-kvm 0.12.1.1. Hooray! In addition, the main web server was upgraded from Debian 4.0 to Debian 5.0. It worked like a charm! That’s what I like about Debian.

Back to the dragons! Shhhh!
