Posts tagged with 'Development'

Growing up in a Hacker Space without knowing about it

  • Posted on April 11, 2012 at 2:26 am

This blog posting is a bit different from all the others. Usually it’s more about sarcasm or bashing things or people. Today it is the complete opposite. If you look at the tag cloud and have read some postings (or tweets for that matter) you have probably realised that I do some hacking behind the scenes. Let’s call it tinkering with technology. Basically I learnt a lot because my family allowed me to learn and to develop skills. Let me tell you what this was like.

I’ve always been the curious type. I constantly tried to figure out how things work, even as a child. Most children do that, but I started taking gadgets apart very early. The curiosity grew intense. My parents and grandparents forbade me to open any household gadget that was new or still in use. Back in those days appliances were repaired, not replaced. So one of my chances to get a peek inside was to wait until something broke and a repairman (be it an electrician, a plumber or a heating contractor) came to our house. I was happy whenever our TV set was broken, because I got a look inside and could observe what the electrician did. I always kept the circuit diagrams of our devices although I couldn’t read them properly yet. Those were part of the manual (I grew up in the age before „intellectual property“ was invented out of thin air; people were still allowed to repair their own possessions back then).

My family recognised my curiosity. I got lots of books. I read them. My grandfather also gave his support by buying science kits for me. One chemistry set, two physics sets and countless electronic kits found their way to our home. I had lots of electronic components ranging from transistors, coils, transformers, capacitors, LEDs (yes, only the red ones) and LED displays to a cathode ray tube, a 10 MHz oscilloscope, a soldering iron, cable and countless other items. First I built the experiments according to the manual (building test circuits up to sound generators, radios and even a simple black/white TV set), then I started to try my own ideas. I could even use my grandfather’s workshop in the basement. He was a mechanic, and his workshop had everything – screwdrivers of any size, power drills, files, soldering lamps, paint, solvent, pieces of metal, pipes, really anything. And I could use all of these tools whenever I wanted to.

One Christmas day (I guess it was 1984) the electronic kit collection turned digital. My grandfather gave me a BUSCH Elektronik Microtronic Computer-System 2090 with a 4-bit TMS 1600 CPU at its core. 4096 bytes of ROM, 64 + 512 bytes of RAM, 40 assembler instructions and 12 commands on a console consisting of 26 keys and a 6-digit LED display greatly enhanced the capabilities of my little lab. I started coding. The series of presents from my grandfather continued with a Commodore C64, a C128 and an Amiga 500/2000/4000, not to forget the HP48 calculator I used at university.

I am not writing this down to brag about it; I am well aware that not everyone has been lucky enough to have a family like this. The point is this: even when my grandfather gave me the electronic kits, he did not understand what I was doing with them. He had a basic understanding of electricity and could fix electrical wiring in the house, but he never did more complex things. He was a master mechanic; he could build anything out of wood or metal. Despite having no interest in or knowledge of electronics and computing he tried to help me with my education. Growing up with books, hardware, software and a workshop – and with an environment that actively supports curiosity – is one of the best things that can happen to you. That’s what a hacker space is – the best that can happen to you. Cherish it! Support it! Improve it! Create it if it doesn’t exist! And always put the tools back where they belong! My grandfather told me this over and over.

Sadly I cannot thank my grandfather any more. He died a couple of years ago. He would have turned 90 today.

If you want to do him a favour, then please create something or understand the workings of Nature. He would have liked it.

XML Parsers considered harmful

  • Posted on July 6, 2010 at 8:08 pm

I am fighting with Xalan and Xerces (in C++). After looking for decent tutorials (there are none) I found this little gem among the Google results. It clearly emphasises what I already know: XML parsers are from Hell. Xalan & Xerces are especially tricky since they’ve been ported from Java. The API is a bit weird. Some things contradict intuition. For example, if you initialise the transformation engines more than once per process run, the destructor of the XSLTInputSource crashes with SIGSEGV. You get no clue. You return() and, just as the objects go out of scope, your program crashes. The secret is hidden in the API documentation. And you cannot easily stop the XSLT transformer from downloading/accessing the document’s DTD. If you wish to ignore the DTD, you have to provide your own EntityResolver class that resolves all entities without it. Charming. Bureaucratic. Have I mentioned Java already?
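For whoever stumbles over the same two traps, here is a minimal sketch of the workarounds, written from memory of the Xerces-C++/Xalan-C++ headers (so verify names like setEntityResolver against your version before trusting it): initialise and terminate both engines exactly once per process, and hand the transformer an EntityResolver that answers every request with an empty document instead of fetching the DTD.

    #include <xercesc/framework/MemBufInputSource.hpp>
    #include <xercesc/sax/EntityResolver.hpp>
    #include <xercesc/util/PlatformUtils.hpp>
    #include <xalanc/XalanTransformer/XalanTransformer.hpp>

    // Answer every entity request with an empty in-memory document, so the
    // transformer never goes near the real DTD.
    class NullEntityResolver : public xercesc::EntityResolver {
    public:
        xercesc::InputSource* resolveEntity(const XMLCh* /*publicId*/,
                                            const XMLCh* /*systemId*/)
        {
            static const char empty[] = "";
            return new xercesc::MemBufInputSource(
                reinterpret_cast<const XMLByte*>(empty), 0, "empty");
        }
    };

    int main()
    {
        // Initialise both engines exactly once per process...
        xercesc::XMLPlatformUtils::Initialize();
        xalanc::XalanTransformer::initialize();

        {
            xalanc::XalanTransformer transformer;
            NullEntityResolver noDtd;
            transformer.setEntityResolver(&noDtd);
            // ... build XSLTInputSource objects and call
            // transformer.transform(...) here; they must go out of scope
            // before the terminate() calls below, or the destructors crash.
        }

        // ...and tear them down exactly once, after the last object is gone.
        xalanc::XalanTransformer::terminate();
        xercesc::XMLPlatformUtils::Terminate();
        return 0;
    }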

A Google result for XML parser software, complete with a malware warning.

XML considered harmful.

If you know a decent and light-weight XSLT transformation library, let me know. I just need to delete tags from HTML, XHTML and XML documents (which worked well with regular expressions before). The XSLT template is quite short, and the task isn’t very complicated.
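Just to illustrate how low the bar is, this is roughly what the regex approach amounts to, translated into today’s C++ (std::regex did not exist back when regular expressions did the job for me, so consider this a sketch):

    #include <regex>
    #include <string>

    // Delete anything that looks like a tag. Crude, but when "delete tags"
    // is the whole requirement, crude is fine.
    std::string strip_tags(const std::string& markup)
    {
        static const std::regex tag("<[^>]*>");
        return std::regex_replace(markup, tag, " ");
    }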

O_NOATIME + NFS = EPERM

  • Posted on June 28, 2010 at 12:26 am

I ran into a surprise today. I am writing code that accesses a lot of files via NFSv4, stat()s them, possibly extracts content and writes stuff into a couple of databases. At some point in the debug/development cycle a stat() call returned Resource temporarily unavailable (a.k.a. EAGAIN and EWOULDBLOCK). I tried replacing stat() with lstat() and finally with fstat() in order to assert more control over the flags provided to open(). The combination O_RDONLY | O_SYNC | O_NOATIME changed EAGAIN into EPERM (Operation not permitted). Why is that? Well, here’s a hint.

The O_NOATIME flag was specified, but the effective user ID of the caller did not match the owner of the file and the caller was not privileged (CAP_FOWNER).

Correct. I had changed the machine the test ran on. This turned the effective UID into something different (the NFS share showed the numerical 4294967294, which is not the UID of the development account). I would never have expected this behaviour from the description in the man pages… despite the quoted sentence above… which really is part of man 2 open.
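The defensive move is obvious in hindsight. A small hypothetical helper (not my actual indexer code) that retries without O_NOATIME when the kernel answers EPERM:

    #define _GNU_SOURCE   /* for O_NOATIME */
    #include <errno.h>
    #include <fcntl.h>

    // Try to open without updating the atime; if the kernel refuses with
    // EPERM (we are neither the file's owner nor privileged with
    // CAP_FOWNER), retry without O_NOATIME and accept the atime update.
    int open_for_indexing(const char *path)
    {
        int fd = open(path, O_RDONLY | O_SYNC | O_NOATIME);
        if (fd == -1 && errno == EPERM)
            fd = open(path, O_RDONLY | O_SYNC);
        return fd;
    }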

RTFM. Again.

Text Decoding – Progressing

  • Posted on June 13, 2010 at 9:27 pm

The new document class is taking form. I now have a prototype that can extract meta information from the filesystem (not that hard), detect/guess the encoding in the case of pure text files (with the help of the ICU library), strip markup language (by replacing all tags) and detect/convert PDF documents (with the help of the PoDoFo PDF library). Converting HTML to text is a bit of a hack in C++. Maybe it is easier to use XSLT and let an XML library do the transformation as a whole. In theory HTML was built for this. However, I still need to strip the tags of unknown XML documents in case someone wants to index XML stuff.
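For the curious: the encoding guessing boils down to ICU’s charset detector. A minimal sketch of the idea using the ucsdet API (guess_encoding is a made-up helper; the real prototype does more bookkeeping):

    #include <unicode/ucsdet.h>
    #include <string>

    // Guess the charset of a raw byte buffer. Returns ICU's best guess
    // (e.g. "UTF-8", "ISO-8859-1") or an empty string when detection fails.
    std::string guess_encoding(const char* buf, int32_t len)
    {
        UErrorCode status = U_ZERO_ERROR;
        UCharsetDetector* det = ucsdet_open(&status);
        ucsdet_setText(det, buf, len, &status);
        const UCharsetMatch* match = ucsdet_detect(det, &status);

        std::string name;
        if (U_SUCCESS(status) && match != 0)
            name = ucsdet_getName(match, &status);

        ucsdet_close(det);   // invalidates match, hence the copy above
        return name;
    }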

I forgot to extract the meta information from the HTML header. RegEx++; Dublin Core or non-standard tags, that is the question.


Text Adventures

  • Posted on May 18, 2010 at 1:31 pm

May is more densely packed than I thought. I spent the week before last at the Vienna Linuxwochen in the Old City Hall. GNU/Linux paired with Baroque architecture is not something you see every day. For reasons of monument protection there was only a wireless network. Now I am devoting myself to other problems again, and to a lot of text – in the form of code and of actual text. I am testing CLucene and the PostgreSQL text indexer on documents from „real” life. The questions this raises are harder to answer than the documentation would suggest.

First of all you have to get at the actual text. There is a heap of document formats – .doc, .pdf, .odt, .html, .xml, .rtf, .odp, .xls, .txt, .ps, … . These have to be normalised before they can be indexed. You need the pure text, the individual words, and nothing else. On top of that the character encoding should be uniform. UTF-8 is the obvious choice, so you have to have enough converters. Since some of the document formats are proprietary or simply badly designed, this is not a trivial task. I have found enough converters, but some are better than others. The tests on the document collections will show how good they really are.
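For the well-behaved formats the conversion itself is the easy part. A minimal sketch with POSIX iconv, assuming the source encoding is already known (to_utf8 is a hypothetical helper, not one of the converters I tested):

    #include <iconv.h>
    #include <string>
    #include <vector>

    // Convert a buffer from a known source encoding to UTF-8. Returns an
    // empty string on failure. The 4x output buffer is generous enough for
    // the single- and double-byte encodings that matter here.
    std::string to_utf8(const std::string& in, const char* from_encoding)
    {
        iconv_t cd = iconv_open("UTF-8", from_encoding);
        if (cd == (iconv_t)-1)
            return std::string();   // encoding unknown to iconv

        std::vector<char> buf(in.size() * 4 + 16);
        char* inp = const_cast<char*>(in.data());
        size_t inleft = in.size();
        char* outp = &buf[0];
        size_t outleft = buf.size();

        std::string out;
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            out.assign(&buf[0], buf.size() - outleft);

        iconv_close(cd);
        return out;
    }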

Then there is language. Indexing text reduces the words in it to their stem form and removes stop words. Both depend on the language of the document. Unfortunately, not all formats carry the language as meta information. So you have to determine it. For that you can turn to the publication N-Gram-Based Text Categorization, or use one of its implementations. And what happens with mixed-language texts?
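To make the idea concrete, here is a toy version of the out-of-place measure from that publication; real implementations pad tokens and train proper per-language profiles, so treat this strictly as a sketch:

    #include <algorithm>
    #include <cstdlib>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<std::string, long> Entry;

    // Build the ranked n-gram profile of a text: the `top` most frequent
    // n-grams, most frequent first.
    std::vector<std::string> ngram_profile(const std::string& text,
                                           size_t n = 3, size_t top = 300)
    {
        std::map<std::string, long> freq;
        for (size_t i = 0; i + n <= text.size(); ++i)
            ++freq[text.substr(i, n)];

        std::vector<Entry> ranked(freq.begin(), freq.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const Entry& a, const Entry& b)
                  { return a.second > b.second; });

        std::vector<std::string> profile;
        for (size_t i = 0; i < ranked.size() && i < top; ++i)
            profile.push_back(ranked[i].first);
        return profile;
    }

    // "Out-of-place" distance between a document profile and a language
    // profile: sum of rank differences, with a maximum penalty for n-grams
    // the language profile does not contain. Smallest distance wins.
    long out_of_place(const std::vector<std::string>& doc,
                      const std::vector<std::string>& lang)
    {
        long d = 0;
        for (size_t i = 0; i < doc.size(); ++i) {
            std::vector<std::string>::const_iterator it =
                std::find(lang.begin(), lang.end(), doc[i]);
            d += (it == lang.end())
                     ? static_cast<long>(lang.size())
                     : std::labs(static_cast<long>(i) - (it - lang.begin()));
        }
        return d;
    }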

The list is long. The code is C++, and what I am missing is a nice, extensible class that reads documents, normalises them to UTF-8 and pure text, and extracts some meta information. So far I have found nothing I would want to use. I will have a go at it myself. HTML and XML I can already normalise. For PDF I recommend the excellent PoDoFo library (which will be getting donations from me). For the rest I am still looking.

Speaking of words: does anyone know the language Yup’ik? It is spoken by very few Inuit in Alaska and thereabouts. It has words for which other languages would need whole sentences. For example, Kaipiallrulliniuk means roughly: „The two of them were apparently really hungry.” Fascinating.

Captain Steve Practises Newspeak

  • Posted on April 30, 2010 at 1:18 pm

Captain Steve™ has publicly demonised Adobe®. He wrote about Adobe® Flash, and it sounds like this.

Adobe’s Flash products are 100% proprietary. They are only available from Adobe, and Adobe has sole authority as to their future enhancement, pricing, etc. While Adobe’s Flash products are widely available, this does not mean they are open, since they are controlled entirely by Adobe and available only from Adobe. By almost any definition, Flash is a closed system.

What he actually wanted to say was this:

Apple’s products are 100% proprietary. They are only available from Apple, and Apple has sole authority as to their future enhancement, pricing, etc. While Apple’s products are widely available, this does not mean they are open, since they are controlled entirely by Apple and available only from Apple. By almost any definition, Apple’s iPad, iPhone and soon OS X is a closed system.

I think somebody simply mixed something up in that statement. Seeing that Apple has now banned C# and other tools by way of its licensing terms, all that openness cannot amount to much.

Update: El Reg has a very fitting analysis of the outburst. It is purely about a mania for control; Captain Steve has nothing to do with open platforms.

Thoughts about fsync() and caching

  • Posted on March 19, 2010 at 10:54 pm

I am currently reading material from a talk about caching and about how developers (or sysadmins) reliably get data from memory to disk. I found this gem that I want to share with you.

fsync on Mac OS X: Since on Mac OS X the fsync command does not make the guarantee that bytes are written, SQLite sends a F_FULLFSYNC request to the kernel to ensure that the bytes are actually written through to the drive platter. This causes the kernel to flush all buffers to the drives and causes the drives to flush their track caches. Without this, there is a significantly large window of time within which data will reside in volatile memory — and in the event of system failure you risk data corruption.

It’s from the old Firefox-hangs-for-30-seconds-on-some-systems problem, described in the fsyncers and curveballs posting. Did you catch the first sentence? „Since on Mac OS X the fsync command does not make the guarantee that bytes are written”. This is a nice one, especially if programmers think that fsync() really flushes some buffers. It doesn’t always do that. And in case you want to be deprived of sleep, go and read the wonderful presentation titled Eat my data. It’s worth it.
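For reference, the dance SQLite does can be wrapped in a few lines. A sketch (full_sync is a made-up name) of a portable flush routine:

    #include <fcntl.h>
    #include <unistd.h>

    // Flush file data as far towards the platter as the OS lets us:
    // F_FULLFSYNC on Mac OS X (flushes the drive's write cache too),
    // plain fsync() everywhere else or when F_FULLFSYNC fails.
    int full_sync(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC, 0) != -1)
            return 0;
        // F_FULLFSYNC is not supported on all filesystems; fall through.
    #endif
        return fsync(fd);
    }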

Seriously Debugging the Text Indexer Code

  • Posted on February 28, 2010 at 4:55 pm

After feeling like I was wading in honey during the past weeks I finally got around to squashing some bugs in my text indexer code. The first one was the obligatory use of a null pointer in rare cases. I know, this should never happen. Found it, squashed it. Won’t happen again (I am pretty confident about this).

The next problem was a wrong string comparison when dealing with file extensions. Ignoring the “.” leads to “ps” matching “props”. The latter is no PostScript® file and cannot be indexed (well, it can be, but it shouldn’t). From now on the “.” is never ignored.
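In essence the corrected comparison looks at the full suffix including the dot. A hypothetical reconstruction, not the literal indexer code:

    #include <string>

    // Compare the full suffix *including* the dot, so that "props" no
    // longer passes as a ".ps" file.
    bool has_extension(const std::string& filename, const std::string& ext)
    {
        // ext must include the leading dot, e.g. ".ps"
        return filename.size() >= ext.size() &&
               filename.compare(filename.size() - ext.size(),
                                ext.size(), ext) == 0;
    }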

The test data consists of 3755 files. After filtering, 648 documents remain (file extensions .doc, .htm, .html, .odp, .ods, .odt, .ps, .pdf, .php, .rtf, .txt, .xml, .xls). The results are indexed by means of the PostgreSQL text index function. The resulting database has a table size of 488 kiB (23 MiB documents, 19 MiB text index). Indexing works fairly well so far. The database should be more than sufficient for testing the front end. I’ll probably have a go at the content of the two Cryptome.org DVDs I ordered a couple of weeks ago. Both DVDs contain 42914 files in 1106 directories. The total size is over 8 GiB. Maybe I’ll publish the front end URL to the indexed Cryptome data. Let’s see.


We have Dragons in the office!

  • Posted on December 28, 2009 at 9:21 pm

We’re sitting in the office watching the streams from 26C3. Now that’s what I call cool! The streams are quite stable (except during the rush hours).

Speaking of dragons, I just upgraded the main virtualisation server to Linux kernel 2.6.32.2 and qemu-kvm 0.12.1.1. Hooray! In addition, the main web server was upgraded from Debian 4.0 to Debian 5.0. It worked like a charm! That’s what I like about Debian.

Back to the dragons! Shhhh!

Opera 10 – the Meaning of Beta

  • Posted on August 24, 2009 at 9:30 pm

OK, I confess, I use proprietary software now and then. Besides Opera that is mostly killer games, such as Windows XP, Adobe Acrobat Reader or Doom 3 (they all tempt you to kill; Opera does too, by the way). Since Opera has been getting on my nerves for quite a while, I finally checked for updates. Up to now I have been using version 10.00.4402.gcc4.qt3. The package was a .deb file for Debian Lenny. Curious as I am, I ran aptitude show opera (before the upgrade). The last line of the output reads:

The binaries were built on Fedora Core 4 (Stentz) using gcc-4.0.0.

Fine. Today is 24 August 2009. The release notes for Fedora Core 4 date from 2005. GCC 4.0.0 was released on 20 April 2005 (likewise according to its release notes). Opera 10 beta is new. I do understand that software is sometimes developed in an orderly fashion, that there are rollouts, that billions of unknown clients want to be supported, that old hardware is in use, that unexplored legacy systems (looked after by sysadmins with whip and hat) have to run the code, and that the odd goat has to give its life so that the last Heisenbugs can be found. But why, of all things, run the source through the compiler from the antique section of the wine rack? Even Debian has moved past GCC 4.0.0 (Lenny ships 4.3.2). What exactly is being tested in Opera’s beta version? Then again, maybe the .deb just came off a developer’s workstation and was squeezed out on the side.

By the way, I have just installed 10.00.4537.gcc4.qt3 (also as a .deb for Lenny). If I repeat the procedure, only the following text remains:

The binaries were built using gcc 4.

The final version will probably just say „The binaries were built”. I would be embarrassed too.


Frameworks – waste your day

  • Posted on August 3, 2009 at 11:50 pm

Everyone has a framework or a toolbox to tackle problems with. In an ideal world this simplifies the task at hand. It helps you focus on the problem. It saves you tedious lines of code and functions everyone has written before. My task is to write a simple web application where you basically edit database tables. It’s nothing fancy. It needs authentication and PostgreSQL support. So I checked out PHP Frameworks and started testing.

The site presents a nice table. You can easily see which framework supports which feature. Some are very cool. Others have very nice features. The problem is that I don’t need what they offer. They are too big, too complex. I tried six frameworks and deleted them again. Half of them had lousy tutorials. Almost all of them lacked an easy-to-understand framework for authentication (I have the database design ready; I don’t need authentication libraries that have their own idea of the user account backend). Not a single framework enabled me to hack away and create the first pages of the application quickly.

So I guess it’s back to my own libraries and a template engine again. Smarty and pure PHP5 rock. Too bad.

Code Breakfast

  • Posted on April 24, 2009 at 10:51 am

regexp PCRE CPAN GC G1 WLAN USB UMTS HSDPA GPS CLF vector #include “rule_parser.h” main() void NULL INSERT DELETE SELECT FROM VACUUM UPDATE MVCC DWH RDBM TokyoCabinet DBM NDBM QDBM BerkeleyDB SQL WAL TCP UUID Blum-Blum-Shub ISO OpenMP g++ #pragma POSIX HTTPS SSL TLS AES 3DES SHA1 MD5 MD6 OpenSSL /dev/random HEAP RBL DNS CMDBA DWH BI IMAPv4 POP3 REST SOAP ICP PHPSESSID Cookie SHM KRB5 EHLO SMTP STARTTLS AUTH URL XSS CSS XSRF pattern core RSS TLB L2 VPN PEM key value SPK DKIM RSA DNS TXT A PTR SOA return(0);

The devil is in the details.

  • Posted on February 24, 2009 at 1:08 pm

Yes, that’s how it is. I already mentioned the email interface to this blog. It basically works. The postings are imported into the blog software. The problem is that my email client usually wraps lines that are longer than 77 characters. The blog software interprets the wrap as a carriage return. This means that I have to disable wrapping in order to get a free-flowing paragraph in the blog. Looks awful, but works. Never mind, it’s just a minor detail.

Text encoding is a passion of mine. I like properly encoded text. When programming I pay attention to text conversion and to using a consistent encoding across all parts of my code. Whenever I use special characters outside the US-ASCII range in the subjects of my blog emails, these special characters lead to a premature end of the title. Why is that? Due to the legacy of email standards no part of the email header may contain a character with its 8th bit set. Email clients mark specially encoded strings by adding the encoding used and some escape characters (it looks a bit like this: „=?iso-8859-15?Q?KERZENST=C4NDER?=”). That should be no problem, but apparently it is. Never mind, it’s just a minor detail.
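For the curious, decoding the Q-encoded part of such an encoded-word is simple enough; a toy sketch (decode_q is made up, and the charset conversion that must follow is omitted):

    #include <string>

    // Decode the Q-encoded payload of an RFC 2047 encoded-word. The result
    // is still in the charset named in the header (iso-8859-15 above);
    // converting that to the blog's encoding is a separate step.
    std::string decode_q(const std::string& in)
    {
        std::string out;
        for (size_t i = 0; i < in.size(); ++i) {
            if (in[i] == '_') {
                out += ' ';                           // '_' encodes a space
            } else if (in[i] == '=' && i + 2 < in.size()) {
                out += static_cast<char>(
                    std::stoi(in.substr(i + 1, 2), nullptr, 16));
                i += 2;                               // skip the hex digits
            } else {
                out += in[i];
            }
        }
        return out;   // "KERZENST=C4NDER" becomes "KERZENSTÄNDER" in Latin-9
    }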

The movie „Ghost Dog” springs to my mind. Let me show you why by using a quote from the dialogues (or monologues).

Among the maxims on Lord Naoshige’s wall, there was this one: “Matters of great concern should be treated lightly.” Master Ittei commented, “Matters of small concern should be treated seriously.”

So, let’s take care of some minor details.

