Thursday, December 22

Kalba ir kalbininkai (Language and Linguists)

Sorry, English speakers, this post is about the Lithuanian language; writing about it in English is a bit of an oxymoron.

I recently stumbled upon a web page typical of people opposed to the Lithuanization of computing terminology. Quite a few people hold this opinion, and some like to voice it loudly on the web and in forums, while on the other side of the barricades such activists are far fewer. I will try to defend (I wouldn't be Gintautas otherwise!) the work of the linguists.

First I will attack the claim that is easiest to refute: that language is unimportant to national identity and is not a value in itself. I believe that one of the things that defines a nation is the mindset of its people, and language is a reflection of that mindset. The point is that the relationship works both ways: language in turn shapes thinking. Even though thoughts are not coherent sentences, we still rely on language. This idea is expressed more precisely in the Sapir-Whorf hypothesis. To put it simply, try to imagine a German who does not speak German. Or a Frenchman who does not speak French. A Japanese person who does not speak Japanese. A Chinese? A Finn? An Indian? An Italian? Finally, an Englishman? I think this is a strong enough argument that language is one of the principal foundations of nationality, if not the principal one, and a very important part of culture (which, incidentally, is also an argument that the purpose of language is not merely the transfer of information). Self-awareness matters too; without it a nation cannot exist, but self-understanding is not necessarily tied to nationality. People sometimes deceive themselves: they believe they differ from others when in fact the differences are very few.

I am not at all charmed by the idea of the whole world speaking a single language; it is just as well that it is hardly feasible. A way to communicate with other people has existed since time immemorial: simply learn foreign languages. It is not easy, of course, but it not only enables communication, it broadens one's horizons in general.

Let me also throw a stone into the Irish garden. The English language may have brought them economic prosperity through foreign investment, but some less welcome things arrived along with it. If I am not mistaken, Ireland was one of the biggest promoters of software patents. One of the reasons: Microsoft, firmly established in Ireland, could pull the politicians' strings with economic arguments. So the Irish are losing not only their language, but also power in their own country.

Now we can move on to the seemingly universally hated linguists. Those idlers just invent new words without understanding anything and poke their noses into other people's business, right? No. As I understand it, the linguists' goal is to preserve the purity of the Lithuanian language. I do not think that is a foolish goal, because the Lithuanian nation is not large, and other languages (English and Russian) therefore exert influence easily. I am proud of my language, and I would not want, fifty years from now, to end up with some "Lith-Russian" or "Anglo-Lithuanian" hybrid. Linguists try to keep out completely unsuitable loanwords because once we let a few in, everyone sees that it can be done, and there is no reason left not to let in more. That way a language can change drastically in a short time. It may look like going against the people, but in my view people simply tend to pick the easier solution without thinking about the long-term consequences, and weighing those consequences is precisely the duty of the linguists (and of the state in general), much like dealing with environmental pollution.

I disagree with the popular opinion that linguists are incapable of doing their job. First of all, they are specialists in their field, and I would rather leave translation to them than to some engineers. To take an example from software development: if a business application needs to be written, the work is done by a programmer, not by a businessman. It is the programmer's duty to study the problem domain and then do his job. The businessman may think he would write the program much better himself, but that is usually untrue. In fact, the businessman often does not even know what program he really wants. In exactly the same way, it is the linguist's duty to study the problem domain and then do his job. The programmer may think he would handle the terminology much better himself, but that is usually untrue. In fact, the programmer often does not even know what terms he really wants.

True, linguists sometimes come up with strange, even funny words. Sometimes they genuinely miss the mark, but it seems to me that most of the coined terms are good enough for everyday use. And the funniness of words is a very hazy notion. Try pronouncing any ordinary word, say "piešti" ("to draw") or "šluota" ("broom"), slowly, drawing it out, clowning around, without dwelling on its meaning. After such a procedure almost any word seems funny to me, and not because it is pronounced funnily; the odd pronunciation is needed to detach the word from its direct meaning and perceive only the sound and its connections to other words. Many new words can seem funny because of diminutives ("skreitinukas", a coinage for "laptop") or feminine gender ("derintuvė", "debugger"). They seem funny because they are unusual. The trouble is that English does not assign gender to things and has no diminutive forms, so such translations into Lithuanian look strange (we are used to gluing a masculine ending onto a foreign word, without any suffixes). Yet in everyday words such forms are quite common ("mikrobangų krosnelė", "microwave oven"; "degtukas", "match"; "mentelė", "spatula"). There you have an example of how loanwords impoverish a language.

It is unreasonable to claim that linguists want to change professional jargon. Professionals talking among themselves will most likely go on speaking half English, half Lithuanian, as they always have (alas, I have to do so too; I would gladly try using the Lithuanian terms if my colleagues did not object). The real question being solved is an important one: what do you do when you need to write a book on the subject, or present the material at a university? I believe my feel for the Lithuanian language is decent, and I find it very unpleasant to run into mangled loanwords in a professional text. I think many Lithuanians would feel the same.

Finally, I would like to point out an interesting internal contradiction. The people who oppose these new terms are often otherwise progressive: they are interested in new technologies, live modern lives and consider themselves open to other opinions. Yet opposing these terms is, as I see it, a deeply conservative act: "I have already been using this term for three months, so please don't change it, and anyway, your alternatives sound funny to me." The alternatives sound funny merely because they are not the same loanword (which, by the way, is probably "clear" only because of its obvious connection to the English equivalent). Little by little these words will enter the common language and become perfectly natural.

Wednesday, November 2

Hassle-free IPv6 connectivity

I have recently discovered that nowadays it's very easy to get IPv6 access if you're using Debian (or Ubuntu):

apt-get install tspc

Try ping6 www.kame.net; it should work out of the box. A tunnel to Hexago, the company behind www.freenet6.net, should have been established automatically (tspc is a client for Hexago's Tunnel Setup Protocol). You don't even have to register for an account, although it might be a good idea to get one if you are really planning to use IPv6: the anonymous tunnel broker is rather slow (I had ping times on the order of 300-500 ms), and you will probably want a statically allocated address space. Moreover, a UDP-based tunneling protocol that supports NAT traversal is now available, so you do not even need a public IP address to connect. This situation is orders of magnitude better than when I tried to set up IPv6 on my home router a couple of years ago.

At the moment there is not much you can do with your shiny new IPv6 connection. IPv6 is still languishing in obscurity, and there's not much to see on the IPv6 web yet: the dancing kame, some stats about fellow IPv6 users... nothing really practical. However, given the trivial setup procedure, I'm sure some uses could be found. One that comes to mind immediately is SSH access. Since the tunnel can punch through NATs, even machines behind routers can be reached without any trouble. While a 300 ms lag is not pleasant, there is also some security through obscurity to be gained: who would bother scanning 128-bit IP addresses? Of course, if you completely disable SSH over IPv4, a tunnel failure could be very unpleasant, so it still makes sense to have both running. For example, you can restrict IPv4 SSH access to the few static addresses that you usually connect from, while allowing connections from external sites over IPv6. Besides, it's quite nice to be able to fix those nasty firewall setup slips when you lock yourself out by accident (that has happened to me several times).
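
A rough sketch of such a dual-stack policy using iptables (the IPv4 addresses are placeholders and the rules assume you append them to an otherwise accepting INPUT chain; adapt them to your own firewall setup):

```shell
# IPv4: accept SSH only from a few known static addresses (placeholders),
# then drop all other IPv4 SSH traffic.
iptables -A INPUT -p tcp --dport 22 -s 192.0.2.10 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -s 192.0.2.11 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j DROP

# IPv6: leave SSH reachable from anywhere over the tunnel.
ip6tables -A INPUT -p tcp --dport 22 -j ACCEPT
```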

While IPv4 is not giving way to IPv6 yet, a fair amount of software and hardware already supports the new standard. A concise overview of the advantages can be found at NetBSD IPv6 FAQ.

Sunday, October 16

Design by Contract in Python

In the light of recent university lectures on Design by Contract, I decided to see if there are any nice Python implementations.

To begin with, Design by Contract is a technique for increasing the reliability of programs, particularly of reusable components. Invented by Bertrand Meyer more than ten years ago, it is a central notion in Eiffel, the programming language he created. There is a nice introduction to Design by Contract available; I'll try to summarize it very quickly.

Basically, in the context of software engineering a contract is a collection of obligations for a component. These obligations are divided into three major categories:

  • Class invariants are supposed to be (almost) always valid. For example, the attribute ISBN of a class Book should always be valid according to the ISBN checksum algorithm. A possible invariant for a class Window would be "not (window.maximized and window.minimized)".
  • Preconditions are input conditions, they validate the arguments of a method or function call. For example, a precondition for the function sqrt(x) could be that its argument must be non-negative (assuming that it cannot deal with complex numbers). The caller of the function is responsible for supplying arguments that don't violate preconditions.
  • Postconditions are result obligations. They validate the result of a function/method and state changes of an object. An example of a postcondition for sqrt(): "abs(result * result - x) < 0.001". Postconditions can also access attributes and ensure that they have (or have not) been changed appropriately.
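
To make these concrete, here is a minimal sketch of the sqrt example expressed with plain assert statements (no DBC library involved; the 0.001 tolerance matches the postcondition above):

```python
import math

def sqrt(x):
    # Precondition: the caller must supply a non-negative argument.
    assert x >= 0, "precondition violated: x must be non-negative"
    result = math.sqrt(x)
    # Postcondition: the result squared must be (approximately) the input.
    assert abs(result * result - x) < 0.001, "postcondition violated"
    return result
```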

DBC might not look very appealing to pythoneers at first glance, because it makes programs more rigid. Besides, unit and integration tests should catch all the problems, right? Well, not quite. I have experienced numerous bugs caused by violations of informally (if at all) described invariants. The problem with unit tests is that when you find a bug where one component (the "client") uses another component (the "server") incorrectly, you can fix that client and test it for regressions, but you cannot be sure that other clients don't misuse the server in the same way. Well, you can add assertion statements on the server, but I'm not particularly fond of assert statements, as they clutter up the code and are not suitable for checking a set of invariants in multiple locations. A nice thing about DBC is that, unlike static typing, which is an all-or-nothing affair, you can take it only as far as you wish.

Preconditions in particular can be very beneficial to beginners. I have heard at least several people complain about dynamic typing in Python because they were frustrated by mysterious errors coming from the depths of a large framework which turned out to be a result of incorrect API usage. Comprehensive preconditions can eliminate such problems much better than any static typing system.
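
As a sketch of how such argument checking could be packaged up (this decorator is hypothetical and not from any of the libraries discussed here), a precondition can be attached to a function so that every misuse fails immediately, at the call site, with the function's name in the error:

```python
def precondition(check):
    # Hypothetical helper: run a predicate over the arguments before every
    # call and fail loudly if it does not hold.
    def decorator(fn):
        def wrapper(*args, **kwargs):
            assert check(*args, **kwargs), (
                "precondition violated when calling %s" % fn.__name__)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@precondition(lambda items: isinstance(items, list))
def total(items):
    return sum(items)
```

A beginner passing the wrong type then gets a clear "precondition violated when calling total" instead of a mysterious error from deep inside the framework.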

In Zope 3 applications that I worked with, a fair number of assumptions was supposed to hold at all times, they were mostly mentioned in interface docstrings, but only occasionally checked in actual code, and certainly not systematically. We've had a fair share of problems with objects referencing other objects that should have been deleted ("hanging in the air") and with processing inconsistent data structures. In relational databases, such requirements could be checked using constraints and triggers, but I am not aware of similar mechanisms for ZODB. In the end we bolted on a component that would swoop through all the objects in our application and check various things, but this component had to be invoked externally, using a cron job or manually pointing the browser at a particular URL. However, this was still far from perfect, because the component was separate, so requirements for components were not localized adjacent to the corresponding code. Using the Zope 3 component architecture would help with that, but it would increase the overhead for adding checks.

A possible complication with DBC in Python is the performance hit taken by all the extra checks. For compiled languages it's not nearly as bad as in Python where running unit-tests already takes minutes for larger applications. However, Python has always favoured convenience and doing the right thing over performance, so, given the benefits, this is not against the ideology at all. I do not like the current situation in zope.interface where if you want something checked against the interface, you have to run the check manually.
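
One mitigation worth noting (this is standard Python behaviour, not specific to any DBC package): assert-based checks can be compiled out for production runs, since the -O interpreter switch sets __debug__ to False and strips assert statements entirely.

```python
def checked_sqrt(x):
    # Guarding the check with __debug__ makes the intent explicit; under
    # "python -O" the whole block is skipped, so the contract costs nothing.
    if __debug__:
        assert x >= 0, "precondition: x must be non-negative"
    return x ** 0.5
```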

Of the Python implementations of DBC, I liked Contracts for Python most. In fact, its author Terence Way even proposed to include DBC in Python (see PEP-316), but the PEP was deferred. The general idea of the implementation is to declare the contract in docstrings, similar to doctest. This is much more lightweight and convenient than the other approaches which require to inherit from special classes (or set metaclasses) and/or define new special methods. Here is an example of a contract from the implementation's homepage:

def sort(a):
    """Sort a list *IN PLACE*.

    pre:
        # must be a list
        isinstance(a, list)

        # all elements must be comparable with all other items
        forall(range(len(a)),
               lambda i: forall(range(len(a)),
                                lambda j: (a[i] < a[j]) ^ (a[i] >= a[j])))

    post[a]:
        # length of array is unchanged
        len(a) == len(__old__.a)

        # all elements given are still in the array
        forall(__old__.a, lambda e: __old__.a.count(e) == a.count(e))

        # the array is sorted
        forall([a[i] >= a[i-1] for i in range(1, len(a))])
    """

There is not much point in reiterating the concise documentation found in the package. You can find more examples on the "Contracts for Python" homepage.

It is unfortunate that the implementation is a bit stale: as of today it was last touched more than two years ago. I managed to find a trivial bug (patch) related to importing packages to be processed, but otherwise it still seems to work fine. I also tried to make it work properly with Zope interfaces, so that implemented interfaces are treated as superclasses. The problem is that interfaces are not quite ordinary classes, so there were some complications; I might look further into it if anyone is interested. By the way, speaking of Zope interfaces, they already include support for invariants (see zope.interface.invariant), although I prefer the "Contracts for Python" way; preconditions and postconditions, however, are not supported.

Saturday, August 20

Zope 3 views reloaded

One of the things I dislike most about Zope 3 is the time it takes to start the server, which becomes an annoyance when running functional tests or tinkering with view code and checking the results in the browser. It's so refreshing sometimes to work with page templates, where a browser reload is enough. I would frequently think how useful it would be if at least the views behaved the same way.

A little bit of peeking into the Zope 3 code and a little bit of coding, and we've got a package called z3reload (tarball) that does exactly that: it reloads the view code before each render. This only works for views, but in my experience views tend to be the largest and most complex part of the code, so writing them consumes the bulk of the programming time, and they need manual testing the most (unit tests are usually enough for the model classes).

Installation of z3reload is relatively simple: just drop some files in Zope3 package-includes, and specify which views to patch in z3reload-configure.zcml. Be sure to look through the README file.

z3reload might come in handy to speed up functional testing iterations as well, if you use the script runfdoctests.py by my co-worker Marius Gedminas (I hope he doesn't mind me publishing this script). Drop it into your Zope 3 directory and run it. This script runs the Zope 3 initialization routines just once, and then you can edit functional doctests and immediately run them without the overhead of reinitializing everything. The script is already a great timesaver. Thing is, now with z3reload you not only get to modify the functional test source without having to reinitialize Zope, but you can edit the view code as well.

It is a bit puzzling to me why such a thing has not been done before (well, at least I was not able to find it). I have heard there was something similar in Zope 2, and now Zope 3 people seem to have the notion that code reloading is not worth it, it's too complex and bug-prone to implement. Sure, my version is very limited, but I think it has significant productivity advantages.

Beware, I have not used or tested z3reload much. It's quite possible that it breaks horribly in some typical circumstances. Needless to say, don't use it on production servers or precious databases.

For my co-workers stuck with an old revision of Zope 3, from the times Zope 3 had services, I have put up a backported version of z3reload.

Monday, July 11

EuroPython 2005, continued

I promised to review some of the talks that I attended. I will cover those which were most memorable or useful.

Kit Blake presented the Document Library, which is basically a document archival application. What's interesting is that it's built using Zope 3 technologies. I think the need for this type of application is going to grow, and it's nice to have a free implementation. Besides, since it's built on Zope 3, it should be easy to integrate into other projects.

AlphaFlow, presented by Christian Theune, revealed to me how libraries can help manage workflow. Besides, I followed a link from AlphaFlow's webpage and found Workflow Patterns, which has some nice research on workflow.

Edward K. Ream, an impressive person in his own right, introduced Leo, which is similar to an outliner, but much more powerful. It is not easy to define Leo precisely, because it is a generic tool for managing information; a nice introduction is on the What Is Leo page. I was quite fascinated with Leo, but found it imperfect in some regards. Maybe it's just me, but the Tkinter GUI looks outdated and the whole application does not feel very polished. Leo is also very ambitious and provides its own editor, which I don't think I'd like to switch to, especially since I'm a vim user (it does provide some Emacs-like shortcuts). It has some novel ideas, but I ran out of patience trying to find an efficient way to work with it. I also did not want to lose the many very useful tools that work on the standard filesystem and integrate with vim/Emacs. It also seems to me that versioning should be in the big picture too, though that is obviously a hard problem.

Two talks by Theo de Ridder were energizing and even frenetic. It's obvious that he is an extremely smart person, but most people from the audience lost the thread about halfway through his talks, myself included. The fact that I think it's my fault shows that the presentations were excellent in other regards.

Kamaelia, presented by Michael Sparks, is a framework for asynchronous applications constructed from components linked via communication channels. The nice thing is that Python generators are used to achieve multitasking, which makes components much easier to write. It was interesting to hear that the best way they found to design parallel systems was to first write a simple modularized version and parallelise afterwards. I liked the basic idea of small components talking together to do complex tasks. It's like a simple and elegant application of parallelization to component-based architectures, which are more frequently built as libraries or have much more complex synchronization mechanisms. I was a bit disappointed by the state of the project (they don't even have Kamaelia packaged for easy download...).

Both Steve Alexander's talks that I attended were brilliant as usual. In particular the talk about Zope 3 security was interesting to me. The slides elucidated the Zope 3 security mechanisms and offered a nice use case in Launchpad. Even though the presentation took 60 minutes, it didn't seem long at all (unlike the other one, about Leo, which could easily have been compressed).

Tommi Virtanen talked about Twisted news. We had used Twisted in SchoolTool previously and it's possible that we will be using it in the future (there is, or at least was, talk of Zope 3 integration with Twisted). It was nice to hear about all the new things being worked on.

The talk on PyPy by its main contributors Armin Rigo, Holger Krekel, Christian Tismer and Carl Friedrich Bolz was lucid and helped me understand the structure and goals of PyPy. I was impressed by their early demonstrations of compiling simple Python code to C, giving a tenfold speed increase. They have come a fair way since the last Europython, where I also attended their presentation on PyPy. In fact, I participated in the PyPy sprint after the conference, where I helped a bit with the standard library update: PyPy uses the CPython standard library, but a few issues had to be resolved to bring the library up to that of Python 2.4.

Michael Hudson talked about Recoverable Exceptions. He was a bit too apologetic, but the talk came out fine; at least it was interesting to me. He talked about smarter ways of handling exceptions, that is, behaviours other than just aborting to the nearest except block, with no way back, when an exception is raised. While I'm currently quite satisfied with the standard exceptions Python has, it's food for thought.

Armin Rigo talked about greenlets, coroutines for Python. They offer generator-like switching, but rely on some C-level stack-manipulation magic rather than on Python generators. The presentation helped me finally understand what PEP 342 is all about. Now that Guido van Rossum has accepted PEP 342 and, I believe, the implementation has been checked in, most of what greenlets offer will eventually be in the Python core.

Web Application testing using Selenium was not particularly impressive, perhaps because I knew most of the things being presented. The idea and technology behind Selenium itself is sound and very practical though. If you are doing web application development, be sure to have a close look.

Michael Salib is another of those energetic speakers who deliver memorable talks. At this Europython he talked about Xapwrap, a wrapper for the Xapian text indexing library, and q2q, a peer-to-peer connection management protocol. The former might turn out to be a viable alternative to Lucene in some cases. q2q is rather immature at the moment, but it seems to be solving the right problems the right way. It's a pity that the slides for the q2q presentation don't seem to be available; they were hilarious.

I spent most of my time on the Python lightning talks track rather than the Zope lightning talks; the Python talks were so much more fun. I forgot most of the content, but I had a great time. I actually gave a lightning talk about darcs, David Roundy's revision control system, which I have found very useful. Michael Sparks, who spoke about Kamaelia, has a very nice summary of the Python lightning talks. By the way, to answer his doubts about my talk: yes, darcs is not written in Python, but it is as good at managing Python code as any other, a very useful tool for any developer. Steve Alexander gave a talk on bazaar-NG, which also looks very promising, but at the moment, in my opinion, darcs is much more mature and usable.

Most of the descriptions here were rather terse; you will have to excuse me for that, as there is just too much to cover. Reinout van Rees has a more thorough overview of the Europython 2005 talks that he attended (there is not much overlap with my list), so you should have a look there too.

Thursday, June 30

Europython 2005 report

Europython 2005 is now officially over.

I gave a talk on Monday covering gtktest, a small collection of helpers to make unit-testing PyGTK applications easy. Check out the slides (PDF).

I also delivered a lightning talk on darcs, which is a very well-designed revision control system; slides (PDF) are available. I use darcs for managing the gtktest code, and it has been great so far: much more pleasant to use than Arch, and much more powerful than Subversion, which I use at work.

I have compiled a list of talks that I attended. You can find a description of each talk on the Europython website, but I was too lazy to add direct hyperlinks on each and every talk. I did include links to day timetables, where you can find complete lists of talks (with hyperlinks to descriptions and slides).

Monday (schedule)

  • The art of giving a talk (Hellwig)
  • Architecture of a large Zope 3 system (Alexander)
  • A Python Framework for Rapid Application Development (Goodwin, Wrigley)
  • Document Library (Blake)
  • AlphaFlow (Theune)
  • MayaVi2 (Ramachandran)
  • The world according to Leo (Ream)
  • Teaching computational engineering (Fangohr)
  • Enabling bare Python as universal connector for ad-hoc networks (de Ridder)
  • Pulling Java Lucene into Python: PyLucene (Vajda)

Tuesday (schedule)

  • Kamaelia (Sparks, Lord)
  • Complex security with Zope 3 and an RDB (Alexander)
  • Twisted news (Virtanen)
  • PyPy as a compiler (Bolz, Krekel, Tismer, Rigo)
  • Recoverable Exceptions In Python (Hudson)
  • Greenlets: coroutines aren't stranger than generators (Rigo)
  • Where metaclasses surpass decorators (de Ridder)
  • Solving puzzles with Python (Niemeyer)

Wednesday (schedule)

  • Stupidity and laser cat toys: Indexing the US Patent Database with Xapian and Twisted (Salib)
  • The Python revolution in the publishing industry (Masini)
  • ItsATree - creating a multimedia editor (Gietz)
  • Web Application Testing with Selenium (Roeder, Roeder)
  • WYSIWYG interface design with CPSSkins and CPSPortlets (Orliaguet, Anguenot)
  • The Personal Internet Endpoint: Using Python and Twisted to write Reliable Peer-To-Peer Programs (Salib)
  • Lots of Python lightning talks (I did one too)

I will cover the talks that I liked best later.

Tuesday, June 21

Back again

It's been quite a while since my last post. Now, having finished my exam session, I hope to resume regular posting.

So, what's been up lately? Well, just a few days ago version 1.1.1 of SchoolBell, the free calendaring server I'm working on, was released. I am fairly satisfied with the reliability of this version, as it fixes most of the problems that surfaced in SchoolBell 1.1, which was tested by quite a few people. SchoolBell will not be developed further on its own in the near future; instead we will be working on SchoolTool Calendar (version 0.10 was released recently), an extension of SchoolBell that accommodates some education-related use cases. You can find some nice screenshots on the webpage and in the "screencasts" for SchoolTool and SchoolBell.

In other news: I'm leaving for EuroPython in Sweden on Saturday with my colleagues. I will be giving a presentation there on unit-testing PyGTK applications, which is not quite my area of expertise, but, well, they accepted the talk, so they can't complain. The code behind the idea still needs a lot of work, but I intend to clean it up a little and make it public before I leave.

Wednesday, April 6

SpamBayes

I was fed up with the low, but increasingly annoying flow of spam into my mailboxes, so I have finally decided to set up a spam filter. I chose SpamBayes, as I had heard some good things about it (besides, it's written in Python).

As I use Debian (unstable), installation was just an apt-get install spambayes away. Setup and integration with Evolution, my mail client, was a bit more tedious. (By the way, SpamAssassin might have been a more sensible choice, as it integrates well with Evolution.) First I investigated piping messages to one of the SpamBayes scripts; I even found a script (sb_evoscore.py) intended specifically for use with Evolution, but these solutions had a few drawbacks, so in the end I settled on the standard proxy server approach.

The user interface of the SpamBayes server impressed me. The server sports a simple web server which you can use to configure SpamBayes, review messages, train the filter or view statistics. Configuration of the server was straightforward.

The server is started by running the script sb_server.py, which resides in /usr/bin, so it should be on your path. I was slightly annoyed that the script immediately litters the working directory with files, and that it has no way of daemonizing, i.e., detaching from the terminal. I created the directory .spambayes in my home directory for storing the SpamBayes database. To run the server automatically, I whipped up a simple init script. It runs sb_server.py in the background as a specified user (just one user, though: this will not work on a multiuser system with several people running the SpamBayes server). You will need to create /etc/default/spambayes, where the variables DBDIR (the directory for the databases) and RUNAS (the name of the user) are specified, e.g.:

DBDIR=/home/gintas/.spambayes
RUNAS=gintas
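
A minimal sketch of what such an init script might look like (this is a hypothetical reconstruction, not my actual script; adapt the paths and commands to your distribution's init conventions):

```shell
#!/bin/sh
# Read DBDIR and RUNAS, then run sb_server.py in the background as that
# user, with the database directory as the working directory (so the files
# the script creates land there rather than littering elsewhere).
. /etc/default/spambayes

case "$1" in
  start)
    su - "$RUNAS" -c "cd '$DBDIR' && /usr/bin/sb_server.py &"
    ;;
  stop)
    pkill -u "$RUNAS" -f sb_server.py
    ;;
  restart)
    "$0" stop
    sleep 1
    "$0" start
    ;;
esac
```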

I have not yet figured out why, but after changing networks the SpamBayes server would sometimes wedge up and refuse to connect to the POP3 server because it could not resolve the domain name. For now, I have added /etc/init.d/spambayes restart to my suspend script as a workaround.

Since I had been planning to set up a Bayesian spam filter, for a while I had been marking my mail as spam in Evolution rather than simply deleting it. However, when I wanted to train the filter, I could not find the spam folder on my filesystem (Evolution stores mail in mbox format, in ~/.evolution/mail/local). My first try at training was to simply copy the contents of the Spam folder to a temporary mail folder, which would show up as a file, and feed that to SpamBayes as "spam", with the other mailboxes as "ham". However, I noticed that the filter didn't work well. Then I found out why the Spam mail folder was not showing up as a file: Spam is actually a virtual folder, and when a message is marked as spam, it is simply hidden from view rather than moved to a different mailbox. That makes some sense: if you change your mind about a message, you don't have to remember where it came from; it will reappear where it was. So the spam training went fine, but supplying the "good" mailboxes was a mistake, because they included the hidden spam too. In the end I had to create another temporary mail folder, copy some good messages into it, and use that one to train the filter.

Wiring up Evolution to use the proxy was easy. I had to change the POP3 server settings in my Evolution accounts to point to localhost:proxied_port as the server, so that Evolution would get messages with the spam indication headers. To use the filtering, I added two rules, one for messages tagged as spam by SpamBayes, and another one for "unsure" (the tag can be found in the header "X-SpamBayes-Classification"). I set the former one to give the message Spam status and mark it read, so I wouldn't even notice it, and the latter to mark the message as Spam but leave it unread, so that I would have a look at it before discarding it. These rules suit me well, as I have never had a false positive, and most of the "unsure" messages (21 out of 24) are spam.

After you train SpamBayes, remember to run a sanity check by querying some common "ham" / "spam" words - that is how I discovered my blunder. Words such as "money" and "rich" should show up as spam clues. As for ham, you know best which words are most frequent in your emails (in my case "python" was a clear shot at 87 ham messages vs. 0 spam messages).
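The idea behind the sanity check can be illustrated with a toy spamicity calculation. This is not SpamBayes' actual algorithm (it combines individual word probabilities with a chi-squared test), and the counts below are made up:

```python
# Hypothetical (spam_count, ham_count) pairs from a trained database.
counts = {
    'money':  (120, 3),
    'rich':   (80, 1),
    'python': (0, 87),
}

def spamicity(word):
    """Fraction of the word's occurrences that were in spam."""
    spam, ham = counts[word]
    return spam / float(spam + ham)

for word in ('money', 'rich', 'python'):
    print('%s: %.2f' % (word, spamicity(word)))
```

If "python" showed a high spamicity after training, you would know the ham corpus was polluted - which is exactly the blunder described above.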

Further training of SpamBayes is performed either by reviewing the messages through the web, or running a proxy for the outgoing mail server so that you can send emails to fictitious addresses used for notifying SpamBayes about mistakes. I went for the former. Now once in a while I visit the message review page to classify the unsure messages, though even that is probably unnecessary, as SpamBayes should be chugging along well enough with the existing database.

In conclusion, with zero false positives, zero false negatives and just a handful of unsure messages to date, I'm quite satisfied with SpamBayes. I had looked around on the web for information on using Evolution with SpamBayes and found very little, so I hope that this article will be useful to someone.

Monday, March 21

Poetry in Translation

I have recently discovered a page which abuses Google translation services to hilarious effect. Ever found oddly simple insights in automatic translation? Well, there's a fair dose of them in Poetry in Translation, which translates English to German to French to German to English. The comments section is a bit indecent, but there are some outright hilarious results:

  • "I have a broken heart" -> "I have a defective heart"
  • "Get busy living, get busy dying" -> "If you receive a life employed, you receive death employed"
  • "The quick brown fox jumped over the lazy dog." -> "the fast brown fox jumped on the putrefied dog."
  • "Just die, why don't you?" -> "Cubic straight lines, why not him?"
  • "you won't fool the children of the revolution" -> "children of rotation tromp you"
  • "the pen is mightier" -> "the feather is more powerful"
  • "Oh how i love my girl" -> "How how my girl is expensive!"
  • "One for all, all for one" -> "For all, all"
  • "Be Afraid, Very Afraid" -> "Have fear very timidly,"

Hard drive benchmarking

Marius Gedminas experimented a bit with hdparm, and his results showed no difference in linear disk read rate at the beginning of the disk as compared to the end. That made me curious. I whipped up a very simple Python script to time some plain reads from the disk, and the results are consistent with those from dd and with another benchmark of my new drive. Yes, using Python for benchmarking is a stupid idea, and the results are not stable, but I do consistently get almost 40MB/s at the start and no more than 28MB/s at the end of the disk. You can try the script for yourself:

#!/usr/bin/env python

import sys
import time

MiB = 2**20

BLOCKS = 128 # number of megabytes to read at a time
SPACING = 4 * 1024 # number of megabytes to seek forward


if len(sys.argv) < 2:
    print "You must supply a device (e.g., '/dev/hda') as an argument"
    sys.exit(1)
try:
    f = open(sys.argv[1], 'rb')
except IOError, e:
    if e.errno == 13:
        print ("Permission denied to read the device, you"
               " may need root privileges")
        sys.exit(1)
    raise # any other error: re-raise it rather than continue without f

offset = 0
while True:
    start = time.time()
    try:
        f.seek(offset * MiB)
    except IOError:
        break # We probably hit the end of the disk
    for i in range(BLOCKS):
        f.read(MiB)
    delta = time.time() - start
    rate = BLOCKS / delta
    print 'Offset: %d GB, read rate: %3.2f MB/s' % (offset / 1024, rate)
    offset += SPACING

print 'Finished.'

Monday, March 14

The Commonly Confused Words Test

English Genius
You scored 100% Beginner, 93% Intermediate, 93% Advanced, and 88% Expert!
You did so extremely well, even I can't find a word to describe your excellence! You have the uncommon intelligence necessary to understand things that most people don't. You have an extensive vocabulary, and you're not afraid to use it properly! Way to go!

Thank you so much for taking my test. I hope you enjoyed it!

For the complete Answer Key, visit my blog: http://shortredhead78.blogspot.com/.

My test tracked four variables. How you compared to other people your age and gender:
You scored higher than 77% on Beginner
You scored higher than 39% on Intermediate
You scored higher than 61% on Advanced
You scored higher than 97% on Expert
Link: The Commonly Confused Words Test written by shortredhead78 on Ok Cupid

New hard disk for my laptop

Just a few days ago I bought a new 7200rpm Hitachi hard disk (Travelstar E7K60) to replace the old 4200rpm Toshiba one that came with the laptop. Now I really regret not doing this earlier. This is easily the best performance investment for a laptop, unless you have a really old CPU or less than 256MB of RAM.

The speed increase is very noticeable. Bootup now takes only half the time it used to, and applications start significantly faster. Seeks are quieter on the new drive, and I have not noticed any extra background noise from the increased rotational speed. Battery usage has not changed at all. In general, I noticed only improvements and no regressions after upgrading.

While partitioning the new disk, I noticed that in my old Toshiba drive the root (/) partition was located at the very end of the disk, because of hysterical raisins. I did not have a separate partition for /usr, so its contents were there too. Make sure not to make my mistake of putting frequently accessed data at the end: hard disks are usually faster at the start. This is because the rotational speed of the disk is constant, but the circumference of outer tracks is larger, therefore, if the data density is uniform over the disk, the transfer rate is greater.
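The geometry argument can be put in numbers. With made-up radii (no real drive was measured here), constant rotational speed and uniform bit density give:

```python
# Hypothetical track radii for a laptop platter, in millimetres.
outer_radius = 30.0   # outermost data track (start of the disk)
inner_radius = 15.0   # innermost data track (end of the disk)

# At constant RPM and uniform bit density, the transfer rate is
# proportional to the track circumference, hence to its radius.
rate_ratio = outer_radius / inner_radius
print('outer tracks faster by a factor of %.1f' % rate_ratio)
```

Real drives group tracks into zones of equal density, so measured ratios (like 38 vs. 27 MB/s, about 1.4) come out smaller than the naive radius ratio.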

Speed of some drives may be more sensitive to track diameter than others. If you are curious, you can do a quick benchmark. My new one shows about 38MB/s linear read speed on the outer tracks (start of disk) and about 27MB/s on the inner ones (end of disk), a quite significant difference. To get these numbers, I used this command on Linux: sudo dd if=/dev/hda of=/dev/null bs=1M count=1024 skip=0. It reads a gigabyte of data from a given offset in the disk, the operation should take about half a minute. dd even counts the transfer rate for you. To measure performance on the inner tracks, adjust the skip parameter (e.g., about 37000 for a 40GB drive). You might want to repeat the command a few times and average the results. Do not pick an amount of data (the count parameter) less than twice your RAM, because the results may be skewed because of caching.
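If you don't feel like doing the skip arithmetic in your head, here is a trivial helper (dd with bs=1M counts in MiB blocks, while drive capacities are marketed in decimal gigabytes):

```python
def skip_for_tail(disk_gb, count_mib=1024):
    """MiB offset at which to start reading the last count_mib
    mebibytes of a drive sold as disk_gb decimal gigabytes."""
    disk_mib = disk_gb * 10**9 // 2**20   # marketing GB -> MiB blocks
    return disk_mib - count_mib

print(skip_for_tail(40))
```

For a 40GB drive this prints 37122, matching the "about 37000" rule of thumb above.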

Sunday, March 6

One-button testing

I wrote about some inefficiencies in my procedure of running tests a while ago. I really dislike the repetitiveness of commands to switch context, as I use the vim editor for editing code and a terminal for running the tests. Had it been code, I would have refactored it long ago. Now I decided that it's time to optimize this part a little bit.

My first idea, which I had come up with long ago, was to write a small script called loop, which would repeat a command given as an argument indefinitely, waiting for a keypress between runs. The script was an extremely simple one-liner:

while true; do "$@"; read; done

However, it only helped me lose the Up, Enter habit a little, as a bare Enter became sufficient.

The second go at the problem was on the right track. I decided to write a small Python script to behave much like the loop script, but it would register a global shortcut handler so that I could do an iteration without having to switch to the terminal.

The idea of handling global shortcuts was OK, but the implementation gave me some pain. I looked around for examples of registering global shortcuts with GNOME, but found nothing really useful. I then remembered that a multitude of KDE apps register global shortcuts, and decided to try the KDE Python bindings. In the end I wasted several hours scouring the web for information and watching my app segfault for odd reasons. It took me a long while to get the details mostly right, and because of reentrancy problems I managed to wedge my keyboard completely, so that I had to log in remotely and kill the Python process to get control back. I did not quite like having to load the KDE libraries either, which took a whole second to import on script startup. In the end I dumped this solution for a more simplistic approach.

After playing with the KDE shortcuts for a while, I finally understood that I don't really need a global shortcut, as 99% of the time I need to run the tests while I am working in Vim. This allowed for a much simpler system. The loop script has remained, but has evolved significantly from the one-liner. Notably, it now checks the return status of the executed command and prints a red or a green horizontal bar with a timestamp. It is very nice to have coloured feedback on whether the tests have passed. There is also an option to run an alternate hardcoded command.
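The status-bar part of the evolved script is easy to sketch. This Python version is my illustration, not the author's actual shell script: it runs a command once and prints a green or red ANSI bar with a timestamp depending on the exit status:

```python
import subprocess
import time

def run_once(command):
    """Run the command; print a coloured bar reflecting its exit status."""
    status = subprocess.call(command)
    stamp = time.strftime('%H:%M:%S')
    if status == 0:
        print('\033[42m PASS %s \033[0m' % stamp)   # green background
    else:
        print('\033[41m FAIL %s \033[0m' % stamp)   # red background
    return status

# Stand-ins for a real test command; any runner would go here.
ok = run_once(['true'])
bad = run_once(['false'])
```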

I implemented interprocess communication in a slightly hacky way, by having the loop script invoke another shell script for user input (this way I sidestepped smart signal handling in a shell script). The client, in our case vim, can then send a signal to the primitive sub-script with a simple killall command. Using the process name as a unique identifier is not very clean, but good enough for me.
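The same hand-off could be done without the helper sub-process if the looping side were Python: it can install signal handlers and let the editor send a signal directly. A self-contained sketch (delivering the signals to ourselves instead of via killall; signal choices are mine, not the author's exact setup):

```python
import os
import signal

chosen = []   # records which command variant each "kick" requested

def handler(signum, frame):
    # SIGUSR1 selects the normal command, SIGUSR2 the alternate one.
    chosen.append(1 if signum == signal.SIGUSR1 else 2)

signal.signal(signal.SIGUSR1, handler)
signal.signal(signal.SIGUSR2, handler)

# In the real setup another process (vim) would send these signals;
# here we kick ourselves to keep the example self-contained.
os.kill(os.getpid(), signal.SIGUSR1)
os.kill(os.getpid(), signal.SIGUSR2)

print(chosen)
```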

So, there are three scripts in total:

  • loop (to be used from the command line)
  • dumbass.sh (the stupid sub-process, only used internally)
  • dumbass-kick.sh, which abstracts the killall command in case I want to implement it in a cleaner way. It accepts '1' or '2' as an argument. If you pass '2', loop runs the alternate command.

To be able to run the tests from vim by a single keypress, I added these lines to my .vimrc to bind the given command to F12, and the hardcoded alternate command to Shift+F12:

nmap <silent> <F12> :wall<CR>:silent !dumbass-kick.sh 1<CR>
imap <silent> <F12> <C-o>:wall<CR><C-o>:silent !dumbass-kick.sh 1<CR>
nmap <silent> <S-F12> :wall<CR>:silent !dumbass-kick.sh 2<CR>
imap <silent> <S-F12> <C-o>:wall<CR><C-o>:silent !dumbass-kick.sh 2<CR>
These commands also save all modified files before running the tests. I added that because occasionally I forget or mistype the write command in vim and then waste time trying to understand why the tests are misbehaving. All in all, this provides true one-button testing: you don't even have to leave insert mode.

If you really want a global mapping, you can map a key to invoke dumbass-kick.sh in your window manager. That's a more lightweight solution than importing KDE libraries just to use their shortcut mechanism.

Even though I mostly use these scripts for running unit tests, they could be useful whenever you need to repeat the same command lots of times. For example, I have been toying around with Lilypond (music typesetting software) a bit, and I used loop on lilypond to generate DVI output on a keypress so that I could see my changes immediately without stopping to type and switching windows. The script could be useful with make when working with compiled languages.

Tuesday, March 1

Dictionaries

Having a computer look up words for you in a dictionary can be a great timesaver, especially if you need to check many words. There are several options for a computerised dictionary.

A simple and straightforward choice is to use a web-based dictionary. You can usually find some quite complete and verbose ones with many examples. Some sites even provide other linguistic data (this database really impressed me). Besides, Google is always handy to search for extra information.

If you have a text that has many unknown words, it might be faster to run it through a general-purpose translator, such as translate.google.com. You will lose precision, but at least you might get a good laugh from the results.

Having the dictionary installed locally is more convenient (you do not need an internet connection) and faster (almost zero latency). For some languages specialised dictionary software is available, but there is also the dict network protocol which defines a standard way for a generic dictionary client and a dictionary server to communicate. This model is quite powerful.
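The protocol itself (RFC 2229) is a plain line-based exchange: the client sends a DEFINE command, and the server answers with numeric status lines and definition bodies terminated by a lone dot. The parsing below runs against a canned response rather than a live server, and the database name and definition text are invented:

```python
# A canned dictd-style response to: DEFINE * foo
canned = [
    '150 1 definitions retrieved',
    '151 "foo" some-db "A hypothetical database"',
    'foo: a metasyntactic variable',
    '.',
    '250 ok',
]

definitions = []
current = None
for line in canned:
    if line.startswith('151 '):             # a definition body follows
        current = []
    elif line == '.' and current is not None:
        definitions.append('\n'.join(current))
        current = None
    elif current is not None:
        current.append(line)

print(len(definitions))
```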

Setting up a dict server on Linux is not hard, in Debian it's just an apt-get install dictd away. Note that you will need to install the dictionaries yourself. Debian provides some dictionaries, e.g., dict-de-en. There is also a number of dict-freedict-* packages, but I have the impression that they are not very complete.

Now that you have a server running, you need a client to use it. There are quite a few clients available, I will mention several:

  • dict - the command-line client. Just type dict foo in a terminal and you'll get the query results immediately. Very handy, but not convenient for looking up many words.
  • gnome-dictionary - the GNOME dictionary client (screenshot, look on the right - not quite to the point, but it will do). It looks nice at first, but in my opinion it is not very usable: I hate the popup window that appears when no matches are found for a query. It also pushes GNOME's "live preferences" to the uncomfortable limit - when you enter a new server, the same preferences dialog box is immediately reconfigured to that server, which looks very awkward.
  • kdict - the KDE dictionary client (screenshot). I slightly disliked it because the input box would lose focus after executing a query, so I would have to use the mouse to enter another word. Jeez, even the web-based dictionaries get this right with some JavaScript. The problem can be worked around by mapping Ctrl+L to the "Clear Input Field" action. I would rather it selected the word instead of deleting it, but this solution is satisfactory. In addition, kdict offers database sets, which turn out to be very helpful. In most dict clients you can query either a single dictionary or all available dictionaries; database sets are like virtual dictionaries in between. One use case is when you want to translate from English to another language and have several dictionaries for that purpose, but don't want the general-purpose ones to get in the way.

After finding out about the Ctrl+L tweak I liked KDict best. I would prefer it to be a GTK+ application rather than Qt, but it is practical, which matters most to me.

Sunday, February 13

Piano resources

I get most of my piano sheet music from the internet, for free. Guitar players are probably in an even better position, with tablature for most pop songs (and classical pieces too, I suppose) freely available. However, there is a fair share of free sheet music for the piano as well.

Because of copyright law, most sites do not carry material created later than the early 20th century, so finding modern pieces is a problem. However, there are plenty of public-domain classical (in the broad sense) pieces available. When searching for a piece, you will probably want to visit several sheet music repositories before you turn to Google, which will give you heaps of trash to wade through.

If you are looking for a piece by a very famous composer, chances are that there is a dedicated site where you can download sheet music / recordings (e.g., www.chopinmusic.net, www.jsbach.org). You may want to look at such sites first, they are usually quite complete and the quality is good.

When I am looking for classical music, my first stop is usually the Sheet Music Archive. The info page says that the site contains over 4000 pages of sheet music. It is a pity that it only allows two downloads per day (unless you are smart and guess the filename of the piece), but that is usually enough.

If you can read Russian, you might find Boris Tarakanov's Sheet Music Archive at notes.tarakanov.net very handy. I have only discovered it recently, but already found a few pieces which I had been looking for. You do not need to actually understand Russian to browse the site — an online translation service might work if you can't read text in Cyrillic.

mfiles.co.uk seems to have a little bit of everything. You can find various well-known pieces here, along with some comments on them. This site is nice to browse when you are looking for new material.

It is sometimes helpful to hear a piece to decide if you like it before looking for the score. In other cases, you only know the composer and the melody, but not the name / number of the piece. kunstderfuge.com has an extensive collection of classical MIDI files. Of course, MP3s are more pleasant to the ear, but they are also harder to obtain and take much longer to download. MIDIs can usually give you the basic idea. Googling for MIDI files is easier than searching for free sheet music, but still a pain, so look here before you wander off.

Sibelius, the most popular notation software package, has a large repository of Sibelius scores on sibeliusmusic.com. You do not need to have Sibelius, but you will have to install Scorch (Windows and Mac only), a free browser plugin to view sheet music and play the pieces. Although all pieces on the site can be played and viewed for free, most (but not all) will cost a few dollars if you want to print them out. As this is a community site, there is also a lot of music here that is not worth your time, but there are quite a few gems to be discovered too.

If you are willing to pay money for the sheet music, the best sites are probably sheetmusicplus.com and virtualsheetmusic.com. This is just my impression however, as I have not used either of these services.

Finally, if you are after a relatively modern or rare piece, you might want to check out pianofiles.com. You will not find sheet music on the site itself, but it can provide you with e-mail addresses of people who have the piece you are after. You may then contact them by e-mail and ask politely to send the score to you. You might be asked for something in return. You can search the database anonymously, but the system will not show any e-mail addresses until you register. You can find a list of sheet music that I have on my member page.

Sunday, February 6

Cryptonomicon

I have recently finished reading "Cryptonomicon" by Neal Stephenson, kindly lent to me by Marius. To begin with, it was quite a bit thicker than I'd like (almost a thousand pages), but in the end the book was worth the time.

I think there is no point in retelling the plot, as you can find it in lots of places. In fact, I do not think the plot was quite top-notch. Most of the book feels a little sluggish, and I could not see where things were going. Only the last hundred pages were really interesting plot-wise for me. I might have been happier had the remaining part been shorter.

This book seems to excel at style, however. Stephenson is not afraid to spend lots of time describing elaborately crafted environments and delving into details. There is a fair bit of intelligent humour and sarcasm thrown in.

There are some geeky parts, about computers and cryptography, that made me slightly uneasy. Some I might consider insulting to my intelligence, like the large sections about modular arithmetic or simple text transformations. The scenes concerning computers seemed somewhat out of place (why would I care whether Randy was writing a bash or a Perl script, or what UNIX commands he was typing?). Maybe it's just me and my technical background; perhaps Stephenson paid these details just as much attention as he did the non-technical ones.

Frankly, I do not have much to compare the book to, so take my comments with a grain of salt. Personally, I found the book quite enjoyable.

Thursday, February 3

PyQLogger reloaded

The author of PyQLogger (a nice and functional PyQt-based blogging client) has been very helpful and has promptly fixed most of the problems I had mentioned. Even an improved spellchecking interface is in the works. A version with the fixes included has not been released yet, but you can check it out from the project's Subversion repository:

svn checkout svn://svn.berlios.de/pyqlogger/stable-1.x pyqlogger

In addition, a Debian package is now available. Unlike the source tarball, the APT package does include most of the recent updates and fixes.

Sunday, January 30

Tidbits for developers who use the Vim editor

Vim is an excellent general-purpose text editor, but it is relatively stupid by itself in some situations. Thankfully, it can be scripted. You can find many scripts on www.vim.org/scripts, I will mention a few useful ones here.

Every decent code editor has a shortcut to quickly comment / uncomment a block of code. A similar effect can be achieved in Vim by installing FeralToggleCommentify. This plugin works with a great number of filetypes. I especially like to use it to comment out HTML/XML markup, because XML comment tags are tedious to type (the script comments each line of the selection separately, though). The C-c binding, although on a nasty key, is quite useful as well - it makes a copy of the current line and comments it out.

If you work with Subversion and write commit messages with Vim, you might find svn-diff handy (you will need Python scripting support in vim). It shows the diff of your commit in a pane below the commit message. Besides, Vim does syntax highlighting, so the patch is easier to read than on the console. Remember that if you see something wrong in the diff, you can always abort the commit by not saving the commit message file or, if you have already saved it, deleting the file (:!rm %).

Python coders will appreciate the smart handling of indentation by the alternative python indent script. It is sometimes too smart and therefore annoying, but works well in most cases.

The XML editing plugin is also nice to have if you deal a lot with HTML / XML. Its functions appear to be quite useful (I keep forgetting the bindings): jumping between opening and closing tags, enclosing content in tags and deleting enclosing tags, etc.

If you work in Vim a lot, I highly recommend remapping Escape. The standard position makes you move your hand away from the home row. I have remapped Escape to Caps Lock, which I never use. It takes a few minutes to readjust to the standard Escape position when working in Vim on other computers, but that is a small price to pay for the increased productivity. On X you can remap the key with xmodmap: run xmodmap ~/.xmodmaprc after creating a .xmodmaprc file in your home directory containing these two lines:

clear lock
keycode 66 = Escape

Make sure to have a look at a post by Marius Gedminas about CTags and id-utils if you are not familiar with these two timesavers.

I have uploaded my vimrc; maybe you will find something useful in there. I highly encourage you to go through the Vim internal features that are turned on in the script, as you will probably want to use most of them. Do not expect the script to work out of the box - you will have to remove the sections that depend on other scripts and plugins.

Time bug in kernel 2.6.10

Kernel 2.6.10 has an uncanny ability to lose track of time while sleeping. Whenever I suspended my computer and then woke it up, I would find the clock hours or even days ahead of reality.

I tried to circumvent the problem with hwclock, but that made things even worse. Evolution would get extremely confused the moment I resumed my laptop and it would suck up all CPU and then some. In the end I would have to switch to a text console, login (wait 20 seconds for the prompt to come up), killall evolution evolution-alarm-notify evolution-data-server-1.0, wait another 5 seconds for the processes to actually get killed and then see the flurry of messages from modules being loaded by my resume script.

To fix the timeshifting problem I had to modify the code as described in this LKML post. The fix is in 2.6.11pre. If you decide to try out 2.6.10, and you use suspend, make sure to apply the patch.

A new Scheme standard in the works

I was surprised to find out that R6RS, a new revision of the Scheme standard, is being prepared. There is a report (PDF) on the progress of the standard.

The most important thing in the new standard is that a module system will be defined. Some interesting things have been considered, such as language case-sensitivity (no decision), non-hygienic macros (no decision), square brackets equivalent to parentheses (passed). Lots of practical suggestions (Unicode support, regular expressions, I/O, hash tables, object-oriented programming, network programming, serialisation, etc.) are also being reviewed.

I was a bit disappointed that many ideas are either still undecided or have been postponed for R7RS. Still, it is nice to know that the language keeps evolving and improving.

Thursday, January 27

Nasty memory leak in X

Quite recently a friend of mine found the cause of a leak in X that has been plaguing me for ages. The problem is that the X cursor library does not free animated cursors properly. As a result, my X server would hoard hundreds of megabytes of RAM after a few days of use.

If you are using an animated mouse cursor theme, you might want to check whether you are affected (most machines seem not to be, for reasons I have not yet figured out). Run top and find the XFree86 process (you may want to sort by memory usage by pressing M). Then have a look at the value in the 'RES' column (it is typical for X to hoard lots of virtual memory, so you should not pay attention to the other values). If it is significantly more than about 40m, chances are that you are experiencing the aforementioned leak. I would suggest working around the issue for now by switching to standard non-animated X cursors.
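If you would rather script the check than eyeball top, the RES figure comes from the kernel's VmRSS field, which on Linux can be read straight from /proc. This sketch inspects its own process; for the real check you would open /proc/<X server pid>/status instead:

```python
# Read the resident set size (the number behind top's RES column)
# of the current process from /proc - Linux-specific.
rss_kb = 0
with open('/proc/self/status') as f:
    for line in f:
        if line.startswith('VmRSS:'):
            rss_kb = int(line.split()[1])   # value is reported in kB
            break

print(rss_kb > 0)
```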

More information about the bug and a sample application to reproduce the problem is available in the Debian bug tracking system. As far as I can see, the problem has not yet been fixed. By the way, my tests indicate that the problem is pertinent to X.Org too.

Wednesday, January 26

PyQLogger

Since I found out about Gnome Blog, had a look, liked what I saw, and then discovered that it could not set post titles on blogger.com, I have been looking around for other blogging applications that can cooperate with blogger.com. I found several, and finally settled on PyQLogger.

While PyQLogger does not look as neat as Gnome Blog, it is quite functional. It uses a smart interface to blogger.com - posts can also be retrieved from blogger.com, edited and republished without touching the web browser. I really liked the idea of saving unfinished drafts. HTML highlighting and easily accessible post preview are quite nice to have. Spellchecking is also supported, although it is not very convenient.

Since PyQLogger is based on PyQt, it does not blend in with my GNOME desktop, but I do not really mind - there are worse things. The interface is a bit overcrowded, and there are some very annoying bugs: the hyperlink dialog is broken (the "Cancel" button works as "OK", and the "OK" button does not work at all), and the Home/End keys jump to the start / end of the paragraph rather than the current line. The OSD library also seems to be used out of plain vanity in situations where a statusbar message would do.

Still, PyQLogger does its job. If someone weeded out the bugs and cleaned up the interface, it would be excellent. For now it is merely acceptable, yet still better than most of the other clients I have tried.

Monday, January 24

Kernel 2.6.10 on a Toshiba laptop

I upgraded the Linux kernel to 2.6.10 about a week ago. I had been using 2.6.8. Here are some impressions.

There were a few pleasant changes. I have already covered CD packet writing earlier. ACPI support has also improved a bit. A minor but very pleasant detail is that my laptop automatically resumes now when I open the lid.

I was surprised to find out that suspend-to-disk kind of works now. It used to hang during resume; now it resumes successfully, but maims USB. After resuming, the kernel starts spewing messages such as these:

Jan 16 17:26:42 localhost kernel: usb 1-2: khubd timed out on ep0out
Jan 16 17:26:49 localhost kernel: usb 1-2: khubd timed out on ep0in

As my USB mouse stops working after this, suspend-to-disk is still not viable for everyday use for me, but at least it seems to be getting there.

I have also noticed that the clock would be at least a few hours off after each resume. I worked around this by saving the system time to the hardware clock right before suspending, and restoring it just after resuming. Nonetheless, I keep finding premature Evolution alerts after resuming. Really annoying; I hope this will be fixed.

3D support in X still breaks after suspend. It's not really important though, I can restart the X server when I really need 3D working. Besides, it is quite slow anyway.

People who own a Toshiba A15-S157 laptop might find my kernel configuration and suspend script useful.

JScheme

I have recently come across an interesting project called JScheme which, as far as I understand, is something similar to Jython, but for the Scheme language. As Scheme syntax differs a lot from Java's (Python syntax is very similar in comparison), JScheme adopted a special syntax called the Java dot notation to manipulate Java objects, which is simple but adequate.

Since JScheme is implemented in Java, you can run it as an applet too. An online demo is available. The interface is cumbersome and ugly, but enough to demonstrate the concept.

Saturday, January 22

Packrats

It is interesting that people have absolutely no problem filling up the latest storage devices despite the rapid advances in data storage technology. This topic has seen some attention from the Slashdot crowd. I do not think that the amount of useful data on a typical computer grows at the same astounding rate. So, where does the space go?

Probably most of the data is multimedia - movies and songs. I wonder how many of those are played just once or twice and then left alone for the rest of time, or until the hard drive dies, whichever comes first. It is none of my business what people do with their multimedia, but I think it is foolish to hoard things that you will never use.

Movies, in my experience, are a one-off thing. Some do deserve several viewings, but then you do not wait a year before watching them again. I cannot think of a good reason not to delete a movie, unless it is one of the very few special ones that you would actually want to see again after a long time passes. Could it be because it "took me a week to download, so it's precious"? Perhaps a friend might want it some time in the future (I can hear the MPAA growling, but I'm not in the USA, so I do not care), but then save him/her a few hours and lend a good book instead.

Songs are a slightly different matter, in particular because I have a larger store of them than I would like (80MB of MIDI from the old times, 5GB of MP3s just locally on my laptop and 350MB of sheet music). I could argue that I do listen to various pieces occasionally, and I understand why people are reluctant to delete music. However, I still maintain that collecting for the sake of collecting is not a bright thing to do.

Collecting trash is not smart either. Some people store every bit of information they have ever come in contact with. I say, who cares about the SMS messages, archives of mailing lists, notes, articles that seemed "interesting" at the time, artifacts of experiments and primitive one-off scripts? In theory, they could come in useful one day, but in practice they overwhelmingly do not.

I suspect that some people put up with the trash just because they are too lazy to clean it up. Unorganised files pile up, and in a few years they have trouble finding anything. Well, I think that, just like in real life, it pays to have a tidy work environment.

The big question is, why am I discussing this? I think that usability and efficiency problems arise out of the sheer amount of cruft accumulated, and we do not notice. Technical problems too, but those are less important. Database-like file systems are promising, but maybe we could get by with what we have now if we revised our habits a bit. I am not comfortable with the fact that the amount of useless information is increasing so rapidly, and that we are battling it by improving search technology. It would be much better if the signal-to-noise ratio itself could be improved.

That is a sensible message for developers too: users should be encouraged to throw away what they do not need.

Such an approach to tidying things up does not work on distributed systems, for obvious reasons. That is why we do need good search capabilities and the semantic web. However, in most cases, you are the boss of your computer, so locally you can organise things however you want.

Besides searching, there are other practical reasons to keep only useful data around. Backups are smaller and therefore easier to make, therefore you make them more frequently, therefore your data is safer, QED. Furthermore, there is less risk of accidentally throwing away something important while cleaning up junk when you need some extra space. And for me it is a nice feeling that my computer is not a huge pile of virtual trash with the important things buried somewhere inside.

Sunday, January 16

CD packet writing on Linux

The CD packet writing mode on Linux allows you to mount rewritable compact discs as ordinary read/write media. It caught my attention when I was upgrading my Linux kernel to 2.6.10. Apparently, patches to implement packet writing had been floating around for some time, but now it has made it into the mainline kernel (darn, can't wait until reiser4 is merged).

It is fairly obvious that packet writing support makes using rewritable discs much easier. I have abhorred floppy discs since I got a CD writer. Floppies are so unreliable, and their capacity is appalling by modern standards. However, burning CDs involved a lot of overhead - I had to start a burning application, find an unused CD, erase it, create a new compilation, and only then burn it. That made CDs inconvenient for storing and exchanging small documents of a few megabytes.

Probably the biggest disadvantage of packet writing is that you lose about a sixth of the disc capacity - GNOME shows 550MiB free on a fresh 700MB disc. That means that you will have to burn movies the old-fashioned way, otherwise they will not fit.

Packet writing is slower than burning normal images, which is not surprising. I did a small benchmark. Copying a directory of 11 files, about 10 megabytes each, 90MB in total, took about two minutes of real time, including spin-up time and unmount time. My CD burner supports 10x (1500KB/s). Linux seems to cache writes, so working with the disc feels snappy (well, until you fill the buffer), but unmounting takes some time. In theory, the minimum time required would be one minute plus spin-up time, so the result is not bad. If I were burning the data as an image, I would have had to erase the disc first and then wait while the lead-in and lead-out were written in addition to the data itself.
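The arithmetic behind that estimate can be sketched as a quick back-of-the-envelope check (assuming the drive actually sustains its rated 10x speed):

```python
# Back-of-the-envelope check of the benchmark figures above.
# Assumes the drive sustains its full rated 10x speed (1500 KB/s).
data_kb = 90 * 1024     # 90 MB of test data, in KB
rated_speed = 1500      # 10x CD speed, in KB/s

ideal_seconds = data_kb / rated_speed
print(round(ideal_seconds))     # about a minute at full rated speed

# The measured two minutes of wall-clock time works out to roughly
# half the rated throughput once spin-up, file system overhead and
# the final unmount flush are included.
effective_speed = data_kb / 120
print(round(effective_speed))   # effective KB/s over the whole run
```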

Compatibility with Windows is extremely important. Modern versions support UDF without problems. I just tried to read the CD on Windows XP. It had the title "LinuxUDF" and included the folder "lost+found" and an empty file "Unallocated space", but the other files were read properly. Oh, and that darn GNOME had created a directory called .Trash-gintas which I had not noticed. Out of curiosity I tried to write to the CD on Windows by drag and drop, and it did not work; therefore Linux CD-RW support is superior.

Here is a guide to get CD packet writing working on Debian GNU/Linux (with udev):

  1. Upgrade the kernel to 2.6.10. If you are compiling it yourself, you will need UDF read/write support and packet writing support (CDROM_PKTCDVD).
  2. modprobe pktcdvd (add pktcdvd to /etc/modules if you want). You obviously do not need it if packet writing support is statically compiled into the kernel.
  3. apt-get install udftools. You do not need to create the device nodes when asked to; they are for the older packet writing interface.
  4. Edit the file /etc/default/udftools: uncomment the DEVICES declaration and set the value to your CD device ("/dev/cdrom" should work in most cases). You will need to run /etc/init.d/udftools start afterwards.
  5. Find an unused rewritable disc, preferably one that supports fast speeds, and insert it.
  6. Run cdrwtool -d /dev/cdrom -q. The CD will be erased and formatted, and a UDF file system will be created on the disc.
  7. While the CD is being prepared, which will take a while, mounting can be set up. Create the directory /mnt/cdrw, then edit /etc/fstab, add this line:
    /dev/pktcdvd/0 /mnt/cdrw      udf       rw,user,noauto,noatime 0 0
  8. After the CD has been prepared, mount /mnt/cdrw.
  9. I had some permission problems - /mnt/cdrw was owned by root with no world-write permissions. If you change its permissions, the changes appear to be saved onto the CD. I used chmod 1777 /mnt/cdrw. Maybe there is a better way.
  10. Try copying files to the mounted disc. I especially like the fact that deleting files is trivial. Run umount /mnt/cdrw when you're done playing. If you use GNOME, you should also be able to mount and eject UDF discs by a single click, just like standard CDs. Use the CD-RW icon instead of CD-ROM in the Computer window.

By the way, you can use any file system, not just UDF, but remember to use the noatime option so that the disc is not written to on every read. You do not want to put too much wear on the disc, as discs can endure only about a thousand rewrites.

Saturday, January 15

On blogger.com

Here's an easy one for the Friday night.

A few people have been asking about blogger.com, so I thought I would briefly cover my experiences here.

The things I like most are that there is no hassle with setting up software on a local server, and that design is a matter of choosing a template. The former has a few drawbacks - I cannot have trackbacks, as they are not yet supported by blogger.com, and backups are not trivial. I do not care much about either of these. As for design, I can make things look neat, but not attractive and beautiful. Preset templates saved a whole lot of time, although I did spend almost an hour picking one, as they are all quite nice.

My absolute worst complaint about blogger.com is that it treats paragraphs in a completely braindead way. There are two modes: either no paragraph breaks are inserted (you have to write <p> and <br> tags yourself), or each line break is replaced with a <br> tag. That means that you cannot have normal paragraphs in <p> tags. Even if you leave an empty line, two breaks are inserted and that's it. I decided to stick with the manual markup option. The developers of blogger.com really ought to address this problem; it does not look hard to fix.

I was a bit appalled by the customisation of links to other blogs. Let me explain how templates work: you choose one from a list (previews are available) and then you can further fine-tune the HTML source. Well, you have to edit the source to add links, which means that if you decide to change the template, they will be lost. I had hoped for a more generic method - links are just a collection of titles and hyperlinks, how hard can it be to implement a page to manage them?

Today Marius Gedminas brought my attention to gnome-blog, a small GTK app for posting to blogs, written in Python. It looks a bit unpolished, but seems to be quite functional. And judging from the code, it should insert those dreaded <p> tags to form paragraphs. This is my first post through this nice program; I will soon find out if publishing works correctly.

In conclusion, I could recommend blogger.com as a way to publish blogs without hassle. It may not have the latest features, but there should be more than enough for most people.

gnome-blog report: it did handle the paragraph tags correctly (yay!), but the title I entered was put as the first line of the post, and the real title of the post was left empty. Well, since it's Python, such a minor problem should be easy to fix.

Friday, January 14

Me, I, umm, myself...

While writing English texts, especially this blog, I frequently shuffle with discomfort as I cannot find a way around using the words "I" and "me" several times in almost every sentence of every paragraph. I guess that using these words a lot is natural, as I am writing a lot about myself, yet it takes some getting used to. Using the passive voice is an alternative, but it causes more stylistic problems than it solves. I think that some prolific writers structure their sentences to dodge the problem somehow, but I just cannot grasp the technique, much less adopt it.

The awkwardness probably comes from my native language. In Lithuanian there is no need to repeat the "I"s, because every verb has a distinct first-person form. The only analogy I can think of in English is the word "am", as in "I am here"; other verbs do not make the distinction ("I live" vs. "you live" vs. "they live"). Actually, I frequently spot naïve translations from English which indicate the agent explicitly when it is not needed, and this makes the sentences blunter and less friendly. I seem to remember a text which discussed fundamental differences between languages where the agent is implied and ones where it is not, so I might have covered just the tip of the iceberg (note to self: cut down on clichés).

One could speculate whether using "I" and "me" frequently has a subconscious effect on the writer (or the reader). However, as I am not a psychologist, I will refrain from that discussion.

I will again slightly change topic here and use the chance to mention E-Prime. In short, E-Prime is a subset of English which does not include forms of the verb "to be". There are already enough resources explaining it on the web (such as this one), so I need not do that here. I just thought I would mention it, as it tries to deal with other blunt aspects of the English language.

Thursday, January 13

¿Punctuation?

Most of my friends know about my uncanny ability to notice typos everywhere. I am not sure why that is, as I do not read all that much (which I do intend to change). There is a bit of irony in that I am working on SchoolTool (school administration software) at the moment. Darn, I wish I were good at punctuation instead of spelling, which is becoming less useful as automatic spellcheckers get better. Which brings me to the point.

Recently I have come across an interesting essay called "The Philosophy of Punctuation" by Paul Robinson. It is rather wordy, but well worth reading. I have picked a few things that I could apply to my own writings.

Most importantly, the essay drew my attention to the fact that I am overusing punctuation. I used to like long, intricate sentences with dashes, parentheses and semicolons. Connecting ideas in various ways seemed interesting and original. However, Robinson claims that overusing punctuation indicates a lack of writing ability. "Expository prose is linear," and therefore ideas should be presented sequentially. Dashes and semicolons "betoken stylistic laziness"; they are a sign of weakly connected ideas.

With regard to long sentences, I read somewhere on the web that varying sentence length helps keep the reader interested. I still rarely use short sentences, but I'm getting better. At least my recent writings are stylistically lighter and more lucid than the ones I wrote a few years ago. I am still dissatisfied with the structure of my writing; that will probably take time to improve.

I found the personified descriptions of the punctuation marks highly entertaining and persuasive. The style also reinforces the message that it is not a good idea to try to outsmart yourself. It is not clever to obfuscate content, and trying to be modern is a bad excuse. Making the text overly complex distances you from the reader, which hinders not only the readability of the text, but the emotional response as well.

Robinson also expressed the old idea that "Good writing is as much a matter of subtraction as creation." This is good to be reminded of, and so obvious that I need not add anything.

For technical people who care about their writing, I would suggest having a look at Lyn Dupre's "BUGS in Writing: A Guide to Debugging Your Prose" (amazon.com). Although in my opinion Dupre sometimes overshoots in her quest for lucidity, that is not necessarily a bad thing, as the goal is to help people who are already used to a formal, rigid writing style.

Friday, January 7

Python unit test runners

As I do test-driven Python development almost every day, I care a lot about testing infrastructure.

Currently my development environment of choice is Vim + a terminal (for running PyUnit tests). It's rather low-tech, but it works reasonably well. I usually do not use Vim's QuickFix for tests because it is a bit cumbersome (I do use it in combination with idutils). There are two main inefficiencies in my current approach.

First, to run the tests I have to save the changed files (which I occasionally forget to do), then M-tab to the terminal window and press Up,Enter to rerun the previous command, which most likely was the test run. The latter step, now pretty much a reflex since I do it hundreds of times a day, does unexpected things when I forget that the previous command was something different. I have tried to improve things with a shell snippet, while true; do ./test -pv foo bar ; read; done, which requires only Enter to be pressed. I find it hard to drop the habit and still press Up,Enter most of the time. Well, at least unexpected things do not happen. Yet this is clearly a poor man's approach. It also makes changing the arguments to the test runner slightly harder.

Second, I currently jump to errors manually. QuickFix could do it automatically, but it sometimes does not work as I would like, because sometimes I want to jump to a function higher up in the stack rather than the one that raised the exception. Typing the line numbers in by hand is not much trouble, but it is tedious.

The page that drew my attention to the inefficiencies of my testing routine is One Button Testing. It contains some interesting ideas. I especially liked the suggestion to use a smart test runner that automatically runs the tests when you save the code. For starters, it might be quite easy to write a small script that monitors a directory (either by polling or by utilising dnotify/inotify) and invokes an ordinary text-mode test runner when a file changes.

I have thought up some features (in addition to the usual ones) that my imagined test runner should have:

  • GUI. Command-line interfaces are really powerful and great for scripting, but they require typing, which is exactly what we want to avoid. (I would not mind if the backend was a text-based test runner.)
  • Global keyboard shortcuts. Some examples: C-M-a to run all tests, C-M-l to only run tests that failed the last time, C-M-s to show and hide the test result window. All without having to switch away from the code editor.
  • Notification area support. It would be great to see without switching windows if a test run is in progress, whether the last run had failures or not, and other things.
  • File monitoring. This is the idea from OneButtonTesting. The critical thing here is to run the tests only for the module that has been updated. Otherwise we either have to run all the tests of a project (slow) or type in a specific subsystem to test (error prone and requires typing).
  • Integration with Vim and Emacs. As discussed, clicking on a traceback to jump to the corresponding code in the editor would save some keystrokes. It would be great if it also did not require using the mouse (e.g., use Left/Right to switch between failed tests, Up/Down to walk between stack frames in a traceback, and Enter to go to the highlighted code).

There are already quite a few test runners out there, and some of them already implement a few of the items in the above list. First, the text-only ones:

  • The Zope 3 test runner is quite usable and has some nifty features.
  • The SchoolTool test runner (download) has been written by Marius Gedminas, my colleague and a very bright Python hacker (if you are a Python programmer, make sure to have a look at his blog). I prefer this test runner over the Zope 3 one, because it is faster (demonstration) and neater.

There are also some GUI test runners out there already. In my opinion they are inferior to the text runners available, at least the ones I have tried. Here is a list:

  • The PyUnit test browser has integration with vim, idle and kate, but that's about it. It is hard to figure out, requires a few dependencies and in general looks unmaintained. Besides, it uses Tkinter and therefore looks ugly.
  • The PyUnit GUI test runner (screenshot) is rather simple and not very useful. It uses Tkinter too.
  • Pycotine is for Macs, I have not tried it.
  • PyUnitOSX seems to be a port of the PyUnit GUI test runner to the Mac. I have not tried it either. Well, at least it looks nicer than the original.
  • The NUnit GUI test runner (screenshot) is not a Python test runner, but I wanted to mention it because it gets some things right. Notably, it sports file monitoring -- whenever you rebuild your project, the runner notices and resets status of all tests. Besides, it is not ugly. However, other things in my wishlist are missing.

As a sidenote, I would like to mention the std library by Armin Rigo and Holger Krekel, two top-notch Python hackers. However, it is buried in obscurity; I do not think it even has a web page. You can check it out from the CodeSpeak Subversion repository (here). The library makes unit tests more Pythonic by dissecting assert expressions (it provides the values of context variables along with the traceback). It also encourages use of the plain assert statement, which is so much cleaner than the Java-esque self.assertEquals and friends. Since assertEquals was used primarily so that the values of variables would be shown on failure, there is no need for it if you use the std library. There are more cool ideas; go see for yourself.
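The contrast is easy to illustrate. This is my own toy example, not code taken from std; it only shows the two styles side by side:

```python
# A toy contrast between the unittest assertion methods and the plain
# assert statement that the std library encourages (my own illustration,
# not code from std).

import unittest


class JavaStyleTest(unittest.TestCase):
    def test_sum(self):
        # The comparison is hidden inside the method name; the method
        # exists mainly so the two values get printed on failure.
        self.assertEqual(sum([1, 2, 3]), 6)


def test_sum():
    # With a runner that dissects the expression on failure, a bare
    # assert reports the offending values too, and reads like plain Python.
    assert sum([1, 2, 3]) == 6
```

The second form carries the same information with none of the ceremony, which is exactly the point std makes.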

I would implement the stuff I described myself, but I already have quite a few unfinished matters (I will cover them in other posts) and my conscience would not let me drop them to start yet another project. Therefore, I am hoping that someone with too much free time and good Python skills will have a look at my wishlist, say "These are neat ideas!" and go on to implement them in a couple of evenings. Make sure to let me know if you do. By the way, I have seen a pygtk-based GUI test runner by Marius Gedminas. I suspect that it is rather incomplete, but I would not be surprised if Marius, being so ingenious, came up with something really useful, as he has with the SchoolTool test runner.

Darn, another essay-post. My original intent was to provide content, but this is too time-consuming to keep up. I will have to try to be more concise.