Choosing A Language

24 June 2004 at 23.15 • in General

My last post reminded me that I haven’t expounded on (or indeed, mentioned) my choice of implementation language for Trifle. First, a word of explanation: I am a language junkie. Learning a new programming language is one of my favorite things to do. Not surprisingly, then, the hardest part about choosing a language has been just picking one and sticking with it.

Early on, I thought long and hard about designing my own language to write Trifle in. After much experimentation, I finally listened to the little voice of reason and decided to choose an existing language (and perhaps extend it DSL-style). My list of potential languages was eventually whittled down to Scheme, ML, and Python.

I’ve been a fan of Scheme ever since it rocked my worldview in college, but haven’t had the opportunity to use it in large programs. And unlike in my college days, there are now well-rounded Scheme implementations like PLT Scheme, which have a decent set of libraries and an active user community.

ML is one of those languages I keep thinking I’ll like, but never really do. When I first encountered ML (in the form of SML/NJ), I thought “it’s Scheme without the parentheses and with static type inference — what’s not to like?” Since then, though, I’ve come to appreciate Scheme’s (and Lisp’s) parentheses, both for clarifying the structure of nested expressions (which functionally-styled programs tend to have a lot of), and for the gift of macros. On top of this, ML’s admittedly elegant type system seems to inevitably lead to incomprehensible type errors whenever you do something as simple as misplace (or omit) a parenthesis. (BTW, on this go-round I tried OCaml as well, and found its syntax unnecessarily odd and its error messages even more obscure.)

Python impresses me the most of the current generation of “scripting” languages. I thought I’d hate the significant indentation, but have grown to love it. Unlike many other languages, it feels like it was designed for maximum usability (rather than for provability, paradigm enforcement, etc.). There’s a huge and growing community of Python users and a never-ending supply of open-source Python modules.

So, I finally chose Python as my initial implementation language, with the deciding factor being the community and the available libraries. I even coded several experiments in it. But of course I couldn’t leave well enough alone. After sorting through many twisted attempts to simulate concurrency in Python without using threads, I made the jump to MzScheme on the strength of its concurrency support. MzScheme has ultra-lightweight OS-independent threads, with communication primitives modeled on those of John Reppy’s Concurrent ML.
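For readers wondering what thread-free concurrency in Python can look like, here is a minimal sketch of one common technique: cooperative tasks written as generators and driven by a round-robin scheduler. This is not one of the experiments mentioned above. The names `run` and `worker` are hypothetical and exist only for illustration.

```python
from collections import deque


def run(tasks):
    """Round-robin scheduler: resume each generator in turn until all finish."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task finished; do not reschedule it
        queue.append(task)      # task yielded; put it at the back of the line


def worker(name, steps, log):
    """A hypothetical task that records each step, then yields control."""
    for i in range(steps):
        log.append((name, i))
        yield                   # cooperative switch point


if __name__ == "__main__":
    log = []
    run([worker("a", 2, log), worker("b", 1, log)])
    print(log)
```

Each task gives up control only where it yields, so there is no preemption and no locking, but a task that never yields starves all the others. That limitation is part of what makes genuinely lightweight threads with channel-style communication attractive.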

I’ve become very comfortable in MzScheme, but the availability of PyLucene has made me once again wonder if the overwhelming availability of libraries for Python is its trump card. After all, Trifle as I’ve outlined it will be heavily involved in using and joining together disparate components, and there’s no question Python is better for that than MzScheme (right now).

PyLucene Released

24 June 2004 at 23.11 • in General

The nice folks at OSAF have just released PyLucene as a separate project from Chandler, all packaged up and compiled with no JVM dependency in sight.

This might be enough to get me to switch back to Python as the main development language for Trifle.

Orthogonal Components

23 June 2004 at 17.11 • in General

I’m still trying to make sense of the different choices of components and how to combine them into a coherent, extensible whole. The main decision is how and where user data is stored; once that’s chosen, the question is how to do searching, organizing, versioning, and sharing of that data.

Here’s one take on how it could be: User data is stored in text files in a single directory tree. Alongside the user’s actual files, metadata is stored in record-jar format.
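As a rough illustration of how little code the metadata side might need, here is a minimal record-jar reader. It assumes the usual conventions for the format: records separated by lines beginning with `%%`, fields written as `Name: value`, and indented lines continuing the previous value. The field names in the usage example (`File`, `Tags`, `Title`) are invented, not part of any Trifle design.

```python
def parse_record_jar(text):
    """Parse record-jar text into a list of dicts, one dict per record."""
    records = []
    current = {}
    last_key = None
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if line.startswith("%%"):
            # "%%" closes the current record; the rest of the line is ignored.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif not line.strip():
            continue  # skip blank lines
        elif line[0] in " \t" and last_key is not None:
            # An indented line continues the previous field's value.
            current[last_key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


if __name__ == "__main__":
    sample = (
        "File: notes/todo.txt\n"
        "Tags: work, urgent\n"
        "%%\n"
        "File: notes/ideas.txt\n"
        "Title: A long\n"
        "  title\n"
    )
    for record in parse_record_jar(sample):
        print(record)
```

This sketch keeps only the last value when a field name repeats within a record; a real reader would need to decide whether repeated fields are allowed.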

This tree can be versioned using any ordinary VCS (e.g. Subversion). It can also be indexed and searched by a separate component (e.g. Lucene) that doesn’t need to know about the versioning. With the right kind of input-munging of the record-jar metadata, you could even use the search component to do metadata searches (Lucene, in particular, supports fields and field-based searching). The tree itself could easily be shared out by any Web server, and given all the heavy lifting already accomplished by those three components, a front-end UI would be a relatively thin layer. Not only that, but running scripts against the tree and doing low-level fix-ups would also be easy.
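To make the idea of metadata searches concrete, here is a toy field-based inverted index in plain Python. It is a stand-in for what a real search component such as Lucene would do, not Lucene's API. It assumes the metadata has already been read into dictionaries, and the names `build_field_index`, `search`, and the `File` identifier field are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(value):
    """Split a field value into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", value.lower())


def build_field_index(records, id_field="File"):
    """Map (field, term) pairs to the set of documents containing them."""
    index = defaultdict(set)
    for record in records:
        doc_id = record[id_field]
        for field, value in record.items():
            if field == id_field:
                continue  # the identifier itself is not indexed
            for term in tokenize(value):
                index[(field.lower(), term)].add(doc_id)
    return index


def search(index, field, term):
    """Return the sorted document ids whose `field` contains `term`."""
    return sorted(index.get((field.lower(), term.lower()), set()))


if __name__ == "__main__":
    records = [
        {"File": "a.txt", "Tags": "work, urgent"},
        {"File": "b.txt", "Tags": "Work", "Author": "me"},
    ]
    index = build_field_index(records)
    print(search(index, "tags", "work"))
```

Because the index only sees field names and values, it needs no knowledge of how the tree is versioned or served, which is the kind of separation described above.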

Components: XML Databases

22 June 2004 at 14.34 • in General

When it comes to semistructured search (as opposed to unstructured full-text search), I’m a little more inclined to roll my own in Trifle, perhaps along the lines of Agenda. To keep this project manageable, though, I know I’ll need to reinvent as few wheels as possible. Toward that end, I’ve been scrutinizing the current crop of XML databases to see if anything might fit my needs.

Berkeley DB XML, Xindice, and eXist look the best so far. Of the three, eXist stands out as having built-in full-text-search support, so I might have a chance to kill the proverbial two birds with one database.

How does versioning fit into that, though? Or, for that matter, having a plain directory tree of text files, which is awfully nice for all kinds of quick script hacks?

Components: Subversion

22 June 2004 at 14.19 • in General

The Subversion version-control system is, design-wise, simply a better CVS, with the obvious flaws fixed (e.g. file renames now work like they should). This makes it a lot less ambitious than arch or darcs, which try to decentralize the underlying model so that it doesn’t depend on one central repository.

Subversion works very nicely, though, and its built-in support for WebDAV as an Apache module would get me a lot of functionality for free in Trifle. One touchy point, however, would be how easy it is to hook up a search engine to be able to search past versions of files, not just current ones. This might well be more difficult with Subversion than with other approaches, since Subversion stores repository data in Berkeley DB rather than in text files.

Components: Lucene vs. Swish-E

22 June 2004 at 14.19 • in General

I don’t want to reinvent the search-engine wheel, so I’ve been looking for a nice full-text-search library. Lucene would be ideal were it not implemented in Java — I really don’t want a Java dependency. However, the Chandler folks have compiled Lucene to native code via gcj and then hooked the result up to Python, resulting in PyLucene. Perhaps something similar would work for me.

The other leading open-source search-engine contender is Swish-E, which is a C library. Unfortunately, it’s under the GPL, so I’m reluctant to use it (as I mentioned before).

Components: Mozilla

22 June 2004 at 14.18 • in General

The Mozilla platform looks like a very solid base on which to build a cross-platform UI. I go back and forth on whether I should use XUL to build Trifle’s GUI or try to make it with pure cross-browser HTML/CSS/JavaScript. Either way, I’ll be accessing that GUI via Mozilla Firefox most of the time, so I’m considering a two-level strategy where Trifle-in-Firefox has all the bells and whistles, but Trifle-in-IE degrades gracefully (if I’m accessing Trifle from a library computer or something).

Components

21 June 2004 at 18.14 • in General

Here’s another short series of entries, this time about components I’m thinking of using. I’m hoping that Trifle will build upon open-source components to the point where all I have to write is some connecting glue and a nice front-end UI. (I can dream, can’t I?)

My ground rules: any component must be open-source, and must be cross-platform (it must run on WinXP, Linux, and Mac OS X). I’d prefer to avoid any GPL’d components, ’cause I want to release Trifle under the LGPL or a BSD-style license (not sure which yet). That’s a bendable requirement, though; if an otherwise ideal component comes along that’s under the GPL, I’ll probably use it.

We Don’t Need New Ideas

18 June 2004 at 14.31 • in General

Tim Bray picks up on the same point that jumped out at me from Joel Spolsky’s Win32-API missive: full-text search has been around for ages, yet there’s still no simple way to search your files. This specific point about search has been making the rounds recently, as more and more people point out how much easier it is to find something on the Web than on your own hard drive.

The bigger point behind this is that even in the ultra-young world of computing, there are still scads of great ideas that haven’t yet made it into common use. We don’t need new ideas — we just need to make use of the old ones.

Case in point: After decades of being ignored by mainstream programmers, garbage collection has finally made it into common use, and people are realizing what Lispers and others have been saying all along: garbage collection boosts productivity and reduces bugs. Now, there were good reasons GC didn’t catch on earlier. In the olden days, GC was too slow and memory-expensive for production use, but with the steady march of Moore’s law and the improvement of GC algorithms, we’ve reached a point where many, if not most, applications are best written using GC.

I wonder what other ideas are out there that might engender similar benefits, but still haven’t made it onto the radar. As Bray and Spolsky point out, full-text search is one — with today’s huge hard drives, there’s plenty of room to keep a big index. I think another no-brainer use of disk space is to keep past versions of everything; Joey Hess is onto something.

Related Projects: ZOË

17 June 2004 at 18.19 • in General

ZOË is an email proxy that slips in between your mail server and your mail client — and then organizes and indexes your email up, down, and sideways, with a Web interface. A great idea — too bad it’s only for email.

Jon Udell wrote a good article on ZOË and What It Means. I agree heartily with his assessment that “fulltext search … is only part of the value that ZOË adds. Equally useful is the context it supplies.” I hope that Trifle can offer the same combination of fulltext search and metadata context, but for a wider range of data than just email.
