Berkeley DB XML

I just went surfing Sleepycat Software to read some documentation (don't ask me why, I just realized I have it all installed locally), when I noticed this: Berkeley DB XML. Apparently not yet released, but interesting anyway. I'm not big on XML DBs, but with Sleepycat doing it, it might be worth checking out. At least they have a great history of doing high-quality, simple, lightweight products that just do their thing with minimum fuss. And if they continue with their licensing policy, all the better. Then we just need some Python bindings...

Article page

Content management tools fail, Plone marches on

A Jupiter study — I'm usually highly sceptical of these, but hey, maybe they're right for once — says businesses are often dissatisfied with their content management solutions. Complexity, lock-in, overengineering, ridiculous prices abound.

Meanwhile, in Plone land: Plone Improvement Process (a formalized approach to improvements: this sort of thing seems to be pretty popular in the larger free software projects these days, have you noticed?) has a few interesting improvement proposals for an already fine system: Oslo Skin system and Navigation and validation rewrite.

These coupled with improvements to and integration of CMFTypes and possibly the brand spanking new TTWType promise a bright future. It's not perfect and probably not suitable to all situations (there's a clear community site bias, but it doesn't have to be used that way) but it is easy, powerful and has a lot of momentum. And clearly there's still lots of space in the CMS field to conquer. Zope 3 is coming, but this Zope 2 thing is nowhere near its end of life.

Article page

Asynchronous fetching seems to be working

Coolness. Straw is actually successfully fetching the dozens of feeds I'm reading, asynchronously, without the troublesome network thread. DNS lookups with ADNS (ha, yes, another dependency), reading stuff off network via asyncore. And the UI is pretty responsive. It's still slower than it used to be, but I think that's a tradeoff I can make. And, as I said earlier, things can possibly be tuned to work out better.

However, one somewhat serious usability problem remains: the stuff is basically fetched off a queue, where things go like this: request an URL -> resolve the host asynchronously -> add the split URL + IP to the fetching queue -> fetch data asynchronously -> stuff the results into the internal datastructures. Now, when you press 'g' or when the timer triggers a polling run, we request all the RSS URLs, then when we have finally gotten something, we parse it, see if there are any images, and request those. See the problem? Unless name lookups are in some cases seriously slow, all the articles are fetched before any of the images. I suppose that's fine if you are interested only in the text, but if not, and you are sitting in front of your aggregator, waiting to read the words of the bloggerdom hot off their keyboards, you get to read the items without images.

So maybe I should prioritize the images higher. That is, stuff the image requests to the front of the queue. And I suppose that as long as we are talking data structures instead of, say, things you see in Helsinki in front of bars Friday and Saturday evenings, it won't be a queue any longer. Oh well. That's theory for you.

Article page


The asynchronous networking rework of Straw is progressing pretty nicely. I've already got the basics working: feeds are polled successfully and images are fetched. The only significant bit of code to be converted is the subscription tool / rssfinder.

It hasn't been totally easy. Converting an existing code base which depends on synchronous network operation to be asynchronous is a bit hairy in places. In fact, rssfinder is one beast which makes me consider leaving some synchronous networking in place, but we'll see about that.

Things aren't all peachy in the brand new asynchronous world, though; first of all fetching data over the network is, at the moment, noticeably slower than in 0.15. However, I suspect that tuning a few constants (timing and limits on open sockets) a bit might help here a lot. And Straw speaks HTTP 1.1, I could try using the same connections for multiple files, even though that would complicate the code a bit.

The second problem is that DNS lookups are still freezing the program. First thing to do, I guess, is to implement a name cache, so it'll be an one-time (per session) cost. But I suspect I'll have to do something about the lookup part itself. The options are, I guess, either using ADNS or forking a separate lookup process. I've been dependency-happy in the past, but I'll have to think what to do with this one; I have no idea how common ADNS is.

Oh, and a big THANK YOU to Fredrik Lundh for the series about building EffNews on his web site. I have no prior asynchronous networking programming experience, and his examples have been extremely helpful.

Article page

The Problem With XML

XML, despite still having immense momentum, is not without problems. These days it probably is the pragmatic solution for many jobs, thanks to the multitude of tools for various platforms for processing it. That doesn't mean it couldn't, and shouldn't, be better. Not surprisingly, people persist in discussing its faults and how it should be.

Erik Naggum, in his characteristic style, details his beef with XML with a very long article [via Python Owns Us] where he makes many good points about the evolution of data formats and gives his suggestions on what should be changed in the XML format to make it better. While I'm not totally convinced about the binary format bit, otherwise I do agree with his points.

However, I think Aaron Swartz put it best in one little sentence, even if he didn't suggest any improvements: "XML [...] encourages you to be conservative in what you accept and liberal in what you put out.", in direct contrast to the traditional — and sensible — ethos of protocol designers. That is the single thing that is most damning to XML. Thanks to Aaron for articulating it.

Article page