Parsing Atom Feeds

The Feedreader is one of the standard applications that ship with qooxdoo. Behind a rather unostentatious surface it is quite an interesting showcase of networked OO programming. Feeds are retrieved from a server, processed and displayed in a three-pane layout. Retrieval goes through a small proxy (to overcome same-origin restrictions) that delivers the data in a JSONP format containing feed meta information and a list of individual feed entries. When you look at an individual feed entry in Feedreader you might notice the “read more…” link at its bottom. The idea behind this item is to link to the original feed entry and open it in a new window when clicked. Populating this link for an RSS entry is straightforward, and the current Feedreader offers a couple of RSS feeds.

But two of the feeds are Atom feeds, and here the story is different. RFC 4287, which defines the format, lists a whole bunch of elements a feed entry can have, but it seems none of them is the definitive place for a URI that could hold the link to the entry’s source. There are things like “atom:content@src” or “atom:id”, but their semantics are often vague or their values are restricted to IRIs, “Internationalized Resource Identifiers”, which are, roughly speaking, arbitrary strings as long as they are world-wide unique. Often the processor is explicitly required not to assume they are dereferenceable (e.g. with atom:id).

This leaves it up to the feed provider whether and where to provide a source URI with every feed entry. One of the Feedreader’s sample Atom feeds, daringfireball.net, embeds them in the atom:content element of the feed entry. That means they are somewhere in a string (CDATA) section with an ‘href=’ prefix, where you might be lucky enough to pick them up reliably. Another Atom feed, from blog.whatwg.org, uses the optional “xml:base” attribute of the content element (with full URIs, in contrast to daringfireball.net, which only provides a true base URI in this attribute). It also uses the “atom:link” element to provide links to the entry’s source, but it specifies atom:link twice for every entry, and the one with the “alternate” rel attribute is the one you are interested in. (And, in case you wondered, the RFC in general “assigns no meaning to the content (if any) of this element”[*].) daringfireball.net, on the other hand, uses atom:link too, but only to link to articles the blog entry is about, not the entry itself. Go figure!
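
To make the lookup order concrete, here is a minimal sketch in Python (not the Feedreader's actual code, which is qooxdoo JavaScript working on the proxy's JSONP output). It assumes raw Atom XML as input. The sample feed, the example.org URIs and the helper names entry_source_link and source_links are made up for illustration. It first looks for an atom:link whose rel is “alternate” (RFC 4287 says a link without a rel attribute is to be treated as “alternate”), and only then falls back to the first ‘href=’ it can find in the content text.

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample data, loosely modeled on the two patterns described above.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:example.org,2009:entry-1</id>
    <link rel="replies" href="http://example.org/entry-1/comments"/>
    <link rel="alternate" href="http://example.org/entry-1"/>
    <content type="html">&lt;p&gt;Some text&lt;/p&gt;</content>
  </entry>
  <entry>
    <id>tag:example.org,2009:entry-2</id>
    <link rel="related" href="http://elsewhere.example/article"/>
    <content type="html"><![CDATA[<p><a href="http://example.org/entry-2">Permalink</a></p>]]></content>
  </entry>
</feed>"""

# Matches href="..." or href='...' inside escaped or CDATA-wrapped markup.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']')


def entry_source_link(entry):
    """Best-effort guess at the URI of an entry's original page, or None."""
    # 1. Prefer atom:link with rel="alternate" (a missing rel counts as "alternate").
    for link in entry.findall(ATOM_NS + "link"):
        if link.get("rel", "alternate") == "alternate" and link.get("href"):
            return link.get("href")
    # 2. Heuristic fallback: the first href found in the content text.
    content = entry.find(ATOM_NS + "content")
    if content is not None and content.text:
        match = HREF_RE.search(content.text)
        if match:
            return match.group(1)
    return None


def source_links(feed_xml):
    """Return one guessed source link (or None) per atom:entry in the feed."""
    root = ET.fromstring(feed_xml)
    return [entry_source_link(e) for e in root.findall(ATOM_NS + "entry")]


print(source_links(SAMPLE_FEED))
```

Both steps are heuristics. A feed that uses a plain atom:link (without rel) to point at the article being discussed would fool step 1, and step 2 simply takes the first link in the content, whatever it points to.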
