Bayesian filtering of RSS feeds – can you automatically find interesting journal articles?


In Aggregating sources for academic research in a web 2.0 world, I wrote about keeping up with your research using RSS feeds from:

traditional databases (citation alerts, tables of contents of favourite journals), library OPAC feeds of searches and new additions, book vendor sites (e.g. Amazon), book sharing sites (e.g. LibraryThing), social bookmarking sites, both generic (e.g. Delicious) and research 2.0 sites (e.g. citeulike), Google alerts and more.

The main problem with this, of course, is that you quickly get overwhelmed with results. In many cases you can’t create a custom RSS feed (e.g. many libraries provide RSS feeds of “new additions” only in broad subject areas like Economics), and even where you can, say with an EBSCOHOST database search delivered as RSS, the most finely tuned search query can still bring up quite a lot of irrelevant results.

The answer is, of course, filtering. Bayesian filtering has proven very successful at sorting mail into good mail and spam, but it can be generalized to classify text into any number of categories of any kind.

Can one do the same with RSS feeds, in particular the table-of-contents feeds from journals? The idea is for the Bayesian filter to learn which words tend to occur in the articles (or rather, the abstracts) you find interesting, and then classify new items as “interesting” or “not interesting”.
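To make the idea concrete, here is a minimal sketch of a naive Bayes text classifier in Python. It is only an illustration of the general technique, not how any particular service works: the class name, the two labels, the simple tokenizer, the add-one smoothing and the sample abstracts are all my own assumptions.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes classifier (illustrative sketch only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training items seen
        self.vocab = set()           # every distinct word seen in training

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase alphabetic words are good enough as features.
        return re.findall(r"[a-z]+", text.lower())

    def train(self, text, label):
        words = self.tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior: how often this label occurred in training.
            score = math.log(self.doc_counts[label] / total_docs)
            # Add-one smoothing so unseen words do not zero out the score.
            denom = sum(counts.values()) + len(self.vocab)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training items standing in for abstracts you have rated.
feed_filter = NaiveBayesFilter()
feed_filter.train("Bayesian spam filtering for library RSS feeds", "interesting")
feed_filter.train("Text classification of journal abstracts", "interesting")
feed_filter.train("Crop yields under drought conditions", "not interesting")
feed_filter.train("Soil nitrogen and wheat yields", "not interesting")

print(feed_filter.classify("Filtering journal RSS feeds with a Bayesian classifier"))
```

In practice you would train on the titles and abstracts of feed items you have marked by hand, and the more items you rate, the better the word statistics become.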

I’m aware of three services that do Bayesian filtering of RSS feeds. Two are commercial web services (FeedZero and Feedscrub) and one is an open source project (SuxOr).

For a longer, more rambling treatment, see my more detailed blog post here.
