[Reader-list] Fellowship posting #3

Subramanya Sastry sastry at cs.wisc.edu
Sun Jun 27 10:42:23 IST 2004


My third posting on my FLOSS project is enclosed below.  Comments, questions,
or suggestions welcome. -Subbu.


       News Rack: Automating News Gathering and Classification
       -------------------------------------------------------

Abstract:
---------
Several organizations in the social development sector monitor news that is
relevant to their work.  This is a time-consuming and laborious process for
some groups, especially when the news is monitored, marked, cut, and filed
using hard copies of newspapers and magazines.  Prior experience with the
press clippings page on www.narmada.org indicates that some of this work can
be automated.  This simplifies the task of news monitoring and also saves
time.

This project attempts to automate news monitoring, and aims to provide tools
for classifying, filing, and long-term archiving of news.  The project will
deliver a tool that can be installed, and will also provide all the same
services on a website for those who do not want to (or cannot) install the
tool.

Current status of project (tentatively called News Rack)
--------------------------------------------------------
In my previous posting, I discussed how news filtering can be automated
via a 2-step process of specifying filters for classifying news into
categories of interest.  I used the Narmada issue as an example to
illustrate the process.

Since then, I have been working on three threads:
(1) Integrating RSS news feeds with the news filtering mechanisms
(2) Developing back-end database interfaces for archiving the news
(3) Developing a browser-based user interface for News Rack

Integrating RSS news feeds
--------------------------
RSS (Rich Site Summary or RDF Site Summary, depending on the version) is
widely used to push content from websites to users, in contrast to the
earlier model where users visited websites.  With RSS, a user installs an
RSS news reader on her computer and subscribes to several news feeds;
updates from all of those websites then appear in a single window, without
her having to visit each site.  So, whereas users previously went to
websites for updates via browsers, which understand HTML, updates can now
come to users via RSS readers, which understand RSS.

At present, among all Indian newspapers available on the web, only Rediff
News and Indian Express appear to provide RSS feeds.  If you know of other
Indian newspapers that provide RSS feeds, please let me know.

I have now integrated RSS news feeds into News Rack, using the publicly
available RSS4J Java API to parse the feeds.  With this integration, News
Rack parses an RSS feed, downloads the individual news items, extracts the
text content from the HTML, and passes each news item through the news
filtering mechanism that I described in my previous posting.  The filtering
mechanism selects relevant news items and classifies them into user-defined
categories.
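
To make the first steps of this pipeline concrete, here is a minimal sketch
in Java.  It is only an illustration and is not News Rack's code: instead of
RSS4J it uses the XML parser bundled with the JDK, it assumes an RSS
2.0-style feed with <item>, <title>, and <link> elements, and the names
RssSketch, FeedItem, parseFeed, and stripTags are hypothetical.  Downloading
the linked pages and the filtering step are left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical names throughout; this is an illustration, not News Rack code.
class RssSketch {

    // One feed entry: the headline and the URL of the full article.
    static final class FeedItem {
        final String title;
        final String link;

        FeedItem(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    // Parses RSS 2.0-style XML and returns one FeedItem per <item> element.
    static List<FeedItem> parseFeed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<FeedItem> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                result.add(new FeedItem(childText(item, "title"),
                                        childText(item, "link")));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Crude text extraction: drops anything that looks like a tag and
    // collapses whitespace.  Real newspaper pages need more care than this.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

Each FeedItem's link would then be fetched, the page passed through
something like stripTags, and the resulting text handed to the filters.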

Newspapers without RSS news feeds
---------------------------------
The vast majority of newspapers still do not provide RSS feeds.  For these
websites, I will develop a utility that downloads news items directly from
the newspaper's site.  I will work on this in the coming month, after I
have several other pieces in place (the archival mechanism and user
interface described below).

Back-end news archiving mechanism
---------------------------------
I have also been developing a generic database interface so that the tool
can work with MySQL, with Pantoto (www.pantoto.com), or with a regular file
system of the underlying OS.  News that is downloaded, filtered, and
classified is archived using these database interfaces.

At present, I am developing a flat-file archival system.  Once I have a stable
system in place, I will develop an archival interface for using Pantoto.
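
One way to picture such a generic interface with a flat-file back end is
sketched below in Java.  This is an assumption-laden illustration, not the
actual News Rack design: the names NewsArchive and FlatFileArchive, the
three methods, and the layout of one directory per category with one text
file per news item are all hypothetical.  A MySQL or Pantoto back end would
be another class implementing the same interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical storage abstraction: callers never see which back end is used.
interface NewsArchive {
    void store(String category, String itemId, String text);
    String load(String category, String itemId);
    List<String> list(String category);
}

// Flat-file back end: one directory per category, one text file per item.
// Category and item names are used as path components without sanitizing,
// which a real system would have to handle.
class FlatFileArchive implements NewsArchive {
    private final Path root;

    FlatFileArchive(Path root) {
        this.root = root;
    }

    @Override
    public void store(String category, String itemId, String text) {
        try {
            Path dir = Files.createDirectories(root.resolve(category));
            Files.write(dir.resolve(itemId + ".txt"),
                        text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String load(String category, String itemId) {
        try {
            byte[] bytes = Files.readAllBytes(
                root.resolve(category).resolve(itemId + ".txt"));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the ids of all items filed under a category, sorted by name.
    @Override
    public List<String> list(String category) {
        List<String> ids = new ArrayList<>();
        Path dir = root.resolve(category);
        if (!Files.isDirectory(dir)) {
            return ids;
        }
        try (Stream<Path> files = Files.list(dir)) {
            files.map(p -> p.getFileName().toString())
                 .filter(name -> name.endsWith(".txt"))
                 .map(name -> name.substring(0, name.length() - 4))
                 .sorted()
                 .forEach(ids::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }
}
```

With this shape, the filtering code would only call store() on a
NewsArchive, so switching from flat files to a database would not touch it.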

Browser-based user interface
----------------------------
Thus far, I have only described the different pieces that make up News Rack
for a single-user system.  A single user can set up a user profile by
specifying news sources, concept definitions, and filtering rules for
news categories.

However, there needs to be a user subscription management system in place
when there are multiple users of the system.  Currently, I am developing
this interface using HTML and Java Servlets.  By registering with News Rack,
users will be provided with personal user space for monitoring news.  These
users can choose to make the news archives they generate publicly available.
News Rack will also provide a facility wherein casual visitors can search
or browse through news archives that registered users have made public.

Next steps
----------
In the coming month, I will be focusing on two parts of News Rack:
- Developing the news archival mechanism.
- Developing the user interface and the user subscription management system.

At this time, user profiles can only be specified via XML files.  While the
format is fairly straightforward, a graphical interface for specifying
profiles is still needed.

Once I have a minimal, but complete, News Rack system in place, I will
focus on:
- An interface for archiving news within Pantoto and MySQL
- Downloading news from newspapers without RSS feeds
- A graphical interface for specifying user profiles

I will make the News Rack system available online once the minimal system
is developed and stable.  This should be available by mid-August, if not
by late July.




