[Reader-list] TRY #3: Fellowship posting #5 & #6 (fwd)

Sun Sep 26 22:31:07 IST 2004

Not sure if there is a problem with mailman and/or some special
character combination is throwing it off, but the 2nd attempt also got
truncated, it appears (based on checking the reader-list web archive).

I am trying again, this time, after removing the abstract completely.

-Subbu.

---------- Forwarded message ----------

This is a combined posting for the last 2 months, and this marks the
end of my 6-month fellowship.  I will be applying for an extension
to continue development of NewsRack.  -Subbu.

###########################################################################
       News Rack: Automating News Gathering and Classification
       -------------------------------------------------------

Overview of this posting
------------------------
While I had planned to have a minimal stable system in place by the
middle of August, I will take another 7-10 days before I can make public
the preliminary version of the service.  I am in the midst of testing
out the current user interface and tool.  I will publicise the URL once
it is ready.

In this posting, I will provide:
(1) a very broad overview of the current capabilities of NewsRack
(2) the technologies involved in developing NewsRack
(3) (ongoing) challenges in developing this
(4) future directions for developing this further
(5) summary of visits to NGOs and discussions about news gathering
    and classification work done at those places

If you are not interested in the technical aspects of NewsRack, you can
skip sections 2 and 3.

-----------------------
1. Current capabilities
-----------------------
As it stands today, NewsRack can be deployed as a web-based service with
multiple users using the service or as a standalone tool on one's desktop.
Note that this distinction is somewhat cosmetic, because both uses require
a web server over which NewsRack runs.  In the web-based service incarnation,
the installation will be on a web-server that is accessible on the internet.
However, in a standalone-tool incarnation, the web-server serves pages
locally.

1.1 Collaborative development of filtering rules
------------------------------------------------
To use NewsRack, a user has to register with the system.  Once registered,
the user has to create a profile for downloading news and classifying it.
This profile (via filtering rules) tells NewsRack (a) what news sources to 
download from (b) what news clippings to select (c) and how to classify the 
selected news items.

The filtering rules are written using a 2-step process of
(1) specifying keywords and associating them with concepts, and
(2) using concepts to compose rules.
By using concepts (as opposed to keywords) in filtering rules, concept
definitions can evolve over time (new keywords added, useless keywords
removed, etc.) *without* having to modify the filtering rules themselves.
This also keeps the filtering rules simple and easy to understand.  For
example, a filtering rule for dam-rehabilitation category could be:
              [dam] AND [rehabilitation]
where [dam] and [rehabilitation] are concepts.  If tomorrow, governments
have a new rehabilitation policy which is referred to in the newspapers
as NRPD (National Rehabilitation Policy for Dams), one could add the
new keyword "NRPD" to the [rehabilitation] concept without modifying
the filtering rule for the dam-rehabilitation category.  The changes
take effect at all places where this concept is referenced.  This ability
simplifies the maintenance/evolution of the filtering rules over time.
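
The indirection described above can be sketched in a few lines.  This is
only an illustration in Python (NewsRack itself is written in Java); the
dictionary layout, the keyword "resettlement", and the function names
concepts_in and matches_dam_rehabilitation are assumptions made for this
example, not NewsRack's actual code.

```python
import re

# Hypothetical concept table: concept name -> set of keywords.
concepts = {
    "dam": {"dam", "reservoir"},
    "rehabilitation": {"rehabilitation", "resettlement"},
}

def concepts_in(text, concepts):
    """Return the names of all concepts with at least one keyword in text."""
    lowered = text.lower()
    found = set()
    for name, keywords in concepts.items():
        for kw in keywords:
            # Match whole words only, so "dam" does not fire on "damage".
            if re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered):
                found.add(name)
                break
    return found

def matches_dam_rehabilitation(text, concepts):
    """The rule [dam] AND [rehabilitation], written once in terms of concepts."""
    return {"dam", "rehabilitation"} <= concepts_in(text, concepts)

article = "Critics say NRPD ignores families displaced by the dam."

# No keyword of [rehabilitation] occurs in the article yet.
before = matches_dam_rehabilitation(article, concepts)

# Extend the concept; the rule itself is untouched.
concepts["rehabilitation"].add("NRPD")
after = matches_dam_rehabilitation(article, concepts)
```

Here the same article fails the rule before the keyword is added and
passes it afterwards, although the rule was never edited.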

The second interesting feature about NewsRack is that all concept definitions
and filtering rule definitions can be shared and extended.  Thus, a pool of
concepts can be collaboratively developed.  For example, if one user has 
defined the concepts [World Bank], [India], [dams], [privatisation], 
another user who wants to monitor news about World Bank projects in India
can use these concepts without having to redefine them, and if necessary
extend those concepts to suit her needs.

Thus, NewsRack allows for knowledge sharing across users and once a critical
knowledge base is in place, any new user should be able to develop his/her 
profile very quickly.

1.2 RSS vs non-RSS news sources
-------------------------------
RSS (Rich Site Summary OR RDF Site Summary, depending on the version) is
being widely used to push content from websites to users in contrast to
the earlier model where users visited websites.  With RSS feeds, a user
can subscribe to several news feeds, install an RSS news reader on her
computer, and updates from several websites are available in a single
window without having to visit several different websites.  So, whereas
users visited websites (for website updates) via browsers which understand
HTML, website updates now come to users via RSS readers which understand RSS.
RSS is most pertinent to sites that have frequent updates (like newspapers).

Right now, NewsRack only supports news sources with RSS feeds.  This was
the easiest way to develop and test the other features of the system.  Once
the current system is deployed, I will work on supporting news sources that
do not provide RSS feeds.  The primary technical challenge here is to
download all news clippings published that day, and to extract date, title,
author information for each clipping.  RSS feeds provide all this information
in a very easy format.  
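
To show how directly an RSS feed exposes this information, here is a
minimal sketch using Python's standard xml.etree.ElementTree module.  The
sample feed and the function name parse_items are made up for the example,
and the sketch only handles un-namespaced feeds (RSS 0.9x/2.0 style); RDF
based RSS 1.0 feeds use XML namespaces and would need extra handling.
NewsRack itself uses a Java library for this (see Section 2.5).

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed, used only for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Daily</title>
    <item>
      <title>Dam height raised</title>
      <link>http://news.example.org/story1.html</link>
      <pubDate>Sun, 26 Sep 2004 10:00:00 +0530</pubDate>
      <author>reporter@example.org (A Reporter)</author>
    </item>
    <item>
      <title>Monsoon update</title>
      <link>http://news.example.org/story2.html</link>
    </item>
  </channel>
</rss>"""

def parse_items(feed_xml):
    """Return one dict per <item>, with None for fields the feed omits."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
            "author": item.findtext("author"),
        })
    return items

items = parse_items(SAMPLE_FEED)
```

Note that the second item has no date or author: a feed is only as
complete as its publisher makes it.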

At this time, Indian Express and Rediff provide RSS feeds.  However, I do not
have full confidence that these feeds cover all news items that are published
on their website.  I am in the process of verifying this.  But, in the future,
all newspapers will likely support RSS feeds.

1.3 Archiving of news clippings
-------------------------------
All selected news clippings are also archived locally.  This archiving is
done by extracting the text-based content of the clipping and stripping
away everything else.  At this time, while this filtering process works
well, the output can be subject to further "beautification" to remove the
still-remaining extraneous text not directly associated with the news content.
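
As a rough idea of what "extracting the text-based content" involves, the
sketch below uses Python's standard html.parser module to keep visible
text and drop markup, scripts, and style sheets.  The class name
TextExtractor and the sample page are assumptions for the example; this is
not NewsRack's own extraction code, and it makes no attempt to remove
navigation text or advertisements.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

page = """<html><head><title>Story</title><style>p {color: red}</style></head>
<body><script>var ad = 1;</script><h1>Dam height raised</h1>
<p>The authority cleared the proposal.</p></body></html>"""
text = extract_text(page)
```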

--------------------
2. Technical details
--------------------
This section can be skipped if you are not interested in the technical
details of NewsRack.

2.1 Using Struts
----------------
NewsRack has been developed in Java using Servlet technology.
It can be installed on any web server that supports Java Servlets.
The application has been developed using the Model-View-Controller (MVC)
design pattern.  Thus, there are 3 separate components to NewsRack
(1) the application model with definitions of Users, Concepts, Categories,
    Profiles, News Items, etc.
(2) the view that provides data presentation and user input, and
(3) a controller to dispatch requests and control flow.

I have used the Struts framework of the Apache Jakarta project to implement 
this MVC pattern.  Using Struts has simplified the development of the user 
interface (the V of MVC).  I have used the Velocity templating engine to
develop the various output screens of NewsRack.

The current version has been tested on the Resin 2.1.12 web server.  But,
the system should also run on Tomcat -- I will test this soon to verify.

2.2 Backend archiving
---------------------
In the current implementation of NewsRack, all backend archiving of news,
user profiles (including concept definitions and filtering rules) is done
using XML files over the file system provided by the OS.  Thus, the back end
can be seen as a simple XML database.

I have tried to implement the database layer as an abstract interface so
that in future, other backend database implementations like MySQL can be
used (which might become necessary as the system evolves).  Ideally, the
MySQL backend can be implemented without serious changes to the other
components of NewsRack.
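
The idea of hiding the backend behind an abstract interface can be
sketched as follows.  This is an illustration in Python, not NewsRack's
Java code; the names NewsStore, InMemoryStore, save_item, and items_in are
hypothetical, and the in-memory backend merely stands in for an XML-file
or MySQL implementation.

```python
from abc import ABC, abstractmethod

class NewsStore(ABC):
    """Hypothetical storage interface; the rest of the code sees only this."""

    @abstractmethod
    def save_item(self, category, item):
        """Store one news item under the given category."""

    @abstractmethod
    def items_in(self, category):
        """Return the list of items stored under the given category."""

class InMemoryStore(NewsStore):
    """Stand-in backend for the example.  An XML-file or MySQL backend
    would implement the same two methods."""

    def __init__(self):
        self._by_category = {}

    def save_item(self, category, item):
        self._by_category.setdefault(category, []).append(item)

    def items_in(self, category):
        return list(self._by_category.get(category, []))

def archive(store, category, title):
    # Caller code depends on NewsStore only, never on a concrete backend.
    store.save_item(category, {"title": title})

store = InMemoryStore()
archive(store, "narmada dam", "Dam height raised")
```

Swapping the backend then means writing one new class; callers such as
archive() stay unchanged.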

2.3 Specification format for the news filter
--------------------------------------------
The format for specifying concept definitions, filtering rules, news sources,
and user profiles is based on XML.  Initial feedback from people who had
not heard of XML or those who are non-techies has been that they can easily
write these XML-based profile specifications.  At this time, I have not
provided any GUI for developing these XML specifications.  One could use
any simple text editor (like Notepad on Windows or vi/vim/emacs on Unix)
or other XML editors to develop these specifications.

As discussed in Section 1.1, the filtering rules are written using a 2-step
process.  An example here will clarify this process.

      Concept definitions
      -------------------
         <concept>
            <name> dam </name>
            <keyword> dam </keyword>
            <keyword> reservoir </keyword>
            <keyword> mega-dam </keyword>
         </concept>

         <concept>
            <name> narmada </name>
            <keyword> narmada </keyword>
         </concept>

         <concept>
            <name> ssp </name>
            <keyword> sardar sarovar </keyword>
            <keyword> ssp </keyword>
            <keyword> sardar sarovar narmada nigam limited </keyword>
            <keyword> ssnnl </keyword>
         </concept>

      Category definitions
      --------------------
         <category>
            <name> narmada dam </name>
            <rule> narmada AND dam </rule>
         </category>   

         <category>
            <name> sardar-sarovar-dam </name>
            <rule> ssp </rule>
         </category>

Thus, the process of modifying concepts is a matter of adding/deleting
or changing existing keywords for that particular concept.  Filtering
rules are simple boolean expressions composed using AND/OR keywords.
Negation support is still sketchy because the semantics of negation
are not clear in this context.  What does it mean to say (NOT dam), 
for example?  In addition, context-based qualification is also supported.
For example, "maheshwar" could be the name of a person or a temple or a
place.  However, if the article talks of dams or about river narmada,
mention of "maheshwar" could be a reference to the Maheshwar dam!  Likewise,
a reference to Ms.Roy could be a reference to Arundhati Roy if earlier
in the article, there are references to Arundhati Roy.  Context-based
qualifications attempt to capture these scenarios.
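
The posting does not show the syntax for context-based qualification, so
the following is only one possible reading of the idea, sketched in
Python.  The table named qualified, the concept name "maheshwar-dam", and
the function names are all assumptions for this example.  The sketch also
ignores word order, so it does not capture the "earlier in the article"
condition of the Ms. Roy example.

```python
import re

# Hypothetical data layout: plain keywords per concept, plus keywords that
# only count when at least one of the listed context concepts is present.
concepts = {
    "dam": ["dam", "reservoir", "mega-dam"],
    "narmada": ["narmada"],
}
qualified = {
    # concept: (ambiguous keyword, context concepts that disambiguate it)
    "maheshwar-dam": ("maheshwar", ["dam", "narmada"]),
}

def has_word(text, word):
    """True if word occurs in text as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def recognize(text):
    """Return the set of concepts recognized in text."""
    found = {name for name, kws in concepts.items()
             if any(has_word(text, kw) for kw in kws)}
    for name, (keyword, context) in qualified.items():
        # The ambiguous keyword counts only when some context concept was found.
        if has_word(text, keyword) and found.intersection(context):
            found.add(name)
    return found

temple = recognize("Pilgrims gathered at the Maheshwar temple.")
project = recognize("Work on the Maheshwar project on the Narmada has resumed.")
```

The first sentence yields no concepts at all, while the second yields
both [narmada] and the qualified [maheshwar-dam] concept.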

2.4 Implementation of news filtering
------------------------------------
For every issue that the user has defined, NewsRack examines all the
filtering rules, and collects all concepts that have been used in the
profile.  NewsRack then generates a lexical analyzer (or scanner) to
recognize the keywords for each concept that has been used.  

NewsRack generates a scanner by generating a scanner specification file
for JFlex, a publicly-available Java-based scanner generator.  NewsRack
also supports JavaCC, a publicly-available Java-based parser generator
that also generates scanners.  However, experimentation shows that
JFlex-generated scanners are faster and more compact than JavaCC-generated
scanners.

When a news article is passed through this lexical analyzer, all keywords 
that are encountered trigger the corresponding concepts to be recognized.
By analyzing all concepts that are recognized and their frequency, the
news article is then assigned to one or more categories based on the
filtering rules that match.  At this time, the concept analysis and
rule matching algorithm is somewhat rudimentary and it can be refined
and extended over time.
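
The pipeline above (scan for keywords, count concepts, match rules) can
be imitated in a few lines.  Note the substitution: instead of a scanner
generated by JFlex, this Python sketch compiles all keywords into one
regular expression.  The function names and the flat rule format (ORs of
ANDs, no parentheses, no negation) are assumptions for the example, not a
description of NewsRack's actual algorithm.

```python
import re
from collections import Counter

concepts = {
    "dam": ["dam", "reservoir", "mega-dam"],
    "narmada": ["narmada"],
    "ssp": ["sardar sarovar", "ssp",
            "sardar sarovar narmada nigam limited", "ssnnl"],
}
categories = {"narmada dam": "narmada AND dam", "sardar-sarovar-dam": "ssp"}

def build_scanner(concepts):
    """Compile one regex for all keywords, plus a keyword -> concepts table."""
    owners = {}
    for name, keywords in concepts.items():
        for kw in keywords:
            owners.setdefault(kw.lower(), set()).add(name)
    # Longest keywords first, so "sardar sarovar narmada nigam limited"
    # wins over its prefix "sardar sarovar".
    ordered = sorted(owners, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    pattern = r"(?<!\w)(?:" + alternatives + r")(?!\w)"
    return re.compile(pattern, re.IGNORECASE), owners

def scan(text, scanner, owners):
    """Count how often each concept is triggered by a keyword in text."""
    counts = Counter()
    for match in scanner.finditer(text):
        for name in owners[match.group(0).lower()]:
            counts[name] += 1
    return counts

def rule_matches(rule, counts):
    """Evaluate a flat rule: OR of AND-groups, no parentheses, no NOT."""
    return any(all(counts[term.strip()] > 0 for term in group.split(" AND "))
               for group in rule.split(" OR "))

def classify(text):
    scanner, owners = build_scanner(concepts)
    counts = scan(text, scanner, owners)
    matched = sorted(c for c, rule in categories.items()
                     if rule_matches(rule, counts))
    return counts, matched

counts, matched = classify(
    "Sardar Sarovar Narmada Nigam Limited said the dam on the Narmada is safe.")
```

In this sketch the word "Narmada" inside the company name is consumed by
the longer keyword and is not counted separately; whether that is the
desired behaviour is a design choice.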

2.5 Support for RSS feeds
-------------------------
Currently, I am using the publicly available RSS4J Java API for parsing
RSS news feeds.  This has been downloaded from SourceForge.  Over this,
NewsRack implements caching to prevent downloading the same article
repeatedly for different users, and across different sessions.
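
The caching idea can be sketched as a small wrapper around whatever
function does the download.  This is a hypothetical Python illustration
(the class name ArticleCache and the stand-in fake_fetch are made up), and
it keeps the cache in memory only; caching across sessions would
additionally require storing the copies on disk.

```python
class ArticleCache:
    """Fetch each URL at most once; later requests reuse the stored copy."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: url -> article text
        self._store = {}         # url -> article text
        self.downloads = 0       # number of real fetches performed

    def get(self, url):
        if url not in self._store:
            self._store[url] = self._fetch(url)
            self.downloads += 1
        return self._store[url]

# Stand-in for a real HTTP download, so the example runs offline.
def fake_fetch(url):
    return "contents of " + url

cache = ArticleCache(fake_fetch)
url = "http://news.example.org/story1.html"
for user in ("user_a", "user_b", "user_c"):
    # Three users' profiles reference the same article.
    text = cache.get(url)
```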

---------------------
3. Ongoing challenges
---------------------
When I started this project, I was not familiar with XML, Java Servlets,
Struts, MySQL, JDBC, or with Java-based web applications.  So, quite a bit
of the last 6 months has been spent getting acquainted with these
technologies, experimenting with them, and proceeding with the development.

3.1 Using XML
-------------
Some of my lack of experience shows in the XML specification for defining
concepts, rules, news sources, and profiles.  For example, the concepts
defined earlier could become less verbose by using attributes as follows:
            <concept name="dam">
               <keyword val="dam" />
               <keyword val="reservoir" />
               <keyword val="mega-dam" />
            </concept>
While the verbosity of the current specification is not a drawback (I was
told by novice XML users that the attribute-less specification is actually
simpler), future extensions could provide support for these less-verbose
specifications.
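
Supporting both styles side by side need not be difficult.  The sketch
below, in Python with the standard xml.etree.ElementTree module, reads a
concept written in either form; the function name read_concept is an
assumption for this example and is not part of NewsRack.

```python
import xml.etree.ElementTree as ET

def read_concept(elem):
    """Return (name, keywords) from either style of <concept> element."""
    name = elem.get("name")
    if name is None:
        name = elem.findtext("name", default="")
    keywords = []
    for kw in elem.findall("keyword"):
        # Attribute style: <keyword val="dam" />
        # Element style:   <keyword> dam </keyword>
        value = kw.get("val")
        if value is None:
            value = kw.text or ""
        keywords.append(value.strip())
    return name.strip(), keywords

verbose = ET.fromstring(
    "<concept><name> dam </name><keyword> dam </keyword>"
    "<keyword> reservoir </keyword><keyword> mega-dam </keyword></concept>")
compact = ET.fromstring(
    '<concept name="dam"><keyword val="dam" /><keyword val="reservoir" />'
    '<keyword val="mega-dam" /></concept>')
```

Both inputs yield the same concept, so existing attribute-less profiles
would keep working if the compact form were added later.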

3.2 Developing a web application
--------------------------------
Initially I started using Servlets and Webmacro to implement the user
interface.  However, I later switched to using Struts to develop
the user interface.  This decision has helped me develop the initial
system much more quickly than would have been possible otherwise.  But,
the user interface development has proved to be much more difficult and
involved than I had imagined when I first began the project.

While I have implemented the back end news archiving as a simple XML
database, I might have to switch to MySQL (or other databases) at a later
point.  At that time, I will familiarize myself with JDBC on a need-to-know
basis.

3.3 Supporting non-RSS based news sources
-----------------------------------------
At first glance, supporting non-RSS sources seems as simple as downloading
all content for that day from a newspaper's website, but things are more
challenging than this.  If I want to save on download bandwidth, I will
have to do more selective downloading.  But, more importantly, in order
to integrate downloaded news within NewsRack, I have to extract date, 
title, author information for each clipping.  While doing this extraction
for any one particular newspaper is a simple matter of writing rules
to recognize these patterns, the harder question is whether there is a
general way of extracting this for all newspapers, or whether custom
patterns would have to be developed each time a new non-RSS news source
is added.  It is in this respect that RSS acquires added importance.  All
this information is readily available in an RSS feed.  In addition,
well-developed RSS feeds can also provide brief abstracts of the news
items, which can prove invaluable in browsing the news archive.
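
The "custom patterns per newspaper" option might look like the following
Python sketch.  The two page layouts, the source names "paper-a" and
"paper-b", and the function name extract_metadata are invented for the
example; real newspaper pages would each need their own hand-written
patterns, which is exactly the maintenance cost discussed above.

```python
import re

# Hypothetical per-source patterns; every site's markup differs, so each
# non-RSS source would need its own entry.  These two layouts are invented.
PATTERNS = {
    "paper-a": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
        "author": re.compile(r'<span class="byline">By (.*?)</span>', re.S),
        "date": re.compile(r'<span class="date">(.*?)</span>', re.S),
    },
    "paper-b": {
        "title": re.compile(r"<title>(.*?)</title>", re.S),
        "author": re.compile(r"<!-- author: (.*?) -->", re.S),
        "date": re.compile(r"<!-- date: (.*?) -->", re.S),
    },
}

def extract_metadata(source, html):
    """Apply the source's patterns; a field is None when its pattern fails."""
    result = {}
    for field, pattern in PATTERNS[source].items():
        match = pattern.search(html)
        result[field] = match.group(1).strip() if match else None
    return result

page_a = ('<h1>Dam height raised</h1><span class="byline">By A Reporter</span>'
          '<span class="date">26 Sep 2004</span>')
page_b = "<title>Monsoon update</title><!-- author: B Writer -->"

meta_a = extract_metadata("paper-a", page_a)
meta_b = extract_metadata("paper-b", page_b)
```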

3.4 Managing the news archives
------------------------------
Once the tool is up and running for a few weeks, the news collection for
any particular category might continue to grow.  At that time, the challenge
will be in terms of presenting these news items to the user in a way that
does not overwhelm him.  Furthermore, support might have to be provided to
refine the classification system, and reclassify on the fly.

3.5 Bandwidth requirement for downloading news
----------------------------------------------
There are a couple of problems with the current monolithic version of 
NewsRack.  Firstly, very few installations will be possible because of 
the bandwidth required to download newspaper content every day.  For example,
with 10 newspapers, it is likely that monthly download might be of the
order of almost 1GB.  It is very likely that only very well-funded 
organizations or organizations/individuals in the US or other developed 
countries could afford the necessary bandwidth.  Public installations
as a web service (as envisaged currently) can help address this problem.

3.6 Challenges with copyright issues
------------------------------------
While there can be potential problems with creating local copies of
news clippings without getting permission from newspapers, by highlighting
the not-for-profit motive of this tool, this problem can be avoided.
Perhaps, others can throw more light on this matter.

----------------------
4. Further development
----------------------
Most immediately, I am working on making public the first release of this
tool which provides the very rudimentary services of filtering for RSS news
feeds.  I am hoping to have this ready within the next 7-10 days.  I will
also put out examples and documentation on the filter specification
process and on using the tool.

After that the most immediate task at hand will be to fix bugs that will
invariably pop up.  

Once the system stabilizes, I will work on providing support for non-RSS
news sources.  At that time, the tool itself will acquire a semblance of
completeness in terms of covering most of the English-language Indian
newspapers.  The other most important feature that needs to be supported
is the ability to search the news archive.

In parallel, I will work with existing NGOs/users to develop a collaborative
library of concepts, and filtering rules.

Beyond this, there are a number of desirable features that will make the
tool very useful.  I list them below without elaborating on them:
- ability to refine/debug the filtering rules (based on looking at all
  unclassified news articles)
- ability to download and classify news from regional language newspapers
- ability to select news from the archive and automatically generate a
  newsletter based on the selected news clippings
- ability to remove an item from a category or move/copy an item from one
  category to another
- ability to reclassify an existing news archive whenever a profile is
  modified
- ability to sort news items on other axes (like date, author, news source)
- ability to decentralize news downloading across multiple installations
  of NewsRack.  I have ideas about how several installations of NewsRack 
  at multiple sites could collectively download all required news content
  such that any individual installation only downloads a fraction of the
  entire content.  When the tool reaches this stage, it might begin to
  resemble a peer-to-peer model of news downloading.
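
The posting does not describe how the decentralized downloading in the
last item would work, so the following Python sketch is purely one
possible approach and not the scheme the author has in mind: each
installation computes, from a hash of the feed URL, which installation is
responsible for downloading that source.  The installation names, the
example URLs, and the function names are all hypothetical, and the
exchange of downloaded content between installations is not shown.

```python
import hashlib

def owner_of(source_url, installations):
    """Pick the installation responsible for downloading source_url."""
    digest = hashlib.sha1(source_url.encode("utf-8")).hexdigest()
    # Sorting makes the choice independent of the order of the list.
    return sorted(installations)[int(digest, 16) % len(installations)]

def my_share(me, sources, installations):
    """Return the sources that installation `me` should download."""
    return [s for s in sources if owner_of(s, installations) == me]

installations = ["site-a", "site-b", "site-c"]
sources = ["http://paper%d.example.org/rss.xml" % i for i in range(10)]
shares = {site: my_share(site, sources, installations)
          for site in installations}
```

Every source ends up with exactly one owner, and each installation can
compute its own share without contacting the others, provided they all
agree on the list of installations.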

At some point in the near future, I am going to register the project on 
SourceForge and invite other developers to join in.

-------------------------------------------------------------------
5. Summary of visits/discussions with NGOs regarding news gathering
-------------------------------------------------------------------
I gained close experience with this process of news gathering when
I maintained the www.narmada.org website for over 2 years.  I still
have a marginal involvement with maintaining the website.  Updating the
press clippings section on this site was one of the most laborious tasks
initially.  At a later time, using 'wget' scripts to download entire content
of newspapers and 'grep'-ping the content for certain keywords, a lot of
this task was simplified, and the manual work came down to about half-an-hour
a day.  Yet, the entire process has been less than satisfactory as can
be seen by the current breakdown in the process of updating the press 
clippings section, several broken links to articles, and lack of any form
of topic-wise/sector-wise classification of articles.

When I spent some time at Environment Support Group, Bangalore, I noticed
that a lot of time used to be spent in this process of collecting news,
marking them, cutting, and filing them.  Besides, there was a perennial
problem of backlog, sometimes resulting in a pile of newspapers that had
to be gone through.

When I visited CED, Bangalore, I found out that they also had a process
for selecting news clippings and filing them.  While they had an electronic
archive, news was added to the electronic archive by downloading articles
(previously marked in the physical version of the newspapers) from the web,
extracting the text content.  There was no automated process here.

I have also had 4-5 people express interest in NewsRack based on my
postings to PRC-list and reader-list.  Shripad from Manthan Adhyayan
Kendra and Himanshu from SANDRP Delhi are others who have expressed
interest in the tool.  They said that they might use NewsRack if it
proves satisfactory and saves them time.

Most recently, when I was in Delhi and visited Sarai, I was quite amused
to see on the whiteboard something to the effect of: "everyone should
spend at least 3 hours every week scanning relevant news clippings and
adding it to the database".

When I visited CSE, I found out that at CSE, over 80 news publications
are monitored every day, over 500 news clippings are processed, and that
they have about 5-8 full time employees just for this purpose!  I was
told that "news collection is the pain point of many organizations".

The scale of operations here was quite fascinating.  Every day, the news
that was identified was then abstracted to generate a daily news digest
that was circulated within CSE.  Furthermore, the classified news was used
to generate various monthly digests -- called the Green Files -- which were
collections of the most relevant and important environment-related news.
The process of selecting news was done on the basis of a keyword thesaurus
that had about 4800 keywords!  This thesaurus has been developed over a
period of almost 20 years.  The thesaurus is the knowledge-base of CSE's
library operations that aids them in selecting news and classifying it
into the various files based on the particular issue that the article
addressed.  The web-people at CSE were curious to see how the tool develops
and felt that it might be useful -- though they were understandably cautious
to see/judge how well the news classification could be automated and how
much work could be saved. 

My gut feeling, based on talking with several people, is that there is
definite value, interest, and curiosity about NewsRack.  A lot will
depend on how easy it is to develop the knowledge base, to specify the
filtering rules, and how well the classification performs.  The rule
specification language itself is pretty straightforward and everyone who
has seen the format has felt comfortable with it.  So, the real test will
be in terms of how well the classification performs, and how much effort
is involved in refining filtering rules.  Deployment, experimentation,
and further refinement of the tool are the way forward from here on.  But,
I am confident that the tool will find its use in this space of news
gathering and classification.