[Reader-list] Monthly Posting

S Subramanya Sastry sastry at cs.wisc.edu
Mon May 24 16:37:17 IST 2004


Hello all,

This is my second monthly posting about the work I am doing for the FLOSS
independent fellowship (first posting on reader-list).

Subbu.

#############################################################################   
      
          AUTOMATING NEWS GATHERING AND CLASSIFICATION
          --------------------------------------------

Abstract of proposal:
---------------------
Several organizations in the social development sector monitor news that is
relevant to their work.  This is a time-consuming and laborious process for
some groups, especially when the news is monitored, marked, cut, and filed
using hard copies of newspapers and magazines.  Prior experience with the
press clippings page on www.narmada.org indicates that some of this work can
be automated.  This simplifies the task of news monitoring and also saves
time.

This project attempts to automate news monitoring, and aims to provide tools
for classifying, filing, and long-term archiving of news.  The project will
deliver a tool that can be installed, and will also provide all the same
services on a website for those who do not want to (or cannot) install the
tool.

Current status of project
-------------------------
There are two distinct parts to the problem.  The first is news gathering.
The second is filtering news items and filing them into user-specified
categories, and it is the more difficult of the two.  To try out ideas for
filtering and filing, I have begun experimenting with an existing archive
of narmada news, available at http://www.narmada.org/pressclippings.html
Using this archive, I am experimenting with techniques for automatically
classifying the articles into categories.  The rest of this posting
describes that work.

Example classification structure
--------------------------------
Let us suppose I want to classify news in the narmada pressclippings section
into several narmada categories shown as a tree below:
       narmada dams
          -> sardar sarovar dam
          -> maheshwar dam
          -> other narmada dams
       other dams
          -> tehri dam
          -> koel karo dam
          -> other non-narmada dams
       rehabilitation issues
          -> narmada-specific rehab-issues
          -> tehri-specific rehab-issues
          -> other-dams rehab-issues
       financial issues
          -> project costs
          -> misuse of funds
          -> international institutional funders
          -> indian institutional funders
          -> corporate funding
       narmada court cases
          -> narmada judgements
          -> contempt case
       alternatives
          -> water harvesting
          -> other narmada-specific alternatives
       global linkages
          -> world commission on dams
          -> world bank funded dams

Defining news filters for the above category structure
------------------------------------------------------
As a user of this news filtering and filing system, I will have to specify
rules to classify news into these categories.  For example, let us take the
case of the 'narmada-specific rehab-issues' category.  One possible rule
could be:
       Add article to this category if it contains
            "narmada" and "rehabilitation"
However, the article might never use the term "rehabilitation".  It might
talk about "cash compensation" or about "R&R" instead.  The rule has to
capture all these cases, so it becomes:
       "narmada" and ("rehabilitation" or "R&R" or "cash compensation")
However, there are several other possibilities.  The article might discuss
the plight of "project affected people", or talk about "PAFs", "displaced
people", "oustees", and so on.  As these cases are added, the rule keeps
growing and becomes unwieldy.  The solution is to recognize that the
intent behind the first rule was to capture the concept of rehabilitation
and the concept of a narmada dam.  The specific phrases used might vary
from source to source, author to author, and article to article.  So, if
the system were somehow able to recognize these concepts, the rule would
simplify to:
       [narmada-dam] and ([rehabilitation] or [displacement])
where [narmada-dam] represents any possible phrase used to talk about a
narmada dam, [rehabilitation] represents any possible phrase used to
talk about rehabilitation, and [displacement] represents any possible
phrase used to talk about displacement issues.

However, there is no way for the system to know all the phrases that
are represented by a concept.  The user has to specify this separately.
Thus a different file could specify that the concept [rehabilitation]
covers all the following phrases: "R&R", "R and R", "rehabilitation",
"resettlement", "rehabilitation and resettlement", "cash compensation",
"land-for-land", "master plan", "resettlement and rehabilitation", "NWDTA".
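
As an illustration only, here is a minimal Python sketch of how a concept
defined this way could be checked against an article.  It is not the code
used in this project.  The dictionary layout and the names CONCEPTS,
phrase_pattern, and mentions_concept are assumptions made for this example;
the phrases are the ones listed above.

```python
import re

# Hypothetical in-memory form of a concept definition: concept name -> phrases.
# The phrases are the ones listed above for [rehabilitation].
CONCEPTS = {
    "rehabilitation": [
        "R&R", "R and R", "rehabilitation", "resettlement",
        "rehabilitation and resettlement", "cash compensation",
        "land-for-land", "master plan", "resettlement and rehabilitation",
        "NWDTA",
    ],
}

def phrase_pattern(phrase):
    """Build a case-insensitive regex matching the phrase as whole words."""
    # (?<!\w) and (?!\w) behave like word boundaries around the phrase;
    # \s+ lets the words be separated by any run of whitespace or newlines.
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)",
                      re.IGNORECASE)

def mentions_concept(text, concept):
    """Return True if the text contains any phrase defined for the concept."""
    return any(phrase_pattern(p).search(text) for p in CONCEPTS[concept])
```

With this sketch, mentions_concept("The R&R policy was revised.",
"rehabilitation") is true, while a sentence that uses none of the listed
phrases is not matched.  Whole-word matching is a design choice of the
example; a real filter might also need stemming or weighting.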

2-step process of defining news filters
---------------------------------------
Specifying news filters is therefore a 2-step process:
STEP 1. Concepts are defined along with all the phrases that the concept
        represents.
STEP 2. The concepts are used to compose news filtering rules.

Besides making the news filtering rules clean and easy to understand and
modify, the concepts defined in STEP 1 could be used by others.  That is,
if the concepts I define are made available publicly, others could reuse
those concepts directly without having to go through the trouble of
redefining them.
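
The two steps can be sketched in Python as follows.  This is a simplified
illustration under stated assumptions, not the technique implemented in
this project: rules are written as nested tuples, matching is plain
lowercase substring search, and the phrase lists are small samples taken
from the examples in this posting.  The names CONCEPTS, RULES, concepts_in,
evaluate, and classify are hypothetical.

```python
# STEP 1: concepts (name -> phrases).  These phrase lists are illustrative
# assumptions drawn from examples in this posting, not the real definitions.
CONCEPTS = {
    "narmada-dam": ["narmada", "sardar sarovar", "maheshwar"],
    "rehabilitation": ["rehabilitation", "r&r", "cash compensation"],
    "displacement": ["project affected people", "pafs",
                     "displaced people", "oustees"],
}

def concepts_in(text):
    """Return the set of concept names with at least one phrase in the text."""
    lowered = text.lower()
    # Plain substring matching keeps the sketch short; it is deliberately naive.
    return {name for name, phrases in CONCEPTS.items()
            if any(phrase in lowered for phrase in phrases)}

# STEP 2: rules composed from concepts.  A rule is either a concept name or
# a tuple ("and", r1, r2, ...) or ("or", r1, r2, ...).
RULES = {
    "narmada-specific rehab-issues":
        ("and", "narmada-dam", ("or", "rehabilitation", "displacement")),
}

def evaluate(rule, found):
    """Evaluate a rule against the set of concepts found in an article."""
    if isinstance(rule, str):
        return rule in found
    op, *operands = rule
    if op == "and":
        return all(evaluate(r, found) for r in operands)
    if op == "or":
        return any(evaluate(r, found) for r in operands)
    raise ValueError(f"unknown operator: {op}")

def classify(text):
    """Return the sorted list of categories whose rule matches the text."""
    found = concepts_in(text)
    return sorted(cat for cat, rule in RULES.items() if evaluate(rule, found))
```

Because the rules refer to concepts only by name, the CONCEPTS table could
be replaced by a shared, publicly available one without touching RULES.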

Current status
--------------
At this point, I have developed code to read in concept definitions and
category definitions (which together specify the necessary news filters).
Currently, I am using XML as the specification language.  At a later time,
I will develop a graphical user interface (GUI) to specify these news filters.
The GUI will then generate the necessary XML news filters.
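
To give a feel for what such an XML specification might look like, here is
a hedged Python sketch that reads one.  The element and attribute names
(filters, concept, phrase, category, use, and, or) are invented for this
example and are not the project's actual DTD; the sample is abridged, so
some referenced concepts are not defined in it.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; all element and attribute names are assumptions.
# Abridged: [narmada-dam] and [displacement] are referenced but not defined.
FILTER_XML = """
<filters>
  <concept name="rehabilitation">
    <phrase>R&amp;R</phrase>
    <phrase>cash compensation</phrase>
  </concept>
  <category name="narmada-specific rehab-issues">
    <and>
      <use concept="narmada-dam"/>
      <or>
        <use concept="rehabilitation"/>
        <use concept="displacement"/>
      </or>
    </and>
  </category>
</filters>
"""

def read_concepts(root):
    """Map each concept name to its list of phrases."""
    return {c.get("name"): [p.text for p in c.findall("phrase")]
            for c in root.findall("concept")}

def render_rule(node, top=True):
    """Render a rule element in the bracket notation used in this posting."""
    if node.tag == "use":
        return "[" + node.get("concept") + "]"
    if node.tag not in ("and", "or"):
        raise ValueError("unexpected element: " + node.tag)
    joined = (" " + node.tag + " ").join(render_rule(child, False)
                                         for child in node)
    # Nested sub-rules are parenthesized; the outermost rule is not.
    return joined if top else "(" + joined + ")"

def read_categories(root):
    """Map each category name to its rule (first child element), rendered."""
    return {c.get("name"): render_rule(c[0]) for c in root.findall("category")}

root = ET.fromstring(FILTER_XML)
concepts = read_concepts(root)
categories = read_categories(root)
```

Here categories["narmada-specific rehab-issues"] comes out as
"[narmada-dam] and ([rehabilitation] or [displacement])", the same rule
discussed earlier in this posting.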

I have been experimenting with these news filters on the narmada news
archive.  On the basis of these experiments, I have been modifying the
news filter definitions as well as fine-tuning the news filtering
technique I have implemented.

In the interest of keeping this posting short and readable, I have not
included the news filter specification I am using for narmada news.
On request, I will make available the concept definitions, the category
definitions, or the XML DTD for the news filters (for the technically
minded).

Next steps
----------
In the coming month, I will work on developing a news gathering tool.
I will start with news sources that provide RSS feeds and then move on
to news sources without RSS feeds.  I will also work on filing the
filtered and classified news into Pantoto (http://www.pantoto.com)
which uses MySQL as its underlying database.
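
For sources that provide RSS feeds, the gathering step could start from
something like the Python sketch below.  It is only a rough illustration
of reading items from an RSS 2.0 document.  The sample feed is made up for
the example (it is not real news), the function name parse_rss is
hypothetical, and fetching the feed over the network and filing the
results into a database are left out.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed used only to exercise the parser; not real news.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Placeholder headline one</title>
      <link>http://example.org/news/1</link>
      <description>Placeholder summary one.</description>
    </item>
    <item>
      <title>Placeholder headline two</title>
      <link>http://example.org/news/2</link>
    </item>
  </channel>
</rss>
"""

def parse_rss(xml_text):
    """Return one dict (title, link, description) per RSS <item>."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            # findtext returns the default when the child element is missing.
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        })
    return articles

articles = parse_rss(SAMPLE_RSS)
```

Each resulting dict could then be passed to the news filters and, if it
matches a category, filed under that category.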



