[Reader-list] NewsRack: FLOSS fellowship report

Subramanya Sastry sastry at cs.wisc.edu
Mon Oct 24 01:12:05 IST 2005


Hi there,

This is my final (longish) report to mark the end of my 6-month FLOSS
fellowship.  This report assumes some passing familiarity with what
the NewsRack project is about.  If you are unfamiliar with it, please refer
to the Abstract and Background sections at: 
    http://mail.sarai.net/pipermail/prc/2005-April/002978.html

For potential users,           sections 2, 5, 6 might be of interest.
For potential developers,      I hope you read all of this.
For the technically-minded,    sections 1, 3, 4, 5 might be of interest,
                               but I hope you will read through section 6.
For the sociologically-minded, section 6 might hopefully be of some interest.

Subbu.
###########################################################################
         News Rack: Automating News Gathering and Classification
         -------------------------------------------------------

Brief summary of work
---------------------
Over the last six months, I have primarily worked on:
- fixing some critical bugs and making NewsRack stable.
- providing useful new features (output RSS feeds, news classification
  from archives).
- improving the user interface.
- ways to access news sites that do not provide RSS feeds.
- continuing to talk about NewsRack and provide demos at various forums.

Overview of this report
-----------------------
In this report, I will:
1. present an overview of NewsRack
2. present its current status in terms of usability
3. present some technical details about the underlying design
4. present some challenges that need to be addressed
5. present an outline of work in the future
6. discuss the "politics" of NewsRack
7. summarize feedback I have received from various people

--------------------
1. NewsRack Overview
--------------------
In this section, I will provide an overview of NewsRack at a conceptual
level, first by providing a model of the news monitoring problem, and
then, by providing a model of NewsRack's solution.

1.1 Conceptual model of the news monitoring problem
---------------------------------------------------
                  
                  +----------------------+
    ALL           |                #     |
 PUBLISHED  ----> |              #### <--+-----  Desired News
    NEWS          |  ***        #####    |
                  |  *** <------####-----+-----  Desired News
                  |  ***         #       |
                  |        oo            |
                  |        ooo <---------+-----  Desired News
                  |         oo           |
                  +----------------------+

     FIG 1: Conceptual model of the news monitoring problem

If the box represents the universe of all published news, the various
blobs in the box represent the desired news subsets of various users.
So, at this conceptual level, the problem of news monitoring reduces to
- how one discovers published news,
- how users specify what subsets they are interested in, and 
- how these subsets are identified.

1.2 Discovery of published news
-------------------------------
Broadly, there are two approaches (that I know of) to discovering published news:
1. One way is to crawl news sites. Ex: GoogleBot, MsnBot, YahooSlurp, etc.
2. The other way is to rely on news feeds. Ex: RSS, Atom.

At this time, NewsRack only uses RSS news feeds to discover the universe
of published news.  This will not discover all published news; it is
restricted to those newspapers that provide feeds.  Here is a list
of Indian newspapers that provide RSS feeds at this time: The Hindu,
Hindu Business Line, Telegraph, Times of India, Economic Times, Indian
Express (Front Page only), Express India, Rediff.com, Sify.com, IndiaInfo.com,
NewKerala.com, India Together (the last six are online publications).
The situation has changed significantly in the last year and it is expected
that in the coming year, more newspapers will start providing RSS feeds.

However, if implemented, NewsRack could also discover published news via
crawling of newspaper sites.  Section 5.2 addresses this aspect in greater
detail.

1.3 Identifying desired subsets
-------------------------------
There are different ways in which subsets of news can be discovered.
- Automated clustering techniques: Using these, a tool could automatically
  create different clusters of related news.  Vivisimo (vivisimo.com) is
  a search engine where automated clustering is used to collate search
  results.
- User-guided clustering: Here, a user can provide input to learning
  algorithms (ex: Bayesian classification) by dropping sample news articles
  in different bins.  Thunderbird's spam-detection uses this technique.
- Filtering Rules: Here, users write filtering rules to pick up desired 
  news.  Setting up filters in email clients for routing email to different
  folders is an example.

NewsRack relies on user-specified filtering rules for selecting the desired
subsets of news.  In Sections 6.2 and 6.5, I will discuss the merits of
the filtering-rule approach (over machine learning techniques).

To summarize, the conceptual model of NewsRack is shown in FIG 2.

                  +--------------+
     INPUT   ---->|              |---->   FILTERED
      NEWS   ---->| NEWS FILTERS |---->   NEWS
     FEEDS   ---->|              |---->   FEEDS
                  +--------------+
(Published News)                       (News Subsets)
             
          FIG 2: NewsRack conceptual model

1.4 User Profiles
-----------------
In NewsRack, the user writes a profile in which (s)he specifies the news
sources to monitor, what news clippings to select and how to classify them. 
I will not dwell at length on the specification format in this report, but
just present the key idea.  

The heart of filtering rules in NewsRack is the idea of a "concept".
Technically, a concept is simply a collection of keywords.  
    Ex: [india] = "india", "hindustan", "bharath"
Semantically, a concept attempts to encapsulate various things:
- Spelling variations
    Ex: [nagapatnam] = "nagapatnam", "nagapattinam", ...
- A conceptual idea: 
    Ex: [rehab] = "rehabilitation", "resettlement", "R&R", ...
- Collections: 
    Ex: [narmada-dams] = "sardar sarovar", "maheshwar", "jobat", ...
- Cross-language ideas: 
    Ex: [water] = "water", "paani" (Hindi), "neeru" (Kannada), ...
    (in a real profile, "paani" would be written in Hindi script and
     "neeru" in Kannada script.  More on this in Section 5.3)
-  ...

This model simplifies filtering rules, eases maintenance, captures
personal perception of an issue, and enables knowledge sharing across
users.  I will elaborate on some of this in Section 6.
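
As a rough illustration of the idea, a concept can be modelled as a named
collection of keywords that is matched against text as a unit.  The sketch
below is in Python and is not NewsRack's own code (NewsRack is a Java
application); the Concept class and its matches() method are hypothetical
names introduced only for this example, and plain substring matching is
an assumption made for brevity.

```python
# Illustrative sketch only: NewsRack itself is a Java application and its
# real data structures are not shown in this report.
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A named collection of keywords (hypothetical representation)."""
    name: str
    keywords: tuple

    def matches(self, text: str) -> bool:
        """True if any keyword occurs in the text (case-insensitive)."""
        lowered = text.lower()
        return any(keyword.lower() in lowered for keyword in self.keywords)


# Concepts taken from the examples in this section.
india = Concept("india", ("india", "hindustan", "bharath"))
rehab = Concept("rehab", ("rehabilitation", "resettlement", "R&R"))

headline = "Resettlement of families displaced by the dam is delayed"
print(india.matches(headline))  # False
print(rehab.matches(headline))  # True
```

Because a rule refers to [rehab] rather than to individual keywords, adding
a new spelling or synonym to the concept updates every rule that uses it.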

---------------------------------------------------
2. Current status of NewsRack in terms of usability
---------------------------------------------------
As it stands today, NewsRack can be deployed as a web-based service with
multiple users using the service or as a standalone tool on one's desktop.
This distinction is somewhat cosmetic, because both uses require a web server
over which NewsRack runs.  In the web-based service incarnation, the
installation will be on a web-server that is accessible on the internet.
However, in a standalone-tool incarnation, the web-server serves pages
locally.

Currently, there is only one public installation at
       http://floss.sarai.net/newsrack

However, its primary drawback is the lack of a friendly/intuitive interface
for non-technical users to specify profiles for news filtering.

During the last 6 months, I have made some progress in identifying approaches
towards designing a user interface that makes it possible to develop
profiles easily, and also to improve collaboration between different users
of the system.

The current version has been tested on Resin 2.1.16 and Tomcat 5 web servers
on Linux and Windows XP systems.  The software has not been tested on other
combinations of web servers and operating systems.

--------------------------------
3. Technical details of NewsRack
--------------------------------
Once again, I will not dwell too much here because much of this has
been covered in my postings over the two terms of my fellowship.

3.1 NewsRack: J2EE application
------------------------------
Briefly, NewsRack is a J2EE application and utilizes the Struts and
VelocityMacro tools.  At this point, the backend relies on the underlying
file system for storage of the news.  News indexes and the user database
are stored in XML files.  This is not a very scalable solution and in the
future, this will be converted to a MySQL backend. 

It relies on other FLOSS software besides Struts and Velocity.  It uses
the RSSLib4J and Rome RSS parsing APIs, and the JFlex scanner generator.
In the future, one of RSSLib4J or Rome will be retired; both are in use
because, as of April 2005, neither library alone provided all the features
utilized by NewsRack.

3.2 Implementation of news filtering
------------------------------------
For every issue that the user has defined, NewsRack examines all the
filtering rules, collects all concepts that have been used in the profile,
and then generates a lexical analyzer (or scanner) using JFlex.  Downloaded
news articles are then passed through this scanner to identify all concepts
that are present in the article.  By analyzing these concepts and their
frequency, the news article is then assigned to one or more categories
based on the filtering rules that match.  At this time, the concept analysis
and rule matching algorithm is somewhat rudimentary and it can be refined
and extended over time.
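
The steps above (collect concepts, build a scanner, count concept
occurrences, match rules) can be sketched as follows.  This is only an
illustration in Python: a compiled regular expression stands in for the
JFlex-generated scanner, and the rule format (a minimum count per concept)
and all names such as build_scanner and classify are assumptions made for
this example, not NewsRack's actual rule syntax or algorithm.

```python
# Illustrative sketch only; NewsRack generates a JFlex scanner in Java.
# Here a single compiled regular expression stands in for that scanner.
import re
from collections import Counter

# Hypothetical concepts, reusing examples from Section 1.4.
CONCEPTS = {
    "india": ["india", "hindustan", "bharath"],
    "rehab": ["rehabilitation", "resettlement"],
    "narmada-dams": ["sardar sarovar", "maheshwar", "jobat"],
}

# Hypothetical rules: a category matches when every listed concept occurs
# at least the given number of times.
RULES = {
    "dam-rehabilitation": {"narmada-dams": 1, "rehab": 1},
    "india-news": {"india": 2},
}


def build_scanner(concepts):
    """Compile one pattern covering every keyword of every concept."""
    keyword_to_concept = {}
    for name, keywords in concepts.items():
        for keyword in keywords:
            keyword_to_concept[keyword.lower()] = name
    # Longest keywords first so multi-word keywords are tried first.
    alternatives = sorted(keyword_to_concept, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, keyword_to_concept


def count_concepts(text, scanner):
    """Count how often each concept occurs in the text."""
    pattern, keyword_to_concept = scanner
    counts = Counter()
    for match in pattern.finditer(text):
        counts[keyword_to_concept[match.group(0).lower()]] += 1
    return counts


def classify(text, scanner, rules):
    """Return the sorted list of categories whose rule is satisfied."""
    counts = count_concepts(text, scanner)
    return sorted(
        category
        for category, required in rules.items()
        if all(counts[concept] >= minimum
               for concept, minimum in required.items())
    )


scanner = build_scanner(CONCEPTS)
article = (
    "Resettlement of families displaced by the Sardar Sarovar dam "
    "remains incomplete, activists in India said."
)
print(classify(article, scanner, RULES))  # ['dam-rehabilitation']
```

The article mentions [india] only once, so the hypothetical "india-news"
rule, which asks for two occurrences, does not fire.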

In the earlier stages of development, NewsRack also supported JavaCC,
but at this time, support has been dropped for it because experimentation
showed that JFlex-generated scanners are faster and more compact than
JavaCC-generated scanners.

3.3 Extracting news content from HTML source of a news item
-----------------------------------------------------------
The crucial component for the accuracy of news filtering is the ability
to extract news content from the HTML source of a news item.  This is because
most news items are presented on a page that also carries links to
other/related news, as well as advertisements (Google ads or others).  If all
this extraneous text is not removed, it will lead to "false" matches of the
news item against concepts.  So, this component of NewsRack -- extraction of
news content from the HTML source -- is central to the accuracy of
classification.  While this works quite well in general, there are several
potential problems.  The first problem is somewhat cosmetic, but the next two
are more serious and affect classification.

1. During the process of extracting news content, spurious characters were
   getting through due to a problem with the Swing HTML parser.  Jaikishan
   Jalan, a student from Indian School of Mines, Dhanbad, worked on this
   problem and helped fix it.  Additionally, he pointed me to
   a different HTML parser (HTMLParser on Sourceforge) which does a better
   job of parsing HTML pages.

2. Several news items of Times of India are split across pages, i.e. at the
   bottom of the first page, there is a link to navigate to the next page
   of the news item.  The current news-content-extraction technique is not
   smart enough to recognize this break in flow and automatically fetch the
   next page.

3. The current news-content-extraction works by "blindly" throwing away all
   content found as part of anchor tags (<a> .. </a>).  This works well with
   most "conventional" news items wherein there is no hyperlinking in the body
   of the news item.  However, there are also several online newspapers (CNet,
   for example) where there is heavy hyperlinking in the body of the news item.
   The current technique will end up throwing away all this hyperlinked text
   which will then compromise the accuracy of news classification.  So, this
   technique will not work very well across all kinds of HTML pages -- the
   current technique is tailored to work with conventional news sources.  But,
   this is one area where there is scope for improvement to make NewsRack work
   well with different kinds of news sources.
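
The anchor-tag technique described in item 3, and its weakness, can be
illustrated with a short sketch.  This is Python using the standard
library's html.parser, not NewsRack's Java extractor; the class name
AnchorStrippingExtractor and the two sample pages are invented for this
example.

```python
# Illustrative sketch only; NewsRack's extractor is written in Java.
from html.parser import HTMLParser


class AnchorStrippingExtractor(HTMLParser):
    """Collect page text while discarding everything inside <a> ... </a>."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside an anchor tag.
        if self.anchor_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the page text with anchor content removed."""
    parser = AnchorStrippingExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed anchors.
    return " ".join(" ".join(parser.chunks).split())


conventional = '<p>Floods hit the district.</p><a href="/more">Related news</a>'
hyperlinked = '<p>The <a href="/x">new processor</a> ships in May.</p>'

print(extract_text(conventional))  # Floods hit the district.
print(extract_text(hyperlinked))   # The ships in May.
```

The second page shows the problem: the hyperlinked phrase in the body is
lost, so a concept keyed on it would never match.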

-------------
4. Challenges
-------------
There are continuing challenges, both in the further development
of the tool and in its adoption and usage by others. 

4.1 User interface
------------------
This aspect was covered earlier.  The user interface is better and easier
to use than last year.  It is now possible to examine profile files of other
registered users, copy desired files, and edit them suitably to develop
a new profile.  Such sharing is actively encouraged within NewsRack since
all profiles are, by default, public.

However, as explained in Section 2, work is in progress to make this process
more friendly and non-technical.

4.2 Copyright issues
--------------------
While there can be potential problems with creating local copies of
news clippings without getting permission from newspapers, by highlighting
the fair-use policy and the not-for-profit motive of this tool, this
problem can be avoided.

4.3 Bandwidth and disk usage
----------------------------
Right now, across all (24) registered users on http://floss.sarai.net/newsrack,
about 230 RSS feeds are monitored.  The average daily download is of the order
of 40 MB which translates to about 1GB download per month.  
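
The monthly figure follows from simple arithmetic, shown here as a small
Python check.  The 30-day month is an assumption, and the per-feed average
is derived from the numbers above rather than measured.

```python
# Back-of-the-envelope check of the figures quoted above.
daily_download_mb = 40   # "of the order of 40 MB" per day
feeds_monitored = 230    # RSS feeds across all registered users
days_per_month = 30      # assumption: a 30-day month

monthly_download_mb = daily_download_mb * days_per_month
monthly_download_gb = monthly_download_mb / 1024
per_feed_daily_mb = daily_download_mb / feeds_monitored

print(monthly_download_mb)            # 1200
print(round(monthly_download_gb, 2))  # 1.17
print(round(per_feed_daily_mb, 2))    # 0.17
```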

1GB is also the average disk usage per month.  But, once a few more details
and bugs are ironed out of the backend, the original HTML files downloaded
from news sites can be deleted bringing down the (uncompressed) storage
requirements to about 100 MB a month.  With compression and on-demand
decompression of most-accessed news items, this disk usage can be
brought down further.

In future, as more newspapers provide RSS feeds, or news sites are crawled,
the bandwidth and disk usage will continue to grow.  At the current level of
monitoring about 15 different news sources (with about 230 RSS feeds),
the disk usage would be about 50MB a month.  The more critical resource
usage would be internet bandwidth usage of about 1GB a month.  This will
be the biggest hurdle for this tool to be installed widely by individuals
or small resource-crunched organizations.  However, the resource usage is
not yet exorbitant for larger institutions to install them and provide
access to others over the Internet.  Additionally, the number of news sources
monitored can be adjusted to match resource availability.  

The current design of NewsRack leaves room for a future capability in which
different installations of NewsRack collaboratively download content
without duplication.  These different installations would then act
as a larger decentralized NewsRack installation.  But, this is still some
time away in the future.  This development would further bring down the
resource usage of any one installation.

4.4 Scalability
---------------
The primary upcoming technical challenge that I can foresee is one of
scalability.  I mentioned that with further refinement of the back-end,
disk usage can be reduced significantly.  Over the last 6 months, I have
continued to refine the back end to make the process of download and
filtering more efficient than it was.

Additional challenges of scalability will crop up with the management
of the backend archive.  While the current backend relies on XML files
for storing news indexes, this is not a very scalable solution because
of the parsing overheads associated with XML.  A move to a database-backend
like MySQL/Postgres will address this technical problem.

Additionally, all user information is currently maintained "in-memory".
This is not a problem currently.  But, as the number of registered and
active users continues to grow, some degree of redesign is needed to read
in user information "on-demand" from the backend (MySQL/XML-files).

A further challenge of scalability is in terms of navigability of classified
news for categories that have a huge number of classified articles.  While
this is not an issue of resource constraint, it is an issue of accessibility.
This can be addressed by the provision of a Lucene-backed search capability
over classified news -- this search has to work on multiple axes (title,
news source, date, relevance, and other possible metrics) so that a subset of
the classified news can be quickly accessed.
 
--------------
5. Future Work
--------------
There are a number of directions in which this project can go, both based on my
own interest/understanding of the utility of this tool, as well as based on
feedback and feature requests I have received from others.

5.1 User Interface
------------------
The primary task will be to develop a user interface to aid creation of
profiles and to facilitate collaboration.  I have a basic design worked
out for this user interface.  I am omitting the technical details of that
design in this email.

5.2 Crawler-based news gathering for sites without RSS feeds
------------------------------------------------------------
I have done some work on writing crawlers for sites that don't provide RSS
feeds.  I spent some time writing a Perl script for developing a generic
crawler.  But, that has not gone very far because newspapers like
'Indian Express', 'Times of India', 'Hindustan Times' use database-backed
sites and the site/URL structure does not reflect news published on
a particular day.  Additionally, the web servers installed on these
newspaper sites do not provide the 'Last-Modified' time which makes it
hard to download news published in the previous 24 hours.
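
To make the role of the 'Last-Modified' header concrete, here is a small
Python sketch of the check a generic crawler would like to make.  It is not
the Perl script mentioned above; the function name published_recently and
the sample header values are hypothetical.  When the header is missing, as
on the sites named above, the function can only answer "unknown".

```python
# Illustrative sketch only; this is not the Perl crawler from the report.
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def published_recently(last_modified, now, window=timedelta(hours=24)):
    """Return True or False if the header settles it, or None if it cannot."""
    if not last_modified:
        return None  # header absent: the crawler cannot tell
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return None  # header present but not a parseable HTTP date
    if modified.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        modified = modified.replace(tzinfo=timezone.utc)
    return timedelta(0) <= now - modified <= window


now = datetime(2005, 10, 24, 1, 0, tzinfo=timezone.utc)
print(published_recently("Sun, 23 Oct 2005 18:30:00 GMT", now))  # True
print(published_recently("Thu, 20 Oct 2005 09:00:00 GMT", now))  # False
print(published_recently(None, now))                             # None
```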

However, it is possible to develop site-specific news crawlers.  I have
a script that works for The Hindu.  However, by the time I got this Perl
script in place, The Hindu had started providing RSS feeds.  But, this script
is nevertheless useful in gathering news from The Hindu's archives.  I have
already used this feature because The Hindu does not seem to be updating its
RSS feeds every day.  

Additionally, from a developer point of view, it is possible to write
these site-specific crawlers in languages other than Java (PHP, Perl,
Python, etc.) as long as the news is downloaded in the appropriate
directories, and as long as the index file is written in the proper XML
format.  This provides one clean way for various people to contribute to
NewsRack -- by developing site-specific crawlers for different newspapers.
The developed crawlers can be "plugged into" NewsRack thus making
available news from those sites to users.

5.3 Ability to monitor news in other Indian languages
-----------------------------------------------------
Since NewsRack is written in Java, it supports Unicode.  So, it is possible
(as of today) to create profiles where all names and keywords are in Hindi.
In fact, it is also possible to write a concept where each keyword is in
a different language.  For example, a concept [india] could have the keyword
"india" in English, "hindustan" in Hindi, "bharata" in Kannada, etc.  Thus,
an English language news item, a Hindi language news item, and a Kannada
language news item can all be classified in the same category.  It is
this capability that demonstrates the real power of concept-based (as
opposed to keyword-based) news filtering.
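
A cross-language concept of this kind can be sketched in a few lines.  The
example is in Python rather than Java, and the native-script spellings and
sample headlines are assumptions added for illustration (the report gives
the keywords only in Latin script).  Text is normalized to Unicode NFC
before comparison so that equivalent encodings of the same characters
compare equal.

```python
# Illustrative sketch only. The native-script spellings below are
# assumptions added for this example.
import unicodedata

INDIA = (
    "india",        # English
    "हिन्दुस्तान",     # Hindi, "hindustan"
    "ಭಾರತ",         # Kannada, "bharata"
)


def normalize(text):
    """Normalize to NFC and fold case so comparisons are consistent."""
    return unicodedata.normalize("NFC", text).casefold()


def mentions(concept_keywords, text):
    """True if any keyword of the concept occurs in the text."""
    folded = normalize(text)
    return any(normalize(keyword) in folded for keyword in concept_keywords)


headlines = [
    "Monsoon arrives early in India",
    "हिन्दुस्तान में बाढ़",
    "ಭಾರತ ಸರ್ಕಾರ",
]
print([mentions(INDIA, headline) for headline in headlines])
# [True, True, True]
```

All three items land in the same category because the rule refers to the
concept, not to any one language's keyword.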
   
But, this is unlikely to get off the ground right away for two reasons:
(a) I am not sure what encoding is used by different newspapers.
(b) It is unlikely that these papers provide RSS feeds.  So, support for
    these papers also requires non-RSS based news gathering to be in place.

So, this aspect of NewsRack requires an investigation into questions of
encoding, and writing a site-specific crawler for each such newspaper. 

5.4 Laundry list of other work to be done
-----------------------------------------
. Providing a MySQL backend
. Investigating issues of scalability and beginning to address them
. Providing the search feature (Lucene indexing)
. Providing features for managing classified news (deleting news,
  adding missing news, ...)
....

-------------------------
6. "Politics" of NewsRack
-------------------------
It is perhaps somewhat of an obvious statement to make that software is
born out of ideas, and that those ideas are rooted in politics.  This is
a playful (but not necessarily frivolous) exercise trying to explicate the
ideas that have shaped the design of NewsRack and the resultant (intended
or unintended) politics of NewsRack as a tool.

6.1 Sharing and Collaboration
-----------------------------
NewsRack incorporates the idea of sharing and collaboration at multiple
levels.

1. Firstly, as a FLOSS application, it is freely sharable, and it is
   possible to collaborate on the development.  Hopefully, the collaboration
   part of this will happen sooner rather than later :-) I am the sole developer at
   this point.

2. Secondly, at the level of developing profiles, NewsRack is designed to
   encourage collaboration and knowledge-sharing.  All data (in the form of
   concepts, news sources, filtering rules) are made public by default.
   As of today, NewsRack does not yet implement any other sharing policy.
   But, that may change in the future.  Most existing profiles on the
   Sarai NewsRack installation have been developed by copying and suitable
   editing.

3. Thirdly, at the level of NewsRack output, the categorized news is free
   for all.  Anyone visiting (or stumbling into) the site can browse news
   that has been categorized by various users.  This aspect of NewsRack is
   readily exercised all the time.  Taken together with the practice of
   copy-edit profile development, NewsRack can be loosely said to enable
   practices of open content.

4. Fourthly, the current design of NewsRack retains the potential to
   ease the resource usage of any individual installation by sharing the
   bandwidth and disk usage across sites.  For example, one installation of
   NewsRack can exclusively take care of downloading and classifying news
   articles from The Hindu, another installation of NewsRack could handle
   articles from BBC, and so on.  While tapping this potential of NewsRack
   requires more work, the potential very clearly exists and has been
   consciously added keeping in mind the resource requirements of this
   tool (Sec 4.3).

Thus, the values of openness, sharing, and collaboration have been right at
the heart of NewsRack's design -- and this has been deliberate from
the beginning.  The primary audience from the beginning has been the
community of NGOs, social activists, academics, and researchers.  From a
very pragmatic point of view (rather than an ideological point of view),
collaboration and openness are fundamental to this tool being useful for
this class of users.

6.2 "Putting the computer in its place": User-control over news monitoring
--------------------------------------------------------------------------
Ours is an era of increasing automation and computerisation, and it is hard
to deny that NewsRack is very much a part of that trend.

I would like to believe that psychologically, as humans, we like to be in
control of situations, and at certain fundamental levels, automation and
computers seem to hit at that psychological necessity.  This is especially
so with things like artificial intelligence, machine learning, intelligent
agents, and the like.  Movies like "I, Robot" and "The Matrix" (Trilogy)
among others are devoted to the exploration of the conflicts given birth by
"machine intelligence".

To the extent we understand the beasts and can learn to control and direct
them to our ends, we feel comfortable and secure in the knowledge that it is
still us that order around the computers.  But, it is also true that there
are levels of control ("My slave might be your master").  The programmer
who builds a piece of software is very much in control.  But, what about
the user of that software?

One can also better appreciate the appeal of open-source software using the
language of control.  Open-source software holds the promise and potential
of control (even if illusory at times) -- if the piece of software does not
behave, I can bring it to heel by hacking it (up) in a way that I think it
should behave.  Note that I am using this language of control not in terms
of control of corporations or governments (which is also very much an issue)
but at a much more intimate and immediate level: as an assertion of my control
over the seemingly-autonomous program.  Among many things, the culture of
piracy via cracking of licences could be seen as being fed by this need to
establish control over (closed-source) software.

There is an increasing trend of behaviour monitoring to better target
information (and products) to us as consumers.  There is something very
uncanny and disconcerting about the fact that more and more aspects of our
lives are getting digitized which are then digested by software to spit out
the next product that we might be interested in.  This is once again a conflict
over control: deciphering behavioural patterns to turn me into a Pavlov's dog
and control my salivation.  Perhaps then, in some perverse sense, it is a
cause for celebration when some spammer gets through the spam filters :-).

It is in this context that I would like to situate NewsRack.  Going back to
Section 1.3, there are different ways in which desired news subsets can be
identified: there are machine learning techniques and there are techniques
based on user-specified filtering rules.  Right at the beginning, I made the
decision to go with filtering rules to let the user specify what news needs
to be monitored, and how it needs to be classified.  While one could use
machine learning techniques to learn from the user's actions (of dropping
articles in different labelled folders) without the need for any filtering
rules, there is something psychologically comforting about filtering rules
and the fact that the damned piece of software is only following my rules.

Of course, all of this discussion should be rightfully seen as my personal
intellectual, psychological, and political baggage that has influenced my
decision to go for filtering rules (besides reasons like my familiarity
with programming languages and compilers and relative unfamiliarity with
machine learning techniques).  There are also some other reasons for
preferring filtering rules -- more in Sec 6.5 below.

Additionally, this is not to mean that machine learning techniques have no
place here.  By a suitable design, they can be made available as optional
components to be turned on by users according to their taste.  But, I still
see them as being useful in an auxiliary capacity to improve the accuracy
of classification as specified by the filtering rules (perhaps by removing
news items of low relevance, or by clubbing together items that are similar).

6.3 NewsRack Profiles as Open-Source programs
---------------------------------------------
This might seem a bit strange, but, all said and done, profiles with
filtering rules are nothing but programs.  NewsRack profiles are simply
logic programs.  Those familiar with Prolog can recognize the similarity.
Membership of a news article in a category depends on whether a logic
clause (corresponding to the filtering rule for that category) is satisfied.
Let us now revisit the various aspects of NewsRack from this perspective
of "Profiles-as-Programs".
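
The "logic clause" view can be made concrete with a small sketch in which
each filtering rule is a boolean clause over the set of concepts found in
an article.  This is an illustration in Python only: the combinator names
(concept, all_of, any_of), the "displacement" concept, and the two rules
are hypothetical and do not reproduce NewsRack's profile syntax.

```python
# Illustrative sketch only: this combinator syntax is hypothetical and is
# not NewsRack's profile format.
def concept(name):
    """Clause that holds when the named concept was found in the article."""
    return lambda found: name in found


def all_of(*clauses):
    """Conjunction: every sub-clause must hold."""
    return lambda found: all(clause(found) for clause in clauses)


def any_of(*clauses):
    """Disjunction: at least one sub-clause must hold."""
    return lambda found: any(clause(found) for clause in clauses)


RULES = {
    "water-privatisation": all_of(concept("water"), concept("privatisation")),
    "dam-displacement": all_of(
        concept("narmada-dams"),
        any_of(concept("rehab"), concept("displacement")),
    ),
}


def categories_for(found, rules=RULES):
    """Categories whose clause is satisfied by the concepts found."""
    return sorted(name for name, clause in rules.items() if clause(found))


print(categories_for({"water", "privatisation", "india"}))
# ['water-privatisation']
print(categories_for({"narmada-dams", "rehab"}))
# ['dam-displacement']
```

Since each clause is built from named concepts, a rule copied from another
user's profile keeps working as long as the concepts it names are copied
along with it.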

Encouraging copying/sharing while writing profiles is equivalent to writing
open-source programs.  As discussed previously, in the current design of
NewsRack, all profiles are, by default, public.  So, all profiles are
thus open-source programs.

It is very well-known that sharing and re-usability (whether in the
open-source or proprietary world) are greatly facilitated by modular design
and programming.  This is no different in NewsRack and profile development.
NewsRack encourages modularity in a couple of different ways.  The idea of
a "concept" in NewsRack is the first level where modularity is achieved.
By defining a set of related keywords as a concept, it is possible to reuse
concepts in a profile at multiple places, and also by different users.

Secondly, it is possible to define collections of concepts, news sources,
or categories.  The benefit of defining collections is that it is now
possible to share entire collections, and use whatever is desired (rather
than copy one concept at a time).  So, for example, it is possible to define
a collection of concepts that represent different countries, different states,
concepts corresponding to IFIs, concepts corresponding to organic agriculture,
and so on.  This level of modularity has already been leveraged at the level
of collections of news sources.  Collections of news sources get copied
'en masse' and users use only those news sources that they are interested in
monitoring.  However, this power present in modular design is unlikely to be
unleashed without a well-designed user interface (discussed in Section 5.1).

So, to summarize this section, NewsRack enables writing of modular open-source
programs for monitoring news.  However, with a well-designed user interface,
a user need not necessarily realize that (s)he is writing a program in
actuality (which can be daunting to some and a rush for others).

6.4 NewsRack profiles as unravellers of personal politics
---------------------------------------------------------
This is a somewhat unexpected side-effect of writing profiles.  This is best
illustrated with an example.  Consider the following concept which yours
truly wrote while developing a profile to monitor tsunami-related news (partly
as a showcase of NewsRack's abilities).
      <concept>
         <name> economic-impact </name>
         <keyword> economic impact </keyword>
         <keyword> economic devastation </keyword>
         <keyword> slowdown in growth </keyword>
         <keyword> financial loss </keyword>
         <keyword> financial losses </keyword>
         <keyword> tsunami losses </keyword>
         <keyword> months to recover </keyword>
         <keyword> crippling his livelihood </keyword>
         <keyword> crippling their livelihood </keyword>
      </concept>
Firstly, of course, is the matter of what is considered to be an economic
impact.  One can read into this definition accordingly.  But, more glaring
are the last two "keywords".  Those two "keywords" betray a potential gender
bias because of the omission of "crippling her livelihood" (I of course
fixed this once I discovered it, but it nevertheless stands out).

As another example, here is a concept from another user's profile:
      <concept>
         <name> backwardness </name>
         <keyword> backwardness </keyword>
         <keyword> illiteracy </keyword>
         <keyword> flood </keyword>
         <keyword> backward </keyword>
         <keyword> kosi </keyword>
         <keyword> Kamla </keyword>
         <keyword> Lalu </keyword>
         <keyword> JD(U) </keyword>
      </concept>
Once again, this concept makes explicit one's personal ideas and notions of
what backwardness means.  This is one very good reason to prefer filtering
rules over machine learning techniques.  For one person, development could be:
      <concept>
         <name> development </name>
         <keyword> software companies </keyword>
         <keyword> infrastructure </keyword>
         <keyword> highways </keyword>
         <keyword> power plants </keyword>
         <keyword> foreign direct investment </keyword>
      </concept>
For another person, development could be:
      <concept>
         <name> development </name>
         <keyword> land reform </keyword>
         <keyword> organic agriculture </keyword>
         <keyword> decentralization </keyword>
      </concept>
While one could get a sense of this from the articles classified into the
various folders by the machine learning algorithm, there is something to be
said for the explicitness of the politics that comes through concepts
and filtering rules.
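
To make the contrast concrete, here is a minimal sketch of how the two
"development" concepts above would treat the same headlines.  It is only an
illustration: the matching function, the headlines, and the plain substring
matching are my assumptions, not NewsRack's actual matcher.

```python
# Hypothetical sketch: NewsRack's real matcher is not shown in this report.
# Each list mirrors the keywords of one <concept> block above.
DEVELOPMENT_A = ["software companies", "infrastructure", "highways",
                 "power plants", "foreign direct investment"]
DEVELOPMENT_B = ["land reform", "organic agriculture", "decentralization"]


def matched_keywords(text, keywords):
    """Return the keywords that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


# Invented headlines, for illustration only.
headlines = [
    "State announces new highways and power plants",
    "Farmers' groups demand land reform and decentralization",
]

for headline in headlines:
    print(headline)
    print("  user A matches:", matched_keywords(headline, DEVELOPMENT_A))
    print("  user B matches:", matched_keywords(headline, DEVELOPMENT_B))
```

Each headline is filed under "development" for one user and not the other,
and the keyword lists show exactly why.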

6.5 NewsRack as a media analysis tool
-------------------------------------
Closely linked with the previous section (concepts as unravellers of personal
politics) is the idea of concepts as unravellers of media politics.

At a very crude level, for any particular issue, the number of articles in
that particular category is a good indication of how much attention the media
is giving that issue.  A quick glance at the main listing of users and issues
at "http://floss.sarai.net/newsrack/Browse.do" shows very clearly that the
India-Pakistan issue and the Tsunami issue have hogged the most media
attention amongst all the issues that NewsRack users have written rules for.

One can then dig into the next level and look at the specific categories
within that issue to see how coverage is distributed among them.  It is
also possible to compare different news publications and see how they cover
the same issue, or to track coverage over time to see how it changes
(especially for issues like the tsunami disaster).
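
The three kinds of analysis described above (by category, by publication,
and over time) all reduce to counting classified articles along different
axes.  A rough sketch follows, assuming each classified article is available
as a (publication, category, month) record; the record layout, publication
names, and category names are invented for illustration and do not reflect
how NewsRack stores its data.

```python
from collections import Counter

# Hypothetical records: (publication, category, "YYYY-MM" of publication).
# All values are invented; NewsRack's storage format is not shown here.
articles = [
    ("Paper A", "relief", "2005-01"),
    ("Paper A", "relief", "2005-01"),
    ("Paper A", "economic-impact", "2005-02"),
    ("Paper B", "relief", "2005-01"),
    ("Paper B", "economic-impact", "2005-03"),
]

# How much attention does each category get overall?
by_category = Counter(category for _, category, _ in articles)

# How does each publication cover each category?
by_publication = Counter((pub, category) for pub, category, _ in articles)

# How does the volume of coverage change from month to month?
by_month = Counter(month for _, _, month in articles)

for month in sorted(by_month):
    print(month, by_month[month])
```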

Of course, before one jumps to pass judgments about newspapers, it is also
important to examine the specific filtering rules used by the user.  One can
tinker with the rules and see how the results change.  Thus, the filtering
rules provide a very concrete basis for verification of claims about media
coverage.

As a very specific example, I have created a profile for monitoring news
about water privatisation.  Initially, I had some straightforward rules
to catch "water", "privatisation" (and their spelling and tense variants).
However, I was told that this is insufficient and that keywords need to be
chosen more carefully.  To quote:
    just water+privatisation will not suffice. Indeed, due to
    the political backlash to privatisation, many agencies do not
    even use that word. So you will find privatisation news where
    "private" word may not figure; instead, you may have water+BOOT,
    or BOT; or, management contract, or lease, or PPP (which expands to
    Public Private Partnership, but they may just use the short form),
    or PSP (Private Sector Participation), or even the more insidious
    "reforms", "restructuring" etc.
Of course, I modified the profile accordingly.  But what this exchange also
indicates is that it is possible to monitor how different newspapers report
on an issue by analysing the concepts/keywords that matched in the
classified news items.  Thus, if chosen appropriately, keywords/concepts/rules
can be put to use as unravellers of the politics of media coverage.
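
As a rough sketch of this kind of analysis, one could tally which of the
terms from the quote above matched in each publication's articles.  The
publication names and headlines below are invented, and the matching rule
(whole words only, acronyms case-sensitive) is my assumption rather than
NewsRack's actual behaviour.

```python
import re
from collections import Counter, defaultdict

# Terms taken from the quoted advice above.
KEYWORDS = ["privatisation", "BOOT", "BOT", "management contract", "lease",
            "PPP", "PSP", "reforms", "restructuring"]


def matched(text):
    """Return the keywords found in text as whole words."""
    found = []
    for kw in KEYWORDS:
        # Acronyms match case-sensitively; other terms ignore case.
        flags = 0 if kw.isupper() else re.IGNORECASE
        if re.search(r"\b" + re.escape(kw) + r"\b", text, flags):
            found.append(kw)
    return found


# Invented (publication, headline) pairs, for illustration only.
items = [
    ("Paper A", "Water privatisation plan draws protest"),
    ("Paper B", "City signs BOT deal for water supply"),
    ("Paper B", "Water sector reforms and restructuring announced"),
]

# For each publication, count how often each term matched.
tally = defaultdict(Counter)
for publication, text in items:
    tally[publication].update(matched(text))

for publication, counts in tally.items():
    print(publication, dict(counts))
```

In this invented data, one paper uses the word "privatisation" while the
other reports the same kind of news only under "BOT", "reforms", and
"restructuring"; this is the sort of difference such a tally would surface.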

Here again, the use of filtering rules (as opposed to machine learning)
comes in handy.

----------
7. Summary
----------
I have received good feedback about NewsRack.  A number of people,
especially researchers, are interested in its potential utility.

I have also heard that having news categorized and stored all in one place,
rather than emailed as news alerts, is itself of some value, both because of
the email deluge and because information received in emails is not easy to
manage.

I will end this report on this note.  I will continue to work on NewsRack and
welcome input into this process from anyone so interested.  I especially
invite technical people to come forward and participate in taking forward
this tool.

It is still somewhat early to say how useful NewsRack will be or how widely
it will be adopted, but hopefully this will become clearer over the next
several months as things improve.

---------------
Acknowledgments
---------------
- Sarai, the institution, for providing me with the fellowship
- Everyone at Sarai for their feedback and vote of confidence in this project
- Mary, Vivek, and Dinesh for their technical inputs and support
- ESG, CED, and CSE for their input about their news monitoring processes
- several others who have used NewsRack or seen presentations/demos about
  NewsRack and given me feedback about it


