[Reader-list] NewsRack report for June

Tue Jul 12 19:48:28 IST 2005

Here is my report for the month of June.  -Subbu.

               NewsRack: Report for the month of June
               --------------------------------------
Existing implementation: http://floss.sarai.net/newsrack
Browse News            : http://floss.sarai.net/newsrack/Browse.do

Bug fixes
---------
I continued fixing bugs and identified new ones.  I fixed some relatively
serious bugs in the implementation of the back-end news archive.

New features
------------
It is now possible to classify news from the archive.  Until now, news was
added to issues and categories only during the scheduled news download.  So,
if I added a new category or a new issue, it would receive news only from the
day it was added.

I have now added a 'Reclassify News' feature by which it is possible to fetch
news from the archive (selectively, or the entire archive) and use that to
populate an issue or a category.  I had to fix some bugs in the back-end
implementation as well as do some minor redesign to enable greater modularity.
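
To make the idea concrete, here is a minimal sketch of reclassification over
an in-memory archive.  Everything in it is an assumption made for
illustration: the NewsItem and Category classes, the keyword-match rule, and
the reclassify method are hypothetical and are not NewsRack's actual classes
or rule language.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ReclassifySketch {
    /** Hypothetical archived article; not NewsRack's actual class. */
    static final class NewsItem {
        final LocalDate date;
        final String title;
        final String text;

        NewsItem(LocalDate date, String title, String text) {
            this.date = date;
            this.title = title;
            this.text = text;
        }
    }

    /** Hypothetical category: an item matches if any keyword occurs in it. */
    static final class Category {
        final String name;
        final List<String> keywords;
        final List<NewsItem> items = new ArrayList<>();

        Category(String name, List<String> keywords) {
            this.name = name;
            this.keywords = keywords;
        }

        boolean matches(NewsItem item) {
            String haystack = (item.title + " " + item.text).toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                    return true;
                }
            }
            return false;
        }
    }

    /**
     * Runs the category's rule over archived items dated within [from, to].
     * A null bound means "unbounded", so (null, null) selects the entire
     * archive.  Returns the number of items newly added to the category.
     */
    static int reclassify(List<NewsItem> archive, Category category,
                          LocalDate from, LocalDate to) {
        int added = 0;
        for (NewsItem item : archive) {
            if (from != null && item.date.isBefore(from)) continue;
            if (to != null && item.date.isAfter(to)) continue;
            // Skip items the category already holds so reruns are idempotent.
            if (category.matches(item) && !category.items.contains(item)) {
                category.items.add(item);
                added++;
            }
        }
        return added;
    }
}
```

Passing a date range corresponds to the selective case, and passing no bounds
corresponds to reclassifying from the entire archive.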

The existing technique of classification is somewhat inefficient, which shows
up in the long time it takes to reclassify news when the entire archive is
selected.  This is because different parts of the system communicate largely
through file I/O.  I will work to reduce some of this file I/O and make
reclassification more efficient.

Student project: improving HTML-to-text filtering
-------------------------------------------------
I had mentioned in my previous posting that an undergrad student was
interested in working on NewsRack.  Jaikishan Jalan from Dhanbad School of
Mines spent about a month's time on a summer project.  I asked him to work on
a standalone project of improving the current HTML-to-text extractor to
eliminate spurious Javascript and other content that was not getting filtered
out.

The way NewsRack works, a news HTML file is first processed to extract the
text content.  This step is crucial to the accuracy of classification because
most news HTML files link to several related articles, and the text of those
link titles would otherwise register as "false" hits for several categories.

Jaikishan was unfamiliar with Java and parsing.  So, I spent time with him
over email, Yahoo chat (and a couple of days of in-person meetings in
Bangalore) helping him with this as well as with some internals of NewsRack.
The HTML-to-text filter in NewsRack is based on the Swing HTML parser.
Jaikishan identified why the existing HTML-to-text filter was letting pieces
of Javascript through -- this was due to faulty parsing by the Swing parser.
He also identified a new HTML parser (htmlparser.sourceforge.net) and worked
on porting the filter to use this new parser.
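
For readers unfamiliar with the Swing parser, the following is a minimal
sketch of an HTML-to-text filter built on it.  It is not NewsRack's actual
filter: the HtmlToText and TextCollector names are hypothetical, and the
approach shown (dropping any text reported between SCRIPT or STYLE start and
end tags) is an assumption about one reasonable way to keep Javascript out of
the extracted text.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class HtmlToText {
    /** Collects text content, ignoring anything inside SCRIPT or STYLE. */
    static final class TextCollector extends HTMLEditorKit.ParserCallback {
        private final StringBuilder out = new StringBuilder();
        private int skipDepth = 0;  // > 0 while inside a SCRIPT or STYLE element

        private static boolean isSkipped(HTML.Tag tag) {
            return tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (isSkipped(tag)) skipDepth++;
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (isSkipped(tag) && skipDepth > 0) skipDepth--;
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Separate text runs with a space so adjacent blocks do not fuse.
            if (skipDepth == 0) out.append(data).append(' ');
        }

        String text() {
            return out.toString().trim();
        }
    }

    /** Parses the HTML string and returns the visible text. */
    static String extract(String html) throws IOException {
        TextCollector collector = new TextCollector();
        // The third argument tells the parser to ignore charset directives.
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return collector.text();
    }
}
```

A call such as HtmlToText.extract(pageSource) returns the collected text as a
single string.  Dropping the titles of related-article links would need
additional, site-specific rules that this sketch does not attempt.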

I have since used his background work to fix the problems in the current
filtering.  I have decided to stay with the Swing HTML parser because (i) it
is over 3 times faster than htmlparser -- I did some benchmarking runs to
determine this, and (ii) there is no need to distribute an additional jar
file since Swing already comes with the JDK.

While I could have spent far less time and energy doing the detective work and
re-implementing this myself, I feel it has been a useful investment since
Jaikishan said he wanted to continue working on NewsRack over the semester at
his university.

Porting backend to MySQL
------------------------
Jaikishan and I identified a project for him -- that of developing a backend
MySQL schema for NewsRack and developing a Java class to work with this.  We
have had some preliminary discussions about the schema design.  This requires
more work, and we hope to complete it in the next couple of months.

The current backend uses XML files and the flat file system of the OS for
(1) recording user information - id, password, concepts, categories, profiles,
(2) archiving downloaded news and maintaining relevant index files for future
    access, and
(3) storing information about classified news in various categories.

Some of this is better stored in a database, which should also improve
efficiency and reduce the in-memory resource consumption of NewsRack on the
server side.

Comparison with other classification techniques
-----------------------------------------------
I received a suggestion that perhaps it is better to use machine learning and
automatic clustering/classification techniques (Bayesian learning, latent
semantic indexing, etc.) rather than use the current technique of
user-specified rules.

I had meant to respond to this much earlier, but I lost track of it in
between other work.  I am going to respond to this shortly in a separate mail.
But, briefly, while the observations are relevant and the suggestions useful,
I think there is still a lot of merit to user-input in this process -- the
whole idea is to identify a subset of the news world that I am interested in
and re-organize that subset in a fashion that makes sense to me.  Some kind of
user input is inevitable in this process -- this has been one of the starting
premises of this project.  But, it should nevertheless be possible to
synthesize this with automated learning processes -- the precise details of
this synthesis require further work -- thoughts and suggestions are welcome.