[Reader-list] Indic work report for July & August '2005
Raghavan Kandala
raghavan at servelots.com
Thu Sep 8 11:32:43 IST 2005
Hi all,
Here is our (Surekha and Raghavan) Indic work report for the months of July and August 2005.
Raghavan.
--------------------------------------------------------------------------------------------------------------------------------------------
*I. Morphological Analyser API in Java* (Surekha & Raghavan)
The Morphological Analyser API in Java for Indian languages is ready. At present it is being tested for Hindi. The algorithm is generic to all Indian languages, with some exceptions, but it depends heavily on the lexicon, which holds the rules and dictionaries essential for analysis.
_Algorithm:_
    1. Let 'INPUT_WORD' be the variable that stores the input word
       to be analysed.
    2. Let 'SUFFIX' be the variable into which the suffix extracted
       from the input word is accumulated.
    3. Let 'GUESSED_ROOT' be the variable which holds the guessed root.
    4. Repeat steps [4.1 - 4.5] till the length of SUFFIX equals the
       length of INPUT_WORD.
       4.1 Extract the last unconsumed character from INPUT_WORD and
           prepend it to the value of SUFFIX (so that SUFFIX stays in
           word order).
       4.2 Check if the value in SUFFIX is present in the 'SUFF_INFO'
           table.
           (a) If present,
               ADD_DEL_ENTRY = entry for this SUFFIX in the 'SUFF_INFO' table
               ADD_DEL_ENTRY.reference = reference
               ADD_DEL_ENTRY.paradigmCount = number of paradigms
               ADD_DEL_ENTRY.add = add
               ADD_DEL_ENTRY.suffix = suffix
               GUESSED_ROOT = INPUT_WORD - SUFFIX + ADD_DEL_ENTRY.add
           (b) If not present, GOTO step 4.1.
       4.3 Skip 'ADD_DEL_ENTRY.reference' number of characters in the
           'PDGM_OFFSET_INFO' table.
       4.4 Read 'ADD_DEL_ENTRY.paradigmCount' number of entries (lines)
           from the 'PDGM_OFFSET_INFO' table.
           PDGM_ENTRY = entry for this paradigm in the 'PDGM_OFFSET_INFO' table.
       4.5 For each entry in 'PDGM_ENTRY' read from the 'PDGM_OFFSET_INFO'
           table,
               PDGM_ENTRY.pdgm = pdgm
               PDGM_ENTRY.category = category
               PDGM_ENTRY.offset = offset
           check 'PDGM_ENTRY.pdgm' against the 'pdgm' field in the
           DICTIONARY and 'GUESSED_ROOT' against the 'word' field in the
           DICTIONARY. If a match is found, print the 'PDGM_ENTRY.offset'-th
           line from the 'FEATURE_VALUE' table.
    5. Exit.
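The suffix-stripping loop in steps 4.1-4.2 above can be sketched in Java as follows. The SUFF_INFO table is mocked here with a HashMap holding two toy English entries; the real analyser looks suffixes up in the compiled lexicon, so the class name and data below are illustrative only:

```java
import java.util.*;

// Minimal sketch of the suffix-stripping loop (steps 4.1-4.2 above).
// SUFF_INFO is mocked with a HashMap; real lookups go through the
// compiled lexicon described later in this report.
public class SuffixStripper {

    // Maps a suffix to the string added back to form the root
    // (the 'add' field of ADD_DEL_ENTRY in the algorithm above).
    static Map<String, String> SUFF_INFO = new HashMap<>();

    static List<String> guessRoots(String inputWord) {
        List<String> guesses = new ArrayList<>();
        StringBuilder suffix = new StringBuilder();
        // Step 4: grow the suffix one character at a time from the right.
        for (int i = inputWord.length() - 1; i >= 0; i--) {
            suffix.insert(0, inputWord.charAt(i));          // step 4.1
            String add = SUFF_INFO.get(suffix.toString());  // step 4.2
            if (add != null) {
                // GUESSED_ROOT = INPUT_WORD - SUFFIX + ADD_DEL_ENTRY.add
                String stem = inputWord.substring(0, i);
                guesses.add(stem + add);
            }
        }
        return guesses;
    }

    public static void main(String[] args) {
        // Toy entries: stripping "ies" adds back "y", stripping "s" adds "".
        SUFF_INFO.put("ies", "y");
        SUFF_INFO.put("s", "");
        System.out.println(guessRoots("stories")); // prints [storie, story]
    }
}
```

In the real analyser each candidate root is then validated against the dictionary and paradigm tables (steps 4.3-4.5), which filters out false guesses such as "storie" here.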
_Functional Details of Morphological Analyser:_
The Morphological Analyser operates in two modes:
1. Compilation mode
2. Analyser mode
_1. Compilation mode:_
The Morphological Analyser makes use of a lexicon which includes suffix information files, paradigm files and dictionaries. This lexicon is stored in text files whose size varies with the vocabulary of the language being analysed. Reading these files on every run slows down the analysis, as file reading is an expensive operation in terms of time.
To make this efficient, the lexicon data is read from the text files once. From this data, objects of SuffixInfo, ParadigmInfo, Dictionary and FeatureValue are constructed and stored in respective Hashtables. The Hashtable containing SuffixInfo objects is keyed on the suffix; the other Hashtables are keyed on the paradigm. The entries in the FeatureValue file are stored in a Vector. These data structures are then serialised to a disk file. This process is called "Data Compilation".
Any user who wants to add lexicon data to the Morphological Analyser has to run it in compile mode and specify the lexicon data files as input.
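A minimal sketch of this compile-then-serialise round trip, using a toy Hashtable in place of the real lexicon objects (the file name "morph.dat" and the table contents are made up for this illustration):

```java
import java.io.*;
import java.util.Hashtable;

// Sketch of the "Data Compilation" step described above: lexicon entries
// are loaded into a Hashtable and serialised once, so the analyser never
// re-parses the text files. Names here are illustrative, not the actual
// MorphCompiler API.
public class CompileSketch {
    public static void main(String[] args) throws Exception {
        // In the real compiler these entries come from the SUFF_INFO file;
        // two toy entries stand in for the parsed lexicon.
        Hashtable<String, String> suffixTable = new Hashtable<>();
        suffixTable.put("ies", "y");
        suffixTable.put("s", "");

        // Serialise the compiled table to a disk file.
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream("morph.dat"))) {
            out.writeObject(suffixTable);
        }

        // Analyser mode later deserialises the same file.
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream("morph.dat"))) {
            @SuppressWarnings("unchecked")
            Hashtable<String, String> loaded =
                (Hashtable<String, String>) in.readObject();
            System.out.println(loaded.get("ies")); // prints y
        }
    }
}
```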
_2. Analyser mode:_
To analyse words of the language, the Morphological Analyser has to be executed in Analyser mode. In this mode it makes use of the serialised data: it first deserialises it, populates the required objects and analyses the given word using these data structures. It returns all possible analyses of the word.
_Detailed Design of Morphological Analyser:_
The Morphological Analyser API has the following classes: Morph, MorphCompiler, MorphAnalyser, MorphResult, SuffixEntry, Paradigm, Dictionary and FeatureValue.
_Morph:_
This class has the main() method, which is used to run the morph as a tool. Invoking this class from the command line with the required arguments will print the output.
This class implements the Serializable interface because it has methods to serialize and deserialize the lexicon data required for analysis. The lexicon is a set of files which includes a dictionary, a suffix information file for the words in that language, the paradigms present in that language, and a feature value file which has the properties of the word given for analysis.
_MorphCompiler:_
This class does the lexicon compilation. It has methods to read the data from the disk files, construct objects of SuffixEntry, Paradigm, Dictionary and FeatureValue, build the Hashtables and serialise these Hashtables to disk.
To compile rules, call the compileRules() method of this class, passing as arguments the lexicon files from which the Morph has to compile rules and serialise this lexicon.
_MorphAnalyser:_
This class does the analysis of the given word. To analyse a word, call the analyse() method with the word to be analysed as an argument. The result of the analysis is stored in a MorphResult object.
_MorphResult:_
This class has methods to access the properties of the word given for analysis:
set/getRoot() -- sets/returns the ROOTWORD of the input word
set/getGender() -- sets/returns the GENDER of the input word
set/getTense() -- sets/returns the TENSE of the input word
set/getNumber() -- sets/returns the NUMBER (singular/plural) of the input word
set/getCategory() -- sets/returns the CATEGORY (Noun, Verb etc.) to which the input word belongs
set/getMorphCase() -- sets/returns the MORPH CASE of the input word:
    MORPHCASE 0 means direct case
    MORPHCASE 1 means oblique case
The classes SuffixEntry, ParadigmEntry, Dictionary and FeatureValue are used to store the entries of each line in the SUFF_INFO, PDGM_OFFSET_INFO, dict.final and featurevalue files respectively.
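As an illustration, here is a cut-down MorphResult-style bean and the way calling code might read an analysis. The field set follows the method list above, but this is a sketch, not the project's actual class:

```java
// Cut-down sketch of a MorphResult-style bean; field names follow the
// accessor list in this report, everything else is illustrative.
public class MorphResultSketch {
    private String root, category;
    private int morphCase; // 0 = direct case, 1 = oblique case

    public void setRoot(String r) { root = r; }
    public String getRoot() { return root; }
    public void setCategory(String c) { category = c; }
    public String getCategory() { return category; }
    public void setMorphCase(int m) { morphCase = m; }
    public int getMorphCase() { return morphCase; }

    public static void main(String[] args) {
        // In real use, MorphAnalyser.analyse() would fill these fields.
        MorphResultSketch result = new MorphResultSketch();
        result.setRoot("story");
        result.setCategory("Noun");
        result.setMorphCase(0);
        System.out.println(result.getRoot() + " / " + result.getCategory()
            + " / " + (result.getMorphCase() == 0 ? "Direct" : "Oblique"));
        // prints story / Noun / Direct
    }
}
```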
_Current Status:_ Testing
*II. Problem with TIMESTAMP field in MySQL version 4 with the old JDBC drivers:* (Raghavan)
The data representation for Date and Time field types (those we have encountered so far) changed in MySQL 4+ from the 3.x versions, and the old JDBC drivers do not support the new representation, giving strange results or exceptions. Using mysql-3.0.9.jar (a JDBC driver for MySQL) we get all the expected output, but if we use mysql-3.1.7.jar we observe the following two things in our Pantoto:
1. ResultSet.getString() on a DATETIME column gives output with ".0" appended.
2. When we try ResultSet.getString() on a DATETIME column whose value is "0000-00-00 00:00:00", it throws an SQLException: "Cannot Convert 0000-00-00 00:00:00 to NULL".
So the driver is behaving as designed; it simply does not know what to do in these situations. To fix these problems, we found an article (http://jroller.com/page/mmatthews?entry=connector_j_3_1_upgrade) written by one of the MySQL JDBC driver developers. It explains the need for this change in the driver's behaviour and the workarounds. We have to add
    "noDatetimeStringSync=true" (workaround for problem 1) and
    "zeroDateTimeBehavior=convertToNull" (workaround for problem 2)
to the MySQL connection URL, e.g.:
jdbc:mysql://localhost/grounddb?zeroDateTimeBehavior=convertToNull&noDatetimeStringSync=true
_The autoReconnect parameter:_
The autoReconnect parameter is used to make the driver reconnect if the MySQL server connection fails. It is deprecated, as most modern connection-pool packages offer this facility. We are not sure whether wroxpool has this feature incorporated, so we will use the parameter until we completely shift over to DBCP (hoping that it has this feature) in the new codebase, and we will continue with it as long as we have webapps from the old codebase.
Note: using "autoReconnect" with Connector/J 3.2 will throw an exception. To fix this, use the additional parameter 'enableDeprecatedAutoreconnect=true'.
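Putting the pieces above together, a minimal sketch of how the two workaround parameters are appended to the connection URL; the database name, user and password are placeholders:

```java
// Sketch of assembling a Connector/J URL with the two workaround
// parameters discussed above. No connection is opened here; the URL is
// just built and printed.
public class MysqlUrlSketch {
    public static void main(String[] args) {
        String base = "jdbc:mysql://localhost/grounddb";
        String url = base
            + "?zeroDateTimeBehavior=convertToNull" // problem 2: zero dates read as NULL
            + "&noDatetimeStringSync=true";         // problem 1: no ".0" appended
        System.out.println(url);
        // DriverManager.getConnection(url, user, password) would then pick
        // up these parameters; connecting is omitted in this sketch.
    }
}
```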
*III. Problems with Pantoto Installation due to Mysql driver version
*(Surekha)
We faced two problems with the Pantoto installation on http://pantoto.org.
1. HTTP 404 error, saying the resource /servlet/pantoto/ was not found.
Solution: Uncomment the invoker servlet mapping in web.xml under $TOMCAT_HOME/conf/. The purpose of the invoker servlet is to allow a web application to dynamically register new servlet definitions that correspond with a <servlet> element in the /WEB-INF/web.xml deployment descriptor, and to execute requests utilizing the new servlet definitions.
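For reference, this is the mapping to uncomment; the servlet-class below matches Tomcat 5's default conf/web.xml, so verify it against the Tomcat version in use:

```xml
<!-- Invoker servlet definition and mapping, commented out by default in
     $TOMCAT_HOME/conf/web.xml. Uncomment both blocks to enable
     /servlet/* URLs such as /servlet/pantoto/. -->
<servlet>
    <servlet-name>invoker</servlet-name>
    <servlet-class>org.apache.catalina.servlets.InvokerServlet</servlet-class>
</servlet>

<servlet-mapping>
    <servlet-name>invoker</servlet-name>
    <url-pattern>/servlet/*</url-pattern>
</servlet-mapping>
```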
2. Unable to connect to the database with urlid pantoto.
Solution:
1. Gave grant permissions to root for pantotodb:
   GRANT ALL PRIVILEGES ON pantotodb.* TO 'root'@'%' IDENTIFIED BY 'servelots';
2. Used the following script to test the database connection for the given user and host using a MySQL driver:
import java.sql.*;

public class TestMysql
{
    public static void main( String args[] )
    {
        try
        {
            /* Test loading the driver */
            String driver = "com.mysql.jdbc.Driver";
            // String driver = "org.gjt.mm.mysql.Driver";
            System.out.println( "\n=> loading driver:" );
            Class.forName( driver ).newInstance();
            System.out.println( "OK" );

            /* Test the connection */
            String url = "jdbc:mysql://localhost/pantotodb";
            System.out.println( "\n=> connecting:" );
            Connection con = DriverManager.getConnection( url, "root", "servelots" );
            System.out.println( "OK" );
            con.close();
        }
        catch( Exception x )
        {
            x.printStackTrace();
        }
    }
}
ERROR from the script:
java.sql.SQLException: Unable to connect to any hosts due to exception:
java.lang.ArrayIndexOutOfBoundsException: 40
    at com.mysql.jdbc.Connection.createNewIO(Connection.java:1797)
    at com.mysql.jdbc.Connection.<init>(Connection.java:562)
    at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:361)
    at java.sql.DriverManager.getConnection(DriverManager.java:512)
    at java.sql.DriverManager.getConnection(DriverManager.java:171)
    at db_tests.TestMysql.main(TestMysql.java:23)
We downloaded the latest version of the MySQL driver from http://dev.mysql.com/downloads/connector/j/3.1.html and it worked.
*IV. Synchronization of data in the PANTOTO system between a Central repository and different Local Access Points (LAP)* (Surekha & Raghavan)
Synchronization of data means updating the data from a Local Access Point to the Central repository and vice versa. The former is called synchronizing upward and the latter synchronizing downward.
_Synchronizing upwards:_
The users at a Local Access Point are allowed to synchronize data with the central repository under the following assumptions:
1. Only data (pagelet information) can be synchronized or updated from the Local Access Point to the Central repository. Any other information at the LAP will not get updated into the Central repository.
2. When updating from a LAP, if any pagelet conflicts with the one in the central repository, that pagelet's data will not get updated into the central repository. This can happen when a pagelet (one also existing in the central repository) is modified at one of the LAPs.
3. Users at the LAP are not allowed to do Pantoto administrative tasks like changing templates, adding new templates, changing user preferences, etc.
_Synchronizing downwards:_
1. Synchronizing downwards means replacing the LAP data with the data at the central repository.
2. Any changes (other than pagelet information) made at the LAP will be lost when synchronizing downwards.
_Technical details on Synchronization:_
1. At the LAP, whenever a new pagelet is created or an existing pagelet (one that also exists in the central repository) is modified, an XML file is generated with all the new pagelet information, in addition to the entry in the database.
2. The generated XML file also records whether this is a new pagelet or an edit of a pagelet that exists in the central repository.
3. When upward synchronization is initiated, the script transfers all the XML files from the LAP to the central repository and imports this pagelet information into the database with new pagelet ids.
4. If a pagelet's information is found to conflict with the one on the central repository, it is marked as conflicted and does not get imported into the central repository.
5. A LAP can synchronize upwards to another LAP; it need not always synchronize upwards to the central repository.
6. Each LAP has to maintain the information about the central repository or the other LAP with which it synchronizes data.
7. The synchronization is done using third-party tools, e.g. Ant-based scripts.
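As an illustration of steps 1 and 2 above, a sketch of writing the per-pagelet XML file with its new/edit flag. The element and attribute names are invented for this example, since the real file format is not specified here:

```java
import java.io.*;

// Illustrative sketch of steps 1-2 above: on every pagelet create/edit
// at a LAP, an XML file is written alongside the database entry, carrying
// a flag that says whether it is new or an edit of an existing pagelet.
// Element and attribute names are invented; the real format may differ.
public class PageletExportSketch {
    static String toXml(String pageletId, String body, boolean isNew) {
        return "<pagelet id=\"" + pageletId + "\" action=\""
            + (isNew ? "new" : "edit") + "\">"
            + body + "</pagelet>";
    }

    public static void main(String[] args) throws IOException {
        String xml = toXml("p42", "Hello from the LAP", true);
        // One file per changed pagelet; upward sync transfers these files.
        try (PrintWriter out = new PrintWriter(new FileWriter("p42.xml"))) {
            out.println(xml);
        }
        System.out.println(xml);
        // prints <pagelet id="p42" action="new">Hello from the LAP</pagelet>
    }
}
```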
*V. IndicIME Toolbar:*
The IndicIME Toolbar is now available on the Firefox extensions website http://addons.mozilla.org/extensions/?application=firefox under the Languages category. It can be installed by choosing Tools --> Extensions --> Get More Extensions and searching for Indic.
---------------------------------------------------------------------------------------------------------------------