[Reader-list] Indic work report for July & August '2005

Raghavan Kandala raghavan at servelots.com
Thu Sep 8 11:32:43 IST 2005


Hi all,

     Here is our (Surekha and Raghavan) Indic work report for the months
of July and August 2005.

Raghavan.
--------------------------------------------------------------------------------------------------------------------------------------------

*I. Morphological Analyser API in Java* (Surekha & Raghavan)

    The Morphological Analyser API in Java for Indian languages is ready.
    At present it is being tested for Hindi. The algorithm is generic to
    all Indian languages, with some exceptions, but it depends heavily on
    the lexicon, which holds the rules and dictionaries essential for
    analysis.

    _Algorithm:_

        1. Let 'INPUT_WORD' be the variable that stores the input word
        to be analysed.

        2. Let 'SUFFIX' be the array into which extracted suffixes from
        the input word are added.

        3. Let 'GUESSED_ROOT' be the variable which holds the guessed root.

        4. Repeat steps [4.1 - 4.5] until the length of SUFFIX equals
        the length of INPUT_WORD.

        4.1 Extract the last character from the INPUT_WORD and append it
        to the value of the SUFFIX variable.

        4.2 Check if the value in the SUFFIX variable is present in the
        'SUFF_INFO' table.

           (a) If present,
                ADD_DEL_ENTRY = Entry for this SUFFIX in 'SUFF_INFO' table
                ADD_DEL_ENTRY.Reference = Reference
                ADD_DEL_ENTRY.ParadigmCount = Number of paradigms
                ADD_DEL_ENTRY.add = add
                ADD_DEL_ENTRY.suffix = suffix
                GUESSED_ROOT = INPUT_WORD - SUFFIX + ADD_DEL_ENTRY.add

           (b) If not present, GOTO Step:4.1

        4.3 Skip 'ADD_DEL_ENTRY.Reference' number of characters in
        'PDGM_OFFSET_INFO' table.

        4.4 Read 'ADD_DEL_ENTRY.ParadigmCount' number of entries (lines)
        from the 'PDGM_OFFSET_INFO' table.

               PDGM_ENTRY = Entry for this paradigm in
        'PDGM_OFFSET_INFO' table.

        4.5 For each entry in 'PDGM_ENTRY' found from 'PDGM_OFFSET_INFO'
        table

               PDGM_ENTRY.pdgm = pdgm
               PDGM_ENTRY.category = category
               PDGM_ENTRY.offset = offset

               check 'PDGM_ENTRY.pdgm' against the 'pdgm' field in
        DICTIONARY and 'GUESSED_ROOT' against 'word' in DICTIONARY.
        If a match is found, then print the 'PDGM_ENTRY.offset'th line
        from the 'FEATURE_VALUE' table.

        5.0 Exit.
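    As a rough sketch of the loop in steps 4.1 - 4.2, the following
    reduces the SUFF_INFO table to a plain suffix-to-add-string
    Hashtable and uses illustrative English-style suffixes; the real
    table also carries the Reference and ParadigmCount fields, and the
    class and method names here are ours, not the actual API.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Sketch of the suffix-stripping loop in steps 4.1 - 4.2 (a simplified
// illustration, not the actual implementation).
public class SuffixStripper {

    // Returns every guessed root obtained by growing the suffix one
    // character at a time from the right end of the input word.
    public static List<String> guessRoots(String inputWord,
                                          Hashtable<String, String> suffInfo) {
        List<String> guessedRoots = new ArrayList<String>();
        StringBuilder suffix = new StringBuilder();
        // Step 4: repeat until SUFFIX is as long as INPUT_WORD.
        for (int i = inputWord.length() - 1; i >= 0; i--) {
            // Step 4.1: take the next character from the right end of the
            // word and grow the suffix with it.
            suffix.insert(0, inputWord.charAt(i));
            // Step 4.2: check SUFFIX against the SUFF_INFO table.
            String add = suffInfo.get(suffix.toString());
            if (add != null) {
                // Step 4.2(a): GUESSED_ROOT = INPUT_WORD - SUFFIX + add
                guessedRoots.add(inputWord.substring(0, i) + add);
            }
            // Step 4.2(b): if not present, continue with a longer suffix.
        }
        return guessedRoots;
    }
}
```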

    _Functional Details of Morphological Analyser :_

    Morphological Analyser operates in two modes,

    1. Compilation mode

    2. Analyser mode

    _1. Compilation mode:_

        The Morphological Analyser makes use of a lexicon, which includes
        suffix information files, paradigm files and dictionaries. This
        lexicon is stored in text files whose size varies with the number
        of words in the language being analysed. This slows down the
        analysis, as file reading is an expensive operation in terms of
        time.

        To make this efficient, the lexicon data is read from the text
        files once. From this data, objects of SuffixInfo, ParadigmInfo,
        Dictionary and FeatureValue are constructed and stored in
        respective Hashtables. The Hashtable containing SuffixInfo
        objects is hashed on the suffix as the key; the other Hashtables
        are hashed on the paradigm as the key. The entries in the
        FeatureValue file are stored in a Vector. These data structures
        are then serialized to a disk file. This process is called
        "Data Compilation".

        Any user who wants to add lexicon data to the Morphological
        Analyser has to run it in compile mode and specify the lexicon
        data files as input.

    _2. Analyser mode :_

        To analyse words of a language, the Morphological Analyser has to
        be executed in Analyser mode. In this mode it makes use of the
        serialised data: it first deserialises it, populates the required
        objects and analyses the given word using these data structures.
        It gives all the possible outputs of the analysis.
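    A minimal sketch of this compile/analyse round trip, with a plain
    Hashtable standing in for the compiled SuffixInfo, Paradigm and
    Dictionary structures; the class and method names here are
    assumptions for illustration, not the actual API.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Hashtable;

// Sketch of "Data Compilation" (serialize the compiled tables to disk)
// and of Analyser mode's first step (deserialize them again).
public class LexiconStore {

    // Compile mode: serialize the compiled lexicon data to a disk file.
    public static void compile(Hashtable<String, String> lexicon, File out)
            throws IOException {
        ObjectOutputStream oos =
                new ObjectOutputStream(new FileOutputStream(out));
        try {
            oos.writeObject(lexicon);
        } finally {
            oos.close();
        }
    }

    // Analyser mode: deserialize the compiled data before analysis.
    @SuppressWarnings("unchecked")
    public static Hashtable<String, String> load(File in)
            throws IOException, ClassNotFoundException {
        ObjectInputStream ois =
                new ObjectInputStream(new FileInputStream(in));
        try {
            return (Hashtable<String, String>) ois.readObject();
        } finally {
            ois.close();
        }
    }
}
```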

    _Detailed Design of Morphological Analyser :_

        The Morphological Analyser API has the following classes: Morph,
        MorphCompiler, MorphAnalyser, SuffixEntry, Paradigm, Dictionary
        and FeatureValue.

        _Morph:_

            This class has the main() method, which is used to run the
            morph as a tool. Invoking this class from the command line
            with the required arguments will print the output.

            This class implements the Serializable interface because it
            has methods to serialize and deserialize the lexicon data
            required for analysis. The lexicon is a set of files which
            includes a dictionary, a suffix information file for the
            words in that language, the paradigms that are present in
            that language, and a feature value file which has the
            properties of the word given for analysis.

        _MorphCompiler:_

            This class does the lexicon compilation. It has methods to
            read the data from the disk files, construct objects of
            SuffixEntry, Paradigm, Dictionary and FeatureValue,
            construct the Hashtables and Serialise these Hashtables to
            the disk.

            To compile rules, one can call the compileRules() method of
            this class with the lexicon files as arguments; Morph then
            compiles the rules from these files and serialises the
            lexicon.

        _MorphAnalyser:_

            This class does the analysis of the given word. To analyse a
            word, call the analyse() method with the word to be analysed
            as an argument. The result of the analysis is stored in a
            MorphResult object.

        _MorphResult:_

            This class has the methods to access the properties of the
            word given for analysis. This class has the following methods.

            set/getRoot() -- sets/Returns the ROOTWORD for the input word

            set/getGender() -- sets/Returns the GENDER for the input word

            set/getTense() -- sets/Returns the TENSE for the input word

            set/getNumber() -- sets/Returns the number (singular/plural)
            for the input word

            set/getCategory() -- sets/Returns the category (Noun, Verb
            etc.) to which the input word belongs.

            set/getMorphCase() -- sets/Returns the MORPH CASE of the
            input word.

            MORPHCASE - 0 means, Direct case
            MORPHCASE - 1 means, Oblique case

            The classes SuffixEntry, ParadigmEntry, Dictionary and
            FeatureValue are used to store entries in each line in the
            files SUFF_INFO, PDGM_OFFSET_INFO, dict.final and
            featurevalue files respectively.
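    The method list above can be sketched as a simple bean. The field
    types and the names of the two case constants are assumptions based
    only on the description; just the method list comes from the report.

```java
// Sketch of the MorphResult bean described above (field names and
// types are assumptions, not the actual implementation).
public class MorphResult {
    public static final int DIRECT_CASE = 0;   // MORPHCASE - 0
    public static final int OBLIQUE_CASE = 1;  // MORPHCASE - 1

    private String root, gender, tense, number, category;
    private int morphCase;

    public void setRoot(String root)         { this.root = root; }
    public String getRoot()                  { return root; }
    public void setGender(String gender)     { this.gender = gender; }
    public String getGender()                { return gender; }
    public void setTense(String tense)       { this.tense = tense; }
    public String getTense()                 { return tense; }
    public void setNumber(String number)     { this.number = number; }
    public String getNumber()                { return number; }
    public void setCategory(String category) { this.category = category; }
    public String getCategory()              { return category; }
    public void setMorphCase(int morphCase)  { this.morphCase = morphCase; }
    public int getMorphCase()                { return morphCase; }
}
```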

    _Current Status :_ Testing

*II. Problem with TIMESTAMP field in Mysql Version 4 with the old JDBC
Drivers* (Raghavan)

    The data representation in the MySQL database for the Date and Time
    field types (the ones we have encountered so far) changed in the
    MySQL 4+ versions from the 3.x versions, and the old JDBC drivers
    don't support this new data representation; they give strange results
    or exceptions. Using mysql-3.0.9.jar (the JDBC driver for MySQL), we
    get all the expected output. But if we use mysql-3.1.7.jar, we
    observed the following two things in our Pantoto.

    1. ResultSet.getString() on a DATETIME column will give an output
    with ".0" appended.

    2. When we try ResultSet.getString() on a DATETIME column that has
    the value "0000-00-00 00:00:00", it throws an SQLException:
    "Cannot Convert 0000-00-00 00:00:00 to NULL".

    So the driver is behaving as expected, because it doesn't know what
    to do in these situations. To fix these problems, we found an article
    (http://jroller.com/page/mmatthews?entry=connector_j_3_1_upgrade)
    written by one of the MySQL JDBC driver developers. It explains the
    need for this change in the behaviour of the JDBC driver and the
    workarounds. We have to add "noDatetimeStringSync=true" (workaround
    for problem 1) and "zeroDateTimeBehavior=convertToNull" (workaround
    for problem 2) to the MySQL connection URL.

    eg:
    jdbc:mysql://localhost/grounddb?zeroDateTimeBehavior=convertToNull&noDatetimeStringSync=true
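    A small helper illustrating how the two parameters are appended to a
    connection URL; the helper itself is ours, only the parameter names
    come from the workaround above.

```java
// Appends the two Connector/J workaround parameters to a MySQL
// connection URL, using '?' or '&' depending on whether the URL
// already carries parameters.
public class MysqlUrl {
    public static String withWorkarounds(String baseUrl) {
        String sep = baseUrl.indexOf('?') >= 0 ? "&" : "?";
        return baseUrl + sep
                + "zeroDateTimeBehavior=convertToNull"  // workaround for problem 2
                + "&noDatetimeStringSync=true";         // workaround for problem 1
    }
}
```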

    The parameter autoReconnect:

    The autoReconnect parameter sets the driver to reconnect if the
    connection to the MySQL server fails. But this is deprecated, as most
    modern connection pool managing packages offer this facility.

    We are not sure whether wroxpool has this feature incorporated into
    it. So we will use this parameter until we completely shift over to
    DBCP (hoping that it has this feature) in the new codebase, and we
    will continue with it as long as we have webapps from the old
    codebase.

    Note: using "autoReconnect" with the Connector/J 3.2 version will
    throw an exception. To fix this, use another parameter:
    'enableDeprecatedAutoreconnect=true'.

*III. Problems with Pantoto Installation due to Mysql driver version*
(Surekha)

    We faced two problems with pantoto installation on http://pantoto.org

    1. HTTP 404 error - saying the resource /servlet/pantoto/ not found

    Solution: Uncomment the invoker servlet mapping in web.xml under
    $TOMCAT_HOME/conf/. The purpose of the invoker servlet is to allow a
    web application to dynamically register new servlet definitions that
    correspond with a <servlet> element in the /WEB-INF/web.xml
    deployment descriptor, and to execute requests using the new servlet
    definitions.

    2. Unable to connect to the database with urlid pantoto

    Solution:

    1. Gave grant permissions to root for pantotodb:
    GRANT ALL PRIVILEGES ON pantotodb.* TO 'root'@'%' IDENTIFIED BY
    'servelots';

    2. Used this script to test the database connection for the given
    user and host using the MySQL driver:

      import java.sql.*;

      public class TestMysql
      {
          public static void main( String args[] )
          {
              try
              {
                  /* Test loading the driver */
                  String driver = "com.mysql.jdbc.Driver";
                  // String driver = "org.gjt.mm.mysql.Driver";
                  System.out.println( "\n=> loading driver:" );
                  Class.forName( driver ).newInstance();
                  System.out.println( "OK" );

                  /* Test the connection */
                  String url = "jdbc:mysql://localhost/pantotodb";
                  System.out.println( "\n=> connecting:" );
                  DriverManager.getConnection( url, "root", "servelots" );
                  System.out.println( "OK" );
              }
              catch( Exception x )
              {
                  x.printStackTrace();
              }
          }
      }

    ERROR from the script:

    java.sql.SQLException: Unable to connect to any hosts due to exception:
    java.lang.ArrayIndexOutOfBoundsException: 40
        at com.mysql.jdbc.Connection.createNewIO(Connection.java:1797)
        at com.mysql.jdbc.Connection.<init>(Connection.java:562)
        at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:361)
        at java.sql.DriverManager.getConnection(DriverManager.java:512)
        at java.sql.DriverManager.getConnection(DriverManager.java:171)
        at db_tests.TestMysql.main(TestMysql.java:23)

    We downloaded the latest version of the MySQL driver from
    http://dev.mysql.com/downloads/connector/j/3.1.html and it worked.

*IV. Synchronization of data in PANTOTO system between a Central
repository and different Local Access Points (LAP)* (Surekha & Raghavan)

    Synchronization of data means updating the data from a Local Access
    Point to the Central repository and vice versa. The former is called
    synchronizing upward and the latter synchronizing downward.

    _Synchronizing upwards :_

        The users at a Local Access Point are allowed to synchronize
        data with the central repository under the following assumptions:

        1. Only data (pagelet information) can be synchronized or updated
        from the Local Access Point to the Central repository. Any other
        information that is at the LAP will not get updated into the
        Central repository.

        2. When updating from a LAP, if any pagelet conflicts with the
        one in the central repository, then that pagelet's data will not
        get updated on to the central repository. This can happen when a
        pagelet (one that also exists in the central repository) is
        modified at one of the LAPs.

        3. Users at the LAP are not allowed to do Pantoto administrative
        tasks like changing templates, adding new templates, changing
        user preferences, etc.

    _Synchronizing Downwards :_

        1. Synchronizing downwards means replacing the LAP data with the
        data at the central repository.

        2. Any changes (other than pagelet information) made at the LAP
        will be lost when synchronizing downwards.

    _Technical details on Synchronization :_

        1. At the LAP, whenever a new pagelet is created or an existing
        pagelet (one that also exists in the central repository) is
        modified, an XML file is generated with all the new pagelet
        information, in addition to the entry in the database.

        2. The generated XML file will also have information about
        whether this is a new pagelet or an edit of a pagelet that
        exists in the central repository.

        3. When upward synchronization is initiated, the script transfers
        all the XML files from the LAP to the central repository and
        imports this pagelet information into the database with new
        pagelet ids.

        4. If pagelet information is found to conflict with the one on
        the central repository, it is marked as conflicted and doesn't
        get imported into the central repository.

        5. A LAP can synchronize upwards to another LAP; it need not
        always synchronize upwards to the central repository.

        6. Each LAP has to maintain the information about the central
        repository or the other LAP with which it synchronizes data.

        7. The synchronization is done using third-party tools,
        e.g. Ant-based scripts.
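    As a rough illustration of points 1 and 2, a generated sync file
    might look like the following; all element and attribute names here
    are hypothetical, since the report does not specify the actual
    format.

```xml
<!-- Hypothetical shape of a generated sync file; the actual element
     names used by Pantoto are not specified in the report. -->
<pagelet-sync type="edit">  <!-- "new" or "edit", as in point 2 -->
  <pagelet id="1042">
    <title>...</title>
    <content>...</content>
  </pagelet>
</pagelet-sync>
```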

*V. IndicIME Toolbar:*

    The IndicIME Toolbar is now available on the Firefox extensions
    website http://addons.mozilla.org/extensions/?application=firefox
    under the Languages category. It can be installed by choosing
    Tools --> Extensions --> Get More Extensions and searching for Indic.

---------------------------------------------------------------------------------------------------------------------