Re: Announce: LibOTS 0.3.0 released

From: Dom Lachowicz (domlachowicz@yahoo.com)
Date: Tue Jul 15 2003 - 14:49:36 EDT

    Jordi,

    If you want to talk about LibOTS's design, please talk
    with Nadav et al. off-list. I was just telling you
    what we needed to do for a Catalan OTS dictionary to
    work, because I thought that was what you had asked
    about.

    I can't work much on OTS's algorithms or design, as I
    work for a company that produces a similar (though
    much, much more powerful) product. I don't want any
    accusations of wrongdoing.

    Also, I'm subscribed to the list. Please don't CC me.

    Dom

    --- Jordi Mas <jmas@softcatala.org> wrote:
    > En/na Dom Lachowicz ha escrit:
    > > For Catalan, you just need to add a list of the 200 or
    > > so most common *meaningless* words in the language.
    > > Like:
    > >
    > > the, a, an, he, she, of, ...
    >
    > Hello Dom and the others,
    >
    > The summarizer should not only have stop words
    > (meaningless words); it should also know the most common
    > words in every language.
    >
    > Well, what I wanted to mention is that if you select the
    > stop words manually, the quality of the summarisation is
    > going to be low and the algorithm is not going to perform
    > well. If you use Word, you may be familiar with the concept
    > of not performing well when doing summarisation.
    >
    > The right way of getting a list of common words for a
    > language is to get a corpus (a collection of documents),
    > calculate the relative word frequency (how often each word
    > appears across all the documents) and then select the 200 or
    > 300 most common words; then you will have exactly what you
    > need. Also, the corpus would need to contain texts from
    > different areas of human knowledge.
    >
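
The corpus approach described above can be sketched roughly as follows. This is only an illustration in Python, not LibOTS code: the function name `top_common_words`, the tiny sample corpus, and the simple `\w+` tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter


def top_common_words(documents, n=200):
    """Return the n most frequent words with their relative frequencies.

    Hypothetical helper for illustration; not part of LibOTS.
    """
    counts = Counter()
    for text in documents:
        # \w+ is Unicode-aware in Python 3, so accented letters stay
        # inside words. It also matches digits and underscores.
        counts.update(re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    # Relative frequency = occurrences of the word / all word occurrences.
    return [(word, count / total) for word, count in counts.most_common(n)]


# Toy corpus; a real one should mix texts from many subject areas.
corpus = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog sat on a log",
]

for word, freq in top_common_words(corpus, n=3):
    print(f"{word}\t{freq:.3f}")
```

With a real corpus, `n` would be set to the 200 or 300 suggested above and the resulting list written out as the language's dictionary.
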
    > I know that not everyone has a corpus handy, but we should
    > do this with love at least for the major languages (English,
    > Spanish, German); if not, we are not going to perform well.
    >
    > I would also suggest implementing another algorithm in the
    > library that has been proven to be effective for text
    > summarisation. Lots of texts contain phrases like "In
    > conclusion", etc., that definitely should raise the score of
    > the sentence, and phrases like "As we said before" or "As
    > you have already seen" that should lower it. This works
    > well, especially for formal texts.
    >
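
A minimal sketch of the cue-phrase idea described above, in Python. The phrase table, the weights, and the function name `cue_phrase_adjustment` are hypothetical values chosen for illustration; LibOTS does not necessarily work this way.

```python
# Hypothetical cue phrases and weights (assumptions for this example).
# Positive weights raise a sentence's score, negative ones lower it.
CUE_PHRASES = {
    "in conclusion": 2.0,
    "as we said before": -1.0,
}


def cue_phrase_adjustment(sentence):
    """Return the score adjustment contributed by cue phrases."""
    lowered = sentence.lower()
    adjustment = 0.0
    for phrase, weight in CUE_PHRASES.items():
        # Plain substring match; a real implementation would probably
        # want word-boundary checks and per-language phrase lists.
        if phrase in lowered:
            adjustment += weight
    return adjustment


print(cue_phrase_adjustment("In conclusion, the method works."))
print(cue_phrase_adjustment("As we said before, the data is noisy."))
```

The adjustment would be added to whatever score the summarizer already assigns to the sentence from its word frequencies.
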
    > Finally, one common problem in text summarisation is that
    > the selected sentences assume knowledge that you may no
    > longer have. For example, if you select "He will do the
    > course with them" or "Also, ..." you no longer have these
    > references in the text. We can have a list of pronouns
    > (pronombres in Spanish) so that, if any are present, we
    > score the sentence lower, because we prefer sentences with
    > no references to text that we no longer have.
    >
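
The reference penalty described above could look something like this Python sketch. The word lists, the penalty values, and the name `reference_penalty` are assumptions for illustration only; a real list would be supplied per language.

```python
import re

# Hypothetical English word lists (assumptions for this example).
DANGLING_WORDS = {"he", "she", "it", "they", "him", "her", "them",
                  "this", "these", "those"}
OPENING_CONNECTIVES = {"also", "however", "therefore"}


def reference_penalty(sentence, per_word=0.5, opening=1.0):
    """Return a penalty for words that point at text outside the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    # Each pronoun-like word may refer to something the summary dropped.
    penalty = per_word * sum(1 for w in words if w in DANGLING_WORDS)
    # A sentence that opens with a connective continues an earlier thought.
    if words and words[0] in OPENING_CONNECTIVES:
        penalty += opening
    return penalty


print(reference_penalty("He will do the course with them"))
print(reference_penalty("Also, prices rose."))
```

The penalty would be subtracted from the sentence's score, so that self-contained sentences are preferred when scores are otherwise close.
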
    > Here are my five cents. If you think that some of this is
    > interesting, I can give you guys a hand, or two :-)
    >
    > Best Regards,
    > --
    >
    > Jordi Mas i Hernàndez - Abiword developer - http://www.abisource.com
    > jmas@softcatala.org - Softcatalà member - http://www.softcatala.org
    > - Personal Homepage http://www.softcatala.org/~jmas
    >




    This archive was generated by hypermail 2.1.4 : Tue Jul 15 2003 - 15:03:00 EDT