Re: Announce: LibOTS 0.3.0 released

From: Dom Lachowicz (domlachowicz@yahoo.com)
Date: Tue Jul 15 2003 - 14:49:36 EDT

Next message: Marc Maurer: "Commit: illegal memory access bug in the piecetable"

Previous message: Jordi Mas: "Re: Announce: LibOTS 0.3.0 released"
In reply to: Jordi Mas: "Re: Announce: LibOTS 0.3.0 released"
Next in thread: Karl Ove Hufthammer: "Re: Announce: LibOTS 0.3.0 released"
Next in thread: Andrew Dunbar: "Re: Announce: LibOTS 0.3.0 released"

Jordi,

If you want to talk about LibOTS's design, please talk
with Nadav et al. off-list. I was just telling you
what we needed to do for a Catalan OTS dictionary to
work, because I thought that was what you had asked
about.

I can't work much on OTS's algorithms or design, as I
work for a company that produces a similar (though
much, much more powerful) product. I don't want any
accusations of wrongdoing.

Also, I'm subscribed to the list. Please don't CC me.

Dom

--- Jordi Mas <jmas@softcatala.org> wrote:
> Dom Lachowicz wrote:
> > For Catalan, you just need to add a list of the 200 or
> > so most common *meaningless* words in the language.
> > Like:
> >
> > the, a, an, he, she, of, ...
>
> Hello Dom and the others,
>
> The summarizer should not only have stop words
> (meaningless words); it should also know the most
> common words in every language.
>
> Well, what I wanted to mention is that if you select
> the stop words manually, the quality of the
> summarisation is going to be low and the algorithm is
> not going to perform well. If you use Word, you may be
> familiar with summarisation that does not perform well.
>
> The right way of getting a list of common words for a
> language is to take a corpus (a collection of
> documents), calculate the relative word frequency (the
> number of times each word appears across all
> documents), and then select the 200 or 300 most common
> words; then you have exactly what you need. The corpus
> also needs to contain texts from different areas of
> human knowledge.
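
A minimal sketch of the corpus-based selection described above, in
Python rather than LibOTS's own C. The function name, the toy corpus,
and the tokenising regex are assumptions made for illustration; they
are not part of LibOTS. Ranking by raw counts gives the same order as
ranking by relative frequency, since every count is divided by the
same corpus total.

```python
import re
from collections import Counter


def most_common_words(documents, n=200):
    """Return the n most frequent words across all documents."""
    counts = Counter()
    for text in documents:
        # Lowercase, then take runs of letters. The pattern is
        # Unicode-aware, so accented letters such as "à" are kept.
        # Simplification: apostrophes and the Catalan middle dot
        # split a word in two.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return [word for word, _ in counts.most_common(n)]


# Toy corpus (hypothetical); a real one should mix many subject areas.
corpus = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog sat on a log",
]
print(most_common_words(corpus, n=3))
```

The resulting list would still need a quick review by a native
speaker before being used as a stop-word dictionary.
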
>
> I know that not everyone has a corpus handy, but we
> should do this with love, at least for the major
> languages (English, Spanish, German); if not, we are
> not going to perform well.
>
> I would also suggest implementing another algorithm in
> the library that has been proven effective for text
> summarisation. Lots of texts contain phrases like "In
> conclusion", etc., that should definitely raise the
> score of the sentence, and phrases like "As we said
> before" or "As you have already seen" that should lower
> it. This works well, especially for formal texts.
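
The cue-phrase idea above could be sketched roughly as follows. This
is only an illustration in Python: the phrase lists, the function
name, and the multipliers 1.5 and 0.5 are assumptions, not values
taken from LibOTS or from any published algorithm.

```python
# Hypothetical cue lists; a real one would be per-language and longer.
BONUS_CUES = ("in conclusion",)
PENALTY_CUES = ("as we said before", "as you have already seen")


def adjust_for_cues(sentence, base_score, bonus=1.5, penalty=0.5):
    """Scale a sentence's score up or down when it contains a cue phrase."""
    lowered = sentence.lower()
    score = base_score
    if any(cue in lowered for cue in BONUS_CUES):
        score *= bonus      # summarising phrases make a sentence more useful
    if any(cue in lowered for cue in PENALTY_CUES):
        score *= penalty    # back-references repeat what was already said
    return score


print(adjust_for_cues("In conclusion, the method works.", 2.0))
print(adjust_for_cues("As we said before, the method works.", 2.0))
```

The base score would come from whatever word-frequency scoring the
summarizer already does; this step only adjusts it.
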
>
> Finally, one common problem in text summarisation is
> that the selected sentences assume knowledge that you
> may no longer have. For example, if you select "He will
> do the course with them" or "Also, ...", you no longer
> have these references in the text. We could have a list
> of pronouns (pronombres in Spanish) and score a
> sentence lower if any are present, because we prefer
> sentences with no references to text that we no longer
> have.
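
A rough Python sketch of the pronoun penalty described above. The
word list, the function name, and the 0.5 multiplier are illustrative
assumptions only; a real list would be maintained per language.

```python
import re

# Hypothetical list of words that point back to text a summary may drop.
DANGLING_WORDS = {"he", "she", "they", "them", "it", "also"}


def reference_penalty(sentence, base_score, penalty=0.5):
    """Lower the score of a sentence that contains a dangling reference."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if any(word in DANGLING_WORDS for word in words):
        return base_score * penalty
    return base_score


print(reference_penalty("He will do the course with them", 4.0))
print(reference_penalty("The course starts on Monday", 4.0))
```

Matching whole words rather than substrings keeps "the" from being
mistaken for "he".
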
>
> Here are my five cents. If you think that some of this
> is interesting, I can give you guys a hand, or two :-)
>
> Best Regards,
> --
>
> Jordi Mas i Hernàndez - Abiword developer - http://www.abisource.com
> jmas@softcatala.org - Softcatalà member - http://www.softcatala.org
> - Personal Homepage http://www.softcatala.org/~jmas
>



This archive was generated by hypermail 2.1.4 : Tue Jul 15 2003 - 15:03:00 EDT