From: Jordi Mas (jmas@softcatala.org)
Date: Tue Jul 15 2003 - 14:44:19 EDT
Dom Lachowicz wrote:
> For Catalan, you just need to add a list of the 200 or
> so most common *meaningless* words in the language.
> Like:
>
> the, a, an, he, she, of, ...
Hello Dom and the others,
The summarizer should not only have stop words (meaningless words); it should
also know the most common words in every language.
What I wanted to mention is that if you select the stop words manually, the
quality of the summarisation is going to be low and the algorithm is not going
to perform well. If you use Word, you may be familiar with summarisation that
does not perform well.
The right way to get a list of common words for a language is to take a
corpus (a collection of documents), calculate the relative word frequency
(the number of times each word appears across all documents, relative to the
total word count) and then select the 200 or 300 most common words. Then you
have exactly what you need.
Also, the corpus would need to contain texts from different areas of human
knowledge.
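The corpus approach described above could be sketched roughly like this in
Python. This is only an illustration, not code from the summarizer: the
function name most_common_words, the simple letters-only tokenizer and the
tiny two-sentence corpus are all assumptions made for the example.

```python
import re
from collections import Counter

def most_common_words(documents, n=200):
    """Return the n most frequent words with their relative frequency.

    Hypothetical helper: the name and the tokenizer are assumptions.
    """
    counts = Counter()
    for text in documents:
        # Lowercase, then keep runs of letters only (Unicode-aware),
        # so accented words in Catalan or Spanish are kept whole.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    total = sum(counts.values())
    # Relative frequency = occurrences of the word / all words in the corpus.
    return [(word, count / total) for word, count in counts.most_common(n)]

# Toy corpus for illustration; a real one should span many subject areas.
corpus = [
    "The cat sat on the mat.",
    "The dog and the cat are friends.",
]
top = most_common_words(corpus, n=3)
```

With a real corpus, n would be 200 or 300, and the resulting list would be
stored per language.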
I know that not everyone has a corpus handy, but we should do this with love
at least for the major languages (English, Spanish, German); otherwise we are
not going to perform well.
I would also suggest implementing another algorithm in the library that has
been proven to be effective for text summarisation. Lots of texts contain
phrases like "In conclusion", etc., that definitely should raise the score of
the sentence, and phrases like "As we said before" or "As you have already
seen" that should lower it. This works well, especially for formal texts.
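A minimal sketch of this cue-phrase idea, in Python. The phrase lists only
contain the examples from this mail, and the weights (+2.0 and -1.0) and the
name cue_phrase_score are made up for illustration; real values would need
tuning per language.

```python
# Hypothetical phrase lists and weights, chosen only for illustration.
BONUS_PHRASES = {"in conclusion": 2.0}
PENALTY_PHRASES = {
    "as we said before": -1.0,
    "as you have already seen": -1.0,
}

def cue_phrase_score(sentence, base_score=0.0):
    """Adjust a sentence score according to the cue phrases it contains."""
    text = sentence.lower()
    score = base_score
    for phrase, weight in {**BONUS_PHRASES, **PENALTY_PHRASES}.items():
        # Plain substring match; a real implementation may want to
        # match on word boundaries instead.
        if phrase in text:
            score += weight
    return score
```

For example, cue_phrase_score("In conclusion, the method works.") would raise
the score of that sentence, while a sentence starting with "As we said
before" would end up with a lower one.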
Finally, one common problem in text summarisation is that the selected
sentences assume knowledge that you may no longer have. For example, if you
select "He will do the course with them" or "Also, ..." you no longer have
these references in the text. We can keep a list of pronouns ("pronombres" in
Spanish) and, if any are present, score the sentence lower, because we
prefer sentences with no references to text that we no longer have.
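The pronoun idea could look something like the following Python sketch. The
English word lists, the penalty values and the name reference_penalty are
assumptions for the example; each language would need its own lists.

```python
import re

# Hypothetical English lists; each language would need its own.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}
LEADING_CONNECTIVES = {"also", "however", "therefore"}

def reference_penalty(sentence, per_pronoun=0.5, connective=1.0):
    """Return a penalty for words that point back to text we may have dropped."""
    words = re.findall(r"[a-z']+", sentence.lower())
    # Each pronoun suggests a reference to an earlier sentence.
    penalty = per_pronoun * sum(1 for w in words if w in PRONOUNS)
    # A sentence that opens with a connective continues a previous thought.
    if words and words[0] in LEADING_CONNECTIVES:
        penalty += connective
    return penalty
```

The penalty would be subtracted from the normal sentence score, so "He will
do the course with them" (two pronouns) ranks below a sentence that stands on
its own.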
Here are my five cents. If you think that some of this is interesting, I can
give you guys a hand, or two :-)
Best Regards,
--Jordi Mas i Hernàndez - Abiword developer - http://www.abisource.com jmas@softcatala.org - Softcatalà member - http://www.softcatala.org - Personal Homepage http://www.softcatala.org/~jmas