Thank you very much Daniel!
Maria
Daniel Alheiros wrote:
If you do want more stopwords sources, there is this one too:
http://snowball.tartarus.org/algorithms/
And I would go for the language identification and then I would apply the
proper set.
Cheers,
Daniel
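
The "identify the language, then apply the proper set" approach suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword lists are tiny toy samples (real lists could come from a source such as the Snowball site), the names guess_language and strip_stopwords are hypothetical, and the detection step is a simple count of stopword hits rather than any particular library's language identifier.

```python
# Toy stopword lists, assumed for illustration only and far from complete.
STOPWORDS = {
    "en": {"the", "is", "by", "and", "of"},
    "de": {"der", "die", "und", "ist", "von"},
    "sv": {"och", "till", "att", "det", "som"},
}


def guess_language(tokens):
    """Return the language whose stopword list matches the most tokens,
    or None when no token matches any list."""
    counts = {
        lang: sum(t.lower() in words for t in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def strip_stopwords(text):
    """Guess the language first, then remove only that language's stopwords."""
    tokens = text.split()
    lang = guess_language(tokens)
    stop = STOPWORDS.get(lang, set())  # unknown language: remove nothing
    return lang, [t for t in tokens if t.lower() not in stop]


print(strip_stopwords("der Hund ist von Berlin"))
print(strip_stopwords("the village is by the sea"))
```

Counting stopword hits is crude and will misfire on very short texts; it is shown here only because it needs no extra dependencies.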
On 18/10/07 16:18, "Maria Mosolova" <[EMAIL PROTECTED]> wrote:
Thanks a lot Peter!
Maria
On 10/18/07, Binkley, Peter <[EMAIL PROTECTED]> wrote:
There's code in Nutch to identify the language of a given text:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/lang/LanguageIdentifier.html
Peter
-----Original Message-----
From: Maria Mosolova [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 18, 2007 8:48 AM
To: solr-user@lucene.apache.org
Subject: Re: multilingual list of stopwords
Thanks a lot to everyone who responded. Yes, I agree that eventually we
need to use separate stopword lists for different languages.
Unfortunately the data we are trying to index at the moment does not
contain any direct country/language information and we need to create
the first version of the index quickly. It does not look like analyzing
documents to determine their language is something which could be
accomplished in a very limited timeframe. Or am I wrong here and there
are existing analyzers one could use?
Maria
On 10/18/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
Also "die" in German and English. --wunder
On 10/18/07 4:16 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
One example that I'm familiar with: words "is" and "by" in English
and in Swedish. Both words are stopwords in English, but they are
content words in Swedish (ice and village, respectively). Similarly,
"till" in Swedish is a stopword (to, towards), but it's a content
word in English.
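
The collision described above can be made concrete with a small sketch. It assumes toy stopword lists built around the words mentioned in this thread, and the function names are hypothetical; it compares per-language filtering with a single merged multilingual list.

```python
# Toy stopword lists for illustration only; "is", "by" and "till" are the
# examples from this thread, the remaining entries are assumed additions.
STOPWORDS = {
    "en": {"is", "by", "the", "a"},
    "sv": {"till", "och", "en"},
}


def remove_stopwords(tokens, lang):
    """Drop the stopwords of `lang`; unknown languages keep every token."""
    stop = STOPWORDS.get(lang, set())
    return [t for t in tokens if t.lower() not in stop]


def remove_stopwords_merged(tokens):
    """Naive variant: one merged list applied regardless of language."""
    merged = set().union(*STOPWORDS.values())
    return [t for t in tokens if t.lower() not in merged]


swedish = ["by", "till", "is"]  # village, to, ice
print(remove_stopwords(swedish, "sv"))   # ['by', 'is']
print(remove_stopwords_merged(swedish))  # []
```

With the merged list, the Swedish content words "by" and "is" are lost along with the real stopword "till", which is the failure mode the examples in this thread point at.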