Re: Analyzers and multiple languages (language detection)

2006-11-21 Thread Bob Carpenter
Antony Bowesman wrote: Hello, I'm new to Lucene and wanted some advice on analyzers, stemmers and language analysis. I've got LIA, so have read its chapters. I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the docu

Re: Analyzers and multiple languages

2006-10-13 Thread Erik Hatcher
On Oct 13, 2006, at 3:42 AM, Antony Bowesman wrote: I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the document is known. Has anyone looked at or is using language analysis to determine the language of a document

Re: Analyzers and multiple languages

2006-10-13 Thread Soeren Pekrul
Hello Antony, I have a similar problem. My collection contains mainly German documents, but some in English and a few in French, Spanish and Latin. I know that each language has its own stemming rules. Language detection is not my domain. But I can imagine it could be possible to detect the lang

Re: Analyzers and multiple languages

2006-10-13 Thread Mark Miller
Generally, stemming is not a method for index size reduction, even though that might be a side effect. It is very useful in search, however: you would generally want a search for "skiing" to also hit "ski" and "skier" (I can't spell, so don't get caught up on that). There are lots of those examples...if y
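
A minimal sketch of the point about stemming and search, assuming a toy suffix-stripping rule set rather than any real stemmer. The names toy_stem, build_index and search, the suffix list, and the sample documents are all hypothetical and are not part of Lucene or of the original message. It only illustrates how reducing "skiing" and "skier" to a shared stem lets one query match all three forms.

```python
from collections import defaultdict

# Toy suffix list, an assumption for illustration only.
# Real stemmers (e.g. Porter or Snowball) use far more careful rules.
SUFFIXES = ("ing", "er", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the indexed tokens, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


# Hypothetical sample documents.
docs = {
    1: "ski rental",
    2: "expert skier",
    3: "skiing holiday",
    4: "snow report",
}
index = build_index(docs)
print(search(index, "skiing"))
```

The important design point is that the same stemming step runs at index time and at query time. With several languages in one collection, that step would have to be chosen per language, which is why the question of detecting a document's language comes up in this thread.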

Re: Analyzers and multiple languages

2006-10-13 Thread Erick Erickson
This won't be *really* helpful, but I remember this being discussed at some length a while ago. You'd be able to see some good info if you searched the list archive, probably for "language". I didn't pay much attention since this isn't something I'm concerned with lately, so I can't be much real hel

Analyzers and multiple languages

2006-10-13 Thread Antony Bowesman
Hello, I'm new to Lucene and wanted some advice on analyzers, stemmers and language analysis. I've got LIA, so have read its chapters. I am writing a framework that needs to be able to index documents from a range of languages where just the character set of the document is known. Has anyo