Re: Lucene for a linguistic corpus

2013-01-08 Thread Grant Ingersoll
Hi Igor, On Jan 5, 2013, at 7:36 AM, Igor Shalyminov wrote: Hello! I'm considering Lucene as an engine for linguistic corpus search. There's a feature of this search: each word is treated as ambiguous, i.e. it has multiple sets of grammatical annotations (there's a fixed maximum
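The question describes tokens that each carry several alternative grammatical analyses. In Lucene this is commonly modelled by emitting one token per reading at the same position (positionIncrement 0, the same trick used for synonyms). A stdlib-only sketch of the data shape, with an invented Russian example ("стекло" reads as either a noun or a past-tense verb):

```java
import java.util.List;
import java.util.stream.Collectors;

public class AmbiguousToken {
    // One surface form expanded into one indexable token per reading;
    // in a Lucene token stream each reading would get positionIncrement 0
    // so that all readings share the original word's position.
    static List<String> expand(String word, List<String> readings) {
        return readings.stream()
                .map(tag -> word + "/" + tag)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented example: "стекло" = noun "glass" or verb "flowed down".
        System.out.println(expand("стекло",
                List.of("NOUN,neut,sg,nom", "VERB,past,neut,sg")));
    }
}
```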

Lucene support for multi-byte characters: 2.4.0 (version).

2013-01-08 Thread saisantoshi
We are using Lucene (the 2.4.0 libraries) to implement search in our application, with StandardAnalyzer as the analyzer. Our application has a document-upload feature that lets users upload documents and attach some keywords to them (while uploading). When we search (using
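Multi-byte handling usually goes wrong before any analyzer runs: the uploaded bytes must be decoded with the correct charset when the document is read, not with the platform default. A minimal stdlib sketch of the decoding step (the helper name is invented for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    // Decode raw uploaded bytes explicitly as UTF-8 before handing the
    // resulting String to an Analyzer; relying on the platform default
    // charset is a common source of mangled multi-byte terms.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "café 東京".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(raw)); // round-trips intact: café 東京
    }
}
```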

Differences in MLT Query Terms Question

2013-01-08 Thread Peter Lavin
Dear Users, I am running some simple experiments with Lucene and am seeing something I don't understand. I have 16 text files on 4 different topics, ranging in size from 50-900 KB. When I index all 16 of these and run an MLT query based on one of the indexed documents, I get an expected

Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread saisantoshi
Does Lucene's StandardAnalyzer work for all languages for tokenizing before indexing (since we are using Java, I think the content is converted to UTF-8 before tokenizing/indexing)? Or do we need to use special analyzers for each language? In this case, if a document has a mixed case (

Re: Differences in MLT Query Terms Question

2013-01-08 Thread Jack Krupansky
The term arv is on the first list, but not the second. Maybe its document frequency fell below the minimum-document-frequency setting on the second run. Or maybe the minimum word length was set to 4 or more on the second run. Are you using MoreLikeThisQuery or directly using
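The pruning Jack describes can be illustrated without Lucene. This sketch mirrors MoreLikeThis-style term filtering (it is not the Lucene code itself; the document frequencies here are made up): a short term like "arv" drops out as soon as either minimum is raised.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MltFilterDemo {
    // A candidate term survives only if its document frequency and its
    // length both clear the configured minimums.
    static List<String> keep(Map<String, Integer> docFreq,
                             int minDocFreq, int minWordLen) {
        return docFreq.entrySet().stream()
                .filter(e -> e.getValue() >= minDocFreq)
                .filter(e -> e.getKey().length() >= minWordLen)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> df =
                Map.of("arv", 2, "antiretroviral", 5, "therapy", 8);
        // "arv" fails both filters: df 2 < 3 and length 3 < 4.
        System.out.println(keep(df, 3, 4)); // prints [antiretroviral, therapy]
    }
}
```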

RE: Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread Paul Hill
The ICU project ( http://site.icu-project.org/ ) has Analyzers for Lucene, and it has been ported to ElasticSearch. Maybe those integrate better. As for not doing some tokenization, I would think an extra tokenizer in your chain would be just the thing. -Paul -Original Message- From:

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread Steve Rowe
Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest to you, along with the token filters in that same module. - Steve On Jan 8, 2013, at 6:43 PM, Trejkaz trej...@trypticon.org wrote: On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi saisantosh...@gmail.com wrote:

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread Trejkaz
On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe sar...@gmail.com wrote: Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest to you, along with the token filters in that same module. - Steve ICUTokenizer sounds like it's implementing UAX #29, which is exactly the
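ICUTokenizer lives in Lucene's icu module, but the UAX #29 word-break behaviour being discussed can be previewed with the JDK's own BreakIterator. This is a rough stdlib approximation, not the Lucene tokenizer itself (CJK segmentation in particular differs, since ICU adds dictionary-based breaking):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakDemo {
    // Walk the word boundaries BreakIterator reports and keep only
    // pieces that contain a letter or digit (dropping spaces and
    // punctuation), which approximates tokenizer output.
    static List<String> tokens(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Lucene indexes text; 東京 is Tokyo.",
                Locale.ROOT));
    }
}
```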

Re: Cannot instantiate SPI class

2013-01-08 Thread Steve Rowe
Hi Igal, Sounds like you don't have lucene-codecs-4.0.0.jar in Railo's classpath. Steve On Jan 8, 2013, at 10:53 PM, Igal @ getRailo.org i...@getrailo.org wrote: I'm trying to access Lucene 4 from Railo (an open-source application server). When I try to create an IndexWriterConfig I get the

Re: Cannot instantiate SPI class

2013-01-08 Thread Igal @ getRailo.org
Hi Steve, thanks for your reply. At first I also thought that, so I added lucene-codecs-4.0.0.jar, which caused another error and prompted me to add commons-codec-1.7.jar as well. This error is after I added those two jars, but now I'm thinking -- I added them to Tomcat's classpath (Railo

Re: Cannot instantiate SPI class

2013-01-08 Thread Steve Rowe
Hmm, I don't know. Are you actually using AppendingCodec? For anybody else looking at this: the same exception is listed on https://issues.apache.org/jira/browse/LUCENE-4391 - see also linked issue https://issues.apache.org/jira/browse/LUCENE-4440. Both of these are marked as fixed in Lucene
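The underlying mechanism in this thread: Lucene 4 resolves codec names by scanning META-INF/services entries, so "Cannot instantiate SPI class" is typically a classloader-visibility problem, as Igal suspects with the Tomcat-vs-Railo classpath. The JDK's FileSystemProvider SPI uses the same discovery style and makes a convenient stdlib illustration:

```java
import java.nio.file.spi.FileSystemProvider;

public class SpiDemo {
    public static void main(String[] args) {
        // ServiceLoader-style discovery: implementations are found via
        // META-INF/services entries visible to the classloader doing the
        // lookup. If a provider jar sits on a different classloader than
        // the core jar (e.g. the container's instead of the webapp's),
        // instantiation fails even though the jar is "on the classpath".
        for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
            System.out.println(p.getScheme() + " -> " + p.getClass().getName());
        }
    }
}
```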

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread Steve Rowe
Dude. Go look. It allows for per-script specialization, with (non-UAX#29) specializations by default for Thai, Lao, Myanmar and Hebrew. See DefaultICUTokenizerConfig. It's filled with exactly the opposite of what you were describing. ICUTokenizerFactory's customizability has been