Hi Igor,
On Jan 5, 2013, at 7:36 AM, Igor Shalyminov wrote:
Hello!
I'm considering Lucene as an engine for linguistic corpus search.
There's a feature in this search: each word is treated as ambiguous - i.e.,
it has multiple sets of grammatical annotations (there's a fixed maximum
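One common way to index that kind of ambiguity is the synonym trick: emit every analysis of a word at the same token position with a zero position increment, so a query for any single analysis matches the ambiguous word. A minimal sketch against the Lucene 4.x TokenStream API (the class name and the analyses source are illustrative, not from the original thread):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Emits every grammatical analysis of a word at the same token position,
// so a query matching any one analysis matches the ambiguous word.
public final class AnnotationStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final Iterator<List<String>> words; // analyses per word, in corpus order
  private Iterator<String> analyses;
  private boolean firstAtPosition;

  public AnnotationStream(Iterator<List<String>> words) {
    this.words = words;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (analyses == null || !analyses.hasNext()) {
      if (!words.hasNext()) return false;
      analyses = words.next().iterator();
      firstAtPosition = true;
    }
    clearAttributes();
    termAtt.append(analyses.next());
    posIncAtt.setPositionIncrement(firstAtPosition ? 1 : 0); // 0 = same position
    firstAtPosition = false;
    return true;
  }
}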
We are using Lucene (the 2.4.0 libraries) to implement search in our
application, with StandardAnalyzer as the analyzer.
Our application has a document upload feature that lets you upload
documents and attach some keywords while uploading. When we
search (using
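A minimal sketch of that indexing step on the 2.4-era API, with the field names and the text/keyword parameters being assumptions for illustration:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class UploadIndexer {
  // Index the extracted text and the user-supplied keywords as separate
  // fields, so either can be searched or boosted independently.
  public static void indexUpload(Directory dir, String extractedText,
                                 String userKeywords) throws IOException {
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
        IndexWriter.MaxFieldLength.LIMITED);
    try {
      Document doc = new Document();
      doc.add(new Field("contents", extractedText, Field.Store.NO, Field.Index.ANALYZED));
      doc.add(new Field("keywords", userKeywords, Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
    } finally {
      writer.close();
    }
  }
}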
Dear Users,
I am running some simple experiments with Lucene and am seeing something
I don't understand.
I have 16 text files on 4 different topics, ranging in size from 50 KB to
900 KB. When I index all 16 of these and run an MLT query based on one of
the indexed documents, I get an expected
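For reference, a bare-bones MoreLikeThis query built from an indexed document might look like this on Lucene 4.x, where MoreLikeThis lives in the queries module (the field name, directory, and doc id are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class MltDemo {
  // Build a "more like this" query from the terms of indexed document docId.
  public static TopDocs similarTo(Directory dir, int docId) throws Exception {
    IndexReader reader = DirectoryReader.open(dir);
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] { "contents" }); // field to mine terms from
    mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_40));
    Query query = mlt.like(docId);
    return new IndexSearcher(reader).search(query, 10);
  }
}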
Does Lucene's StandardAnalyzer work for all languages when tokenizing before
indexing (since we are using Java, I think the content is converted to UTF-8
before tokenizing/indexing)? Or do we need to use a special analyzer for each
language? In this case, if a document has a mixed case (
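One quick way to answer this empirically is to print what StandardAnalyzer emits for a mixed-script string; a sketch against the Lucene 4.x API (the field name and sample text are arbitrary):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenDump {
  public static void main(String[] args) throws Exception {
    // Dump the tokens StandardAnalyzer emits for mixed-script input,
    // to see how it segments each language.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    TokenStream ts = analyzer.tokenStream("f", new StringReader("English 日本語 français"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}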
The term "arv" is on the first list, but not the second. Maybe its document
frequency fell below the setting for minimum document frequency on the
second run.
Or maybe the minimum word length was set to 4 or more on the second run.
Are you using MoreLikeThisQuery or directly using
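If it is MoreLikeThis, the two cutoffs suspected above can be relaxed directly; a sketch reusing the reader and docId from the earlier snippet (same imports):

// Relax the two cutoffs: a term is dropped if it occurs in fewer documents
// than minDocFreq, or is shorter than minWordLen characters.
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setMinDocFreq(1);   // keep terms that occur in at least one document
mlt.setMinWordLen(0);   // don't filter by length ("arv" is only 3 chars)
mlt.setMinTermFreq(1);  // per-document term-frequency cutoff, also worth checking
Query query = mlt.like(docId);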
The ICU project ( http://site.icu-project.org/ ) has analyzers for Lucene, and
they have been ported to ElasticSearch. Maybe those integrate better.
As for not doing some of the tokenization, I would think an extra tokenizer in
your chain would be just the thing.
-Paul
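For Lucene itself, the ICU integration ships in the analyzers-icu module; a minimal analyzer built on it might look like this (Lucene 4.x API; a sketch, not the one true setup):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

// Unicode-aware analyzer: UAX#29 segmentation (with per-script tailoring)
// followed by ICU case folding, accent folding, and normalization.
public final class ICUAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new ICUTokenizer(reader);
    TokenStream result = new ICUFoldingFilter(source);
    return new TokenStreamComponents(source, result);
  }
}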
Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of
interest to you, along with the token filters in that same module. - Steve
On Jan 8, 2013, at 6:43 PM, Trejkaz trej...@trypticon.org wrote:
On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi saisantosh...@gmail.com wrote:
On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe sar...@gmail.com wrote:
Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of
interest to you, along with the token filters in that same module. - Steve
ICUTokenizer sounds like it's implementing UAX #29, which is exactly the
Hi Igal,
Sounds like you don't have lucene-codecs-4.0.0.jar in Railo's classpath.
Steve
On Jan 8, 2013, at 10:53 PM, Igal @ getRailo.org i...@getrailo.org wrote:
I'm trying to access Lucene 4 from Railo (an open-source application server);
when I try to create an IndexWriterConfig I get the
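For context, a minimal sketch of that step on the 4.0 API. Constructing the config is where Lucene's SPI-loaded codec machinery first gets touched (codec names are resolved via META-INF/services entries on the classpath), which is plausibly why a classloader problem surfaces right here:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

// Constructing the config resolves the default Codec through Lucene's SPI
// machinery, so a classloader that can't see the needed jars tends to fail here.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);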
hi Steve,
thanks for your reply. At first I also thought that, so I added
lucene-codecs-4.0.0.jar, which caused another error and prompted me to
add commons-codec-1.7.jar as well.
This error is after I added those two jars, but now I'm thinking -- I
added them to Tomcat's classpath (Railo
Hmm, I don't know.
Are you actually using AppendingCodec?
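One way to see what the SPI loader can actually find from the running classloader is to list the registered codecs; a diagnostic sketch:

import org.apache.lucene.codecs.Codec;

public class CodecCheck {
  public static void main(String[] args) {
    // Print every codec name the SPI loader found on the classpath; if the
    // one named in the exception is missing, the jar holding it isn't
    // visible to the classloader that loaded lucene-core.
    for (String name : Codec.availableCodecs()) {
      System.out.println(name);
    }
    System.out.println("default: " + Codec.getDefault().getName());
  }
}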
For anybody else looking at this: the same exception is listed on
https://issues.apache.org/jira/browse/LUCENE-4391 - see also linked issue
https://issues.apache.org/jira/browse/LUCENE-4440. Both of these are marked
as fixed in Lucene
Dude. Go look. It allows for per-script specialization, with (non-UAX#29)
specializations by default for Thai, Lao, Myanmar and Hebrew. See
DefaultICUTokenizerConfig. It's filled with exactly the opposite of what you
were describing.
ICUTokenizerFactory's customizability has been
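To see the per-script tailoring in action, tokenize some Thai; a sketch assuming Lucene 4.x (the sample string is arbitrary Thai text):

import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ThaiDemo {
  public static void main(String[] args) throws Exception {
    // Thai writes no spaces between words; the default ICU config still
    // finds word boundaries because segmentation is tailored per script.
    ICUTokenizer tok = new ICUTokenizer(new StringReader("การที่ได้ต้องแสดงว่างานดี"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString());
    }
    tok.end();
    tok.close();
  }
}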