Need to set outputUnigrams = false with something like:
StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
TokenStream tokenStream = new StandardFilter(Version.LUCENE_43, source);
tokenStream = new LowerCaseFilter(Version.LUCENE_43, tokenStream);
ShingleFilter shingles = new ShingleFilter(tokenStream); // the shingle stage the question is about
shingles.setOutputUnigrams(false);
Tommy,
I'm sure that I don't fully understand your use case and your data. Some
thoughts:
1) I assume that fuzzy term search (edit distance = 2) isn't meeting your
needs or else you wouldn't have gone the ngram route. If fuzzy term search +
phrase/proximity search would meet your needs,
working very quickly!
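If that route would work, a minimal sketch (Lucene 4.x/5.x-era API; the field name "text" and the terms are placeholders) of fuzzy matching inside a proximity query:

```java
// Wrap FuzzyQuery so it can participate in a SpanNearQuery (proximity).
SpanQuery first  = new SpanMultiTermQueryWrapper<>(new FuzzyQuery(new Term("text", "partial"), 2));
SpanQuery second = new SpanMultiTermQueryWrapper<>(new FuzzyQuery(new Term("text", "match"), 2));
// slop 0, in order: an edit-distance-tolerant "phrase"
Query fuzzyPhrase = new SpanNearQuery(new SpanQuery[]{first, second}, 0, true);
```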
From: Allison, Timothy B. [talli...@mitre.org]
Sent: Thursday, July 18, 2013 7:49 PM
To: java-user@lucene.apache.org
Subject: RE: Partial word match using n-grams
Tommy,
I'm sure that I don't fully understand your use case and your
If Jack's recommendation for keeping stopwords will work in your use case, this
constructor should do the trick:
Analyzer analyzer = new StandardAnalyzer(VERSION, CharArraySet.EMPTY_SET);
From: Jack Krupansky [j...@basetechnology.com]
Sent: Friday, July
Try:
http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html
-Original Message-
From: raghavendra.k@barclays.com [mailto:raghavendra.k@barclays.com]
Sent: Friday, August 02, 2013 3:17 PM
To:
I agree with Ivan and Koji. You also might want to look into MoreLikeThis,
which should take care of finding the highest tf*idf terms for you to use in
your query --
http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html
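A minimal MoreLikeThis sketch (roughly the 4.x API; the field name "contents" and the `indexReader`/`analyzer`/`docId` names are placeholders):

```java
MoreLikeThis mlt = new MoreLikeThis(indexReader);
mlt.setAnalyzer(analyzer);
mlt.setFieldNames(new String[]{"contents"});
mlt.setMinTermFreq(1);   // loosen the defaults for small corpora
mlt.setMinDocFreq(1);
// Builds a query from the highest tf*idf terms of the given document
Query likeThis = mlt.like(docId);
```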
Best,
Tim
to find out the best term to choose.
Thanks.
2013/9/4 Allison, Timothy B. talli...@mitre.org:
I agree with Ivan and Koji. You also might want to look into MoreLikeThis,
which should take care of finding the highest tf*idf terms for you to use in
your query --
http://lucene.apache.org
All,
Apologies if I missed this in the documentation, but should:
FuzzyQuery q = new FuzzyQuery(new Term(field, "ab"), 2)
retrieve a document that contains:
abcd
and vice versa.
Same question for: xy~1 and a document that contains x.
Will submit test case if this is not a known issue or a
or edit distance 1 of x, then that may cause your example abcd
to rank below the top 50, and be pruned.
Mike McCandless
http://blog.mikemccandless.com
On Wed, Sep 11, 2013 at 9:42 PM, Allison, Timothy B. talli...@mitre.org wrote:
All,
Apologies if I missed this in the documentation, but should
Brian,
It looks like variable is variable; and you'll probably want to use some
combination of PhraseQuery, FuzzyQuery and maybe BooleanQuery. I've made my
best guess at what the underlying types of Queries would be that would meet
your use cases below.
free text : Doc1, Doc2 ::
1) An alternate method to your original question would be to do something like
this (I haven't compiled or tested this!):
Query q = new PrefixQuery(new Term(field, "app"));
q = q.rewrite(indexReader);
Set<Term> terms = new HashSet<Term>();
q.extractTerms(terms);
Term[] arr = terms.toArray(new Term[terms.size()]);
TotalHitCountCollector?
Others on the list may have a more efficient method, but that'd be
straightforward.
-Original Message-
From: Peyman Faratin [mailto:peymanfara...@gmail.com]
Sent: Monday, December 16, 2013 10:05 PM
To: java-user@lucene.apache.org
Subject: docFreq of a Boolean
To confirm, Lucene does not perform OCR. (If you are looking for open source
java ocr packages, you might take a look here for some ideas:
https://issues.apache.org/jira/browse/TIKA-93). Are you trying to find a
corpus of noisy OCR'd text to use as input into Lucene? If so, this looks
This will be of no immediate help, but in the next iteration of LUCENE-5317,
which I'll post in a few weeks (if I can find the time), I'll have an option to
pull concordance windows from character offsets which can be stored at index
time (so you wouldn't have to re-analyze). The current
Ditto Jack on ComplexPhraseQueryParser.
See also: https://issues.apache.org/jira/browse/LUCENE-5205
-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, February 05, 2014 6:59 PM
To: java-user@lucene.apache.org
Subject: Re: Wildcard searches
Take
Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, February 06, 2014 8:02 AM
To: java-user@lucene.apache.org
Subject: RE: Wildcard searches
Ditto Jack on ComplexPhraseQueryParser.
See also: https://issues.apache.org/jira/browse/LUCENE-5205
-Original Message
links for ComplexPhraseQueryParser that you may be aware of? I am looking for
some examples. Thanks!
Regards,
Raghu
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, February 06, 2014 8:02 AM
To: java-user@lucene.apache.org
Subject: RE: Wildcard
What analyzer are you using? smartcn?
From: kalaik [kalaiselva...@zohocorp.com]
Sent: Friday, March 21, 2014 5:10 AM
To: java-user@lucene.apache.org
Subject: QueryParser
Dear Team,
we are using Lucene in our product; it is searching well
To expand on Herb's comment, in Lucene, the StandardAnalyzer will break CJK
into characters:
1 : 轻
2 : 歌
3 : 曼
4 : 舞
5 : 庆
6 : 元
7 : 旦
If you initialize the classic QueryParser with StandardAnalyzer, the parser
will use that Analyzer to break this string into individual characters as
above.
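To illustrate (Lucene 5.x+ QueryParser signature; the field name "f" is a placeholder):

```java
QueryParser parser = new QueryParser("f", new StandardAnalyzer());
Query q = parser.parse("轻歌曼舞庆元旦");
// StandardAnalyzer emits one token per CJK character, so the parsed
// query is effectively f:轻 f:歌 f:曼 f:舞 f:庆 f:元 f:旦
```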
I agree entirely with Robert about not doubling up on the filter wrapper. To
stop unigrams, consider setOutputUnigrams(false).
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Wednesday, April 02, 2014 2:50 PM
To: java-user
Subject: Re: Strange behavior of
One simple hack which may or may not meet your objectives:
1) index each paragraph as if it were a document (this would then not allow
Boolean across paragraphs, which could be a problem)
2) set the position increment gap to, say, 100 and then index each sentence
within the paragraph as
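A sketch of option 2 (Lucene 5.x+ Analyzer API; the field name "para" and the pre-split `sentences` collection are placeholders):

```java
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new StandardTokenizer();
        return new TokenStreamComponents(tok, new LowerCaseFilter(tok));
    }
    @Override
    public int getPositionIncrementGap(String fieldName) {
        return 100; // consecutive field values sit 100 positions apart
    }
};
Document doc = new Document();
for (String sentence : sentences) { // one paragraph, pre-split into sentences
    doc.add(new TextField("para", sentence, Field.Store.NO));
}
// Phrase/span queries with slop < 100 can no longer match across sentences.
```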
Chris,
Good to see you over here.
There's probably an easier way...
I ran into this with geo queries, and the answer there is to test every value
in the multi field for the document that is a hit.
For the text search question, though, you could use analysis and then run a
SpanQuery against
Hi Darin,
Have you thought about using multivalued fields? If you set the
positionIncrementGap to something kind of big (well, 100, say :)), and you know
that your data is always authorfirst, authorlast, you could just search for
darin fulford.
The positionincrementgap will prevent matching
.
I guess I'm curious if what I was doing with the SpanQuery should have worked,
whether I misunderstood something, or if this is a bug.
Darin.
From: Allison, Timothy B. talli...@mitre.org
To: java-user@lucene.apache.org; Darin McBeath
The problem is that you are using an analyzer at index time but then not at
search time.
StandardAnalyzer will convert Name1 to name1 at index time.
At search time, because you aren't using a query parser (which would by default
lowercase your terms) you are literally searching for Name1 which
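The mismatch can be sketched like this (the field name "name" is a placeholder; Lucene 5.x+ QueryParser signature):

```java
// StandardAnalyzer indexed "Name1" as "name1", so a hand-built TermQuery
// must use the form that is actually in the index:
Query misses = new TermQuery(new Term("name", "Name1")); // no hits
Query hits   = new TermQuery(new Term("name", "name1")); // matches
// Or let a parser run the same analysis over the query text:
Query parsed = new QueryParser("name", new StandardAnalyzer()).parse("Name1");
```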
And if you're looking for a parser, take a look at LUCENE-5205.
[george washington carver]!~5,5
Find George Washington but not if carver appears 5 words before or 5 words
after.
-Original Message-
From: Michael Ryan [mailto:mr...@moreover.com]
Sent: Monday, July 14, 2014 9:58 PM
To:
If you're looking for a parser, take a look at ComplexPhraseQueryParser or
LUCENE-5205.
From: Uwe Schindler [u...@thetaphi.de]
Sent: Tuesday, September 23, 2014 6:32 AM
To: java-user@lucene.apache.org
Subject: RE: How to use 'PhraseQuery' with Fuzzy?!
If you can't change the analyzer, you can programmatically build a
MultiPhraseQuery (you'd have to fill in the alternatives ... not a great
option) or a SpanNearQuery composed of span-wrapped RegexpQueries (rewrites are
taken care of for you).
You might also want to look into using the
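The span-wrapped RegexpQuery option might look like this (the field "f" and the patterns are placeholders):

```java
SpanQuery first  = new SpanMultiTermQueryWrapper<>(new RegexpQuery(new Term("f", "qui.k")));
SpanQuery second = new SpanMultiTermQueryWrapper<>(new RegexpQuery(new Term("f", "fo.")));
// Rewrites of the wrapped multi-term queries are handled for you.
Query q = new SpanNearQuery(new SpanQuery[]{first, second}, 0, true);
```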
Might also look at concordance code on LUCENE-5317 and here:
https://github.com/tballison/lucene-addons/tree/master/lucene-5317
Let me know if you have any questions.
-Original Message-
From: Maisnam Ns [mailto:maisnam...@gmail.com]
Sent: Thursday, February 12, 2015 11:57 AM
To:
Agree on span query.
Might try SpanNotQuery(record, type, 0, 1)...
Find record but not if type comes one word after record.
<warning type="self_promotion">If you use LUCENE-5205's SpanQueryParser:
[record type]!~0,1</warning>
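The SpanNotQuery version, spelled out (the field "f" is a placeholder; the four-argument constructor is in Lucene 4.9+):

```java
SpanQuery record = new SpanTermQuery(new Term("f", "record"));
SpanQuery type   = new SpanTermQuery(new Term("f", "type"));
// Match "record" unless "type" appears from 0 words before to 1 word after it
Query q = new SpanNotQuery(record, type, 0, 1);
```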
-Original Message-
From: Trejkaz [mailto:trej...@trypticon.org]
passed to SpanCollector.collectLeaf() is the position,
rather than an index of any kind, which I think is going to mess things up for
you. But other than that, you've got the right idea. :-)
Alan Woodward
www.flax.co.uk
On 3 Nov 2015, at 00:26, Allison, Timothy B. wrote:
> All,
>
> I'm try
All,
I'm trying to find all spans in a given String via stored offsets in Lucene
5.3.1. I wanted to use the Highlighter with a NullFragmenter, but that is
highlighting only the matching terms, not the full Spans (related to
LUCENE-6796?).
My current code iterates through the spans,
And, if you're looking for a parser, take a look at LUCENE-5205's parser,
available as a standalone on github [0].
The syntax for the query mentioned in archived link would be:
"microsoft [belgium TO spain]"
[0] https://github.com/tballison/lucene-addons
-Original Message-
From: Uwe
Y, to add to Scott's advice, make sure to use the NullFragmenter and make sure
to setExpandMultiTermQuery to true on your scorer
QueryScorer scorer = new QueryScorer(query, field);
scorer.setExpandMultiTermQuery(true);
If you need to highlight entire phrases, see Koji
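Putting those two settings together (the `analyzer` and `fieldText` names are placeholders):

```java
QueryScorer scorer = new QueryScorer(query, field);
scorer.setExpandMultiTermQuery(true); // rewrite wildcard/fuzzy terms so they highlight
Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(new NullFragmenter()); // whole field = one fragment
String highlighted = highlighter.getBestFragment(analyzer, field, fieldText);
```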
Great to see 5.4.0 is out.
I tried to update my fork of LUCENE-5205, and found that multiterms within a
SpanNotQuery don't seem to be processed correctly.
[fever bieb*]!~2,5
Find "fever" but not if a multiterm hit on bieb* appears within 2 words before
or 5 words after.
In 5.3.1, this worked
I'm getting this (with a single document that contains the word 'quartz'):
Term freq(indexReader.totalTermFreq(term))=0
Term freq(indexReader.getSumTotalTermFreq("Doc"))=1
totalHits = 1
termStatics=0
Is this what you're getting? So...the search is working, but the term counts
aren't returning
If you want to find the matching terms, you have to do something like this:
Query rewritten = spanTerm.rewrite(indexReader);
Weight w = rewritten.createWeight(isearcher, false);
Set<Term> terms = new HashSet<>();
w.extractTerms(terms);
for
That package has an ICU tokenizer and the ICUFoldingFilter.
The ICUFoldingFilter does advanced (well, Unicode compliant) case
folding/lowercasing/normalization and is critical for non-ascii languages. You
can use that in place of the AsciiFoldingFilter and the LowerCaseFilter, and it
should
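For example, with CustomAnalyzer (Lucene 5.x+; requires the lucene-analyzers-icu module on the classpath):

```java
Analyzer analyzer = CustomAnalyzer.builder()
    .withTokenizer("icu")         // ICUTokenizerFactory
    .addTokenFilter("icuFolding") // ICUFoldingFilterFactory: Unicode case folding + normalization
    .build();
```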
Bouncing over to user’s list.
As you’ve found, spans are different from regular queries. MUST_NOT at the
BooleanQuery level means that the term must not appear anywhere in the
document; whereas spans focus on terms near each other.
Have you tried SpanNotQuery? This would allow you at least
ator.empty())) {
continue;
}
boolean cont = visitLeafReader(ctx, spans, filterItr, visitor);
...
}
-Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, April 12, 2016 10:07 AM
To: java-user@lucene.apache.org
Subject: migrat
Unfortunately, that does require a new type of query. As you probably know,
you can do the "at least" (minimum number should match) with regular
BooleanQueries, but you can't yet do the "at least" with SpanQuery. You might
want to look at modifying the SpanOrQuery to get this functionality.
Doh, sorry, Uwe, didn't see your response first.
Scratch SpanOr, take a look at SpanNear. This would be a great capability to
have!
-Original Message-
From: Allison, Timothy B.
Sent: Wednesday, August 31, 2016 3:30 PM
To: java-user@lucene.apache.org
Subject: RE: New type of proximity
https://issues.apache.org/jira/browse/LUCENE-7434
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, August 31, 2016 3:41 PM
To: java-user@lucene.apache.org
Subject: RE: New type of proximity/fuzzy search
Doh, sorry, Uwe, didn't see your response
Take a look at LUCENE-5317 [1] and LUCENE-5318 [2].
They're available on my github site [3], and I've pushed them to maven central
[4].
LUCENE-5318 is crazily useful as a term/phrase recommender system.
I haven't documented either very well yet. I'll try to add documentation to my
github
start; i < end; i++) {
Document doc = searcher.doc(hits[i].doc);
String path = doc.get("path");
System.out.println((i + 1) + ". " + path);
query.rewrite(reader);
}
}
}
Evert Wagenaa
Make sure to setRewriteMethod on the MultiTermQuery to:
MultiTermQuery.SCORING_BOOLEAN_REWRITE or CONSTANT_SCORE_BOOLEAN_REWRITE
Then something like this should work:
q = q.rewrite(reader);
Set<Term> terms = new HashSet<>();
Weight weight = q.createWeight(searcher, false);
Not part of Lucene, but take a look at LUCENE-5205 [1], which I actively
maintain on github [2].
And, you can integrate via maven [3]
See the jira issue for an overview of the query syntax, and let me know if you
have any questions.
[1] https://issues.apache.org/jira/browse/LUCENE-5205
[2]
I have code as part of LUCENE-5318 that counts terms that cooccur within a
window of where your query terms appear. This makes a really useful query term
recommender, and the math is dirt simple.
INPUT
Doc1: quick brown fox jumps over the lazy dog
Doc2: quick green fox leaps over the lazy dog
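The idea can be illustrated with a toy counter (plain Java, not the LUCENE-5318 code; the +/-2 window is an arbitrary choice):

```java
import java.util.HashMap;
import java.util.Map;

public class WindowCounts {
    /** Count terms co-occurring within +/- window tokens of a query term. */
    static Map<String, Integer> count(String[] docs, String query, int window) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            String[] toks = doc.split(" ");
            for (int i = 0; i < toks.length; i++) {
                if (!toks[i].equals(query)) continue;
                int from = Math.max(0, i - window);
                int to = Math.min(toks.length - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j != i) counts.merge(toks[j], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] docs = {
            "quick brown fox jumps over the lazy dog",
            "quick green fox leaps over the lazy dog"
        };
        // "quick" and "over" co-occur with "fox" in both docs, so they
        // score highest as recommendations for the query term "fox".
        System.out.println(count(docs, "fox", 2));
    }
}
```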
never mind...overwriting service file...
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, August 15, 2017 10:36 PM
To: java-user@lucene.apache.org
Subject: ICUFoldingFilter loading in IDE, but not jar ?!
In Intellij, when I run unit tests in my
In Intellij, when I run unit tests in my app that uses Lucene (6.6.0) and the
ICUFoldingFilterFactory, I see 96 filter factories available via
TokenFilterFactory.availableTokenFilters(). When I run the same code from a
jar built with the maven shade plugin, and I confirm that the jar actually
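The "overwriting service file" note above points at the usual culprit: the shade plugin lets one jar's META-INF/services SPI file clobber another's, so only some TokenFilterFactory registrations survive in the shaded jar. The standard fix is to merge those files (a sketch of the relevant maven-shade-plugin fragment):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <!-- Merge META-INF/services files instead of overwriting them -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```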
As an example of Mikhail's suggestion:
https://github.com/tballison/lucene-addons/blob/master/lucene-5317/src/main/java/org/apache/lucene/search/concordance/charoffsets/SpansCrawler.java
If you are trying to build a concordance, see ConcordanceSearcher in that
package.
See examples on how to
. No need to
write your own one.
Uwe
-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Friday, June 23, 2017 3:55 PM
> To: java-user@lucene.ap
I plagiarized Solr's org.apache.solr.analysis.TokenizerChain to read the
configuration from a json file:
https://github.com/tballison/lucene-addons/blob/6.x/gramreaper/src/main/java/org/tallison/gramreaper/ingest/schema/MyTokenizerChain.java
I wouldn't recommend using anything in gramreaper
Prob better question for user list.
From: Dominik Safaric [mailto:dominiksafa...@gmail.com]
Sent: Monday, February 26, 2018 1:20 PM
To: d...@lucene.apache.org
Subject: PointValues ordering
Given a multi-valued and non-indexed point value field, how does Lucene
internally store this kind of