RE: ShingleFilter

2013-07-18 Thread Allison, Timothy B.
Need to set outputUnigrams = false with something like: StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader); TokenStream tokenStream = new StandardFilter(Version.LUCENE_43, source); tokenStream = new LowerCaseFilter(Version.LUCENE_43, tokenStream);
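
To illustrate what switching off unigram output changes, here is a minimal plain-Java sketch of shingling. It is not Lucene code; the class and method names (ShingleSketch, shingles) are hypothetical and only mimic the effect of the outputUnigrams flag mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's ShingleFilter.
final class ShingleSketch {
    // Emits word n-grams of sizes max(2, minSize)..maxSize for each start
    // position. When outputUnigrams is true, the single tokens are emitted too.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize,
                                 boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            if (outputUnigrams) {
                out.add(tokens.get(start));
            }
            for (int size = Math.max(2, minSize);
                 size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("please", "divide", "this", "sentence");
        System.out.println(shingles(tokens, 2, 2, true));  // unigrams and bigrams
        System.out.println(shingles(tokens, 2, 2, false)); // bigrams only
    }
}
```

With the flag set to false, only the two-word shingles remain in the output, which is the behaviour the reply is aiming for.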

RE: Partial word match using n-grams

2013-07-18 Thread Allison, Timothy B.
Tommy, I'm sure that I don't fully understand your use case and your data. Some thoughts: 1) I assume that fuzzy term search (edit distance = 2) isn't meeting your needs or else you wouldn't have gone the ngram route. If fuzzy term search + phrase/proximity search would meet your needs,

RE: Partial word match using n-grams

2013-07-19 Thread Allison, Timothy B.
working very quickly! From: Allison, Timothy B. [talli...@mitre.org] Sent: Thursday, July 18, 2013 7:49 PM To: java-user@lucene.apache.org Subject: RE: Partial word match using n-grams Tommy, I'm sure that I don't fully understand your use case and your

RE: Searching for words beginning with or

2013-07-19 Thread Allison, Timothy B.
If Jack's recommendation for keeping stopwords will work in your use case, this constructor should do the trick: Analyzer analyzer = new StandardAnalyzer(VERSION, CharArraySet.EMPTY_SET) From: Jack Krupansky [j...@basetechnology.com] Sent: Friday, July

RE: PhraseQuery Search

2013-08-05 Thread Allison, Timothy B.
Try: http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html -Original Message- From: raghavendra.k@barclays.com [mailto:raghavendra.k@barclays.com] Sent: Friday, August 02, 2013 3:17 PM To:

RE: Lucene Text Similarity

2013-09-04 Thread Allison, Timothy B.
I agree with Ivan and Koji. You also might want to look into MoreLikeThis, which should take care of finding the highest tf*idf terms for you to use in your query -- http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html Best, Tim

RE: Lucene Text Similarity

2013-09-04 Thread Allison, Timothy B.
to find out what the best term to choose. Thanks. 2013/9/4 Allison, Timothy B. talli...@mitre.org: I agree with Ivan and Koji. You also might want to look into MoreLikeThis, which should take care of finding the highest tf*idf terms for you to use in your query -- http://lucene.apache.org

FuzzyQuery with short words

2013-09-11 Thread Allison, Timothy B.
All, Apologies if I missed this in the documentation, but should: FuzzyQuery q = new FuzzyQuery(new Term(field, ab), 2) retrieve a document that contains: abcd and vice versa. Same question for: xy~1 and a document that contains x. Will submit test case if this is not a known issue or a
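
For reference, the distances behind the question can be checked with a small plain-Java sketch of the classic Levenshtein distance. This is an assumption-laden illustration: the class name EditDistanceSketch is hypothetical, and FuzzyQuery's own distance computation may differ (for example in how transpositions are counted), although the pairs in the question involve no transpositions.

```java
// Hypothetical illustration only; not the code FuzzyQuery uses internally.
final class EditDistanceSketch {
    // Classic Levenshtein distance (insert, delete, substitute), two-row DP.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("ab", "abcd")); // two insertions
        System.out.println(levenshtein("xy", "x"));    // one deletion
    }
}
```

Both pairs fall within the maximum edits given in the question (2 and 1 respectively), so on distance alone a match would be expected.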

RE: FuzzyQuery with short words

2013-09-12 Thread Allison, Timothy B.
or edit distance 1 of x then they may cause your example abcd to rank below the top 50, and be pruned. Mike McCandless http://blog.mikemccandless.com On Wed, Sep 11, 2013 at 9:42 PM, Allison, Timothy B. talli...@mitre.org wrote: All, Apologies if I missed this in the documentation, but should

RE: variable string search

2013-09-13 Thread Allison, Timothy B.
Brian, It looks like variable is variable; and you'll probably want to use some combination of PhraseQuery, FuzzyQuery and maybe BooleanQuery. I've made my best guess at what the underlying types of Queries would be that would meet your use cases below. free text : Doc1, Doc2 ::

RE: Multiphrase Query in Lucene 4.3

2013-09-27 Thread Allison, Timothy B.
1) An alternate method to your original question would be to do something like this (I haven't compiled or tested this!): Query q = new PrefixQuery(new Term(field, "app")); q = q.rewrite(indexReader); Set<Term> terms = new HashSet<Term>(); q.extractTerms(terms); Term[] arr = terms.toArray(new
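
The idea of expanding a prefix into the concrete terms it matches can be sketched in plain Java over a sorted term dictionary. This is only an analogy under stated assumptions: PrefixExpansionSketch and expandPrefix are hypothetical names, and a NavigableSet stands in for the index's term dictionary; it is not how Lucene's rewrite is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only; a sorted set stands in for the term dictionary.
final class PrefixExpansionSketch {
    // Collects every dictionary term that starts with the prefix. In a sorted
    // set these terms are contiguous, so the scan can stop at the first miss.
    static List<String> expandPrefix(NavigableSet<String> dictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        NavigableSet<String> dictionary =
            new TreeSet<>(List.of("ape", "apple", "application", "apply", "banana"));
        System.out.println(expandPrefix(dictionary, "app"));
    }
}
```

The resulting list plays the role of the term array that the snippet above goes on to build.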

RE: docFreq of a Boolean query (LUCENE 4.3)

2013-12-17 Thread Allison, Timothy B.
TotalHitCountCollector? Others on the list may have a more efficient method, but that'd be straightforward. -Original Message- From: Peyman Faratin [mailto:peymanfara...@gmail.com] Sent: Monday, December 16, 2013 10:05 PM To: java-user@lucene.apache.org Subject: docFreq of a Boolean

RE: Sample Data to Test Lucene

2014-01-16 Thread Allison, Timothy B.
To confirm, Lucene does not perform OCR. (If you are looking for open source java ocr packages, you might take a look here for some ideas: https://issues.apache.org/jira/i#browse/TIKA-93). Are you trying to find a corpus of noisy OCR'd text to use as input into Lucene? If so, this looks

RE: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-04 Thread Allison, Timothy B.
This will be of no immediate help, but in the next iteration of LUCENE-5317, which I'll post in a few weeks (if I can find the time), I'll have an option to pull concordance windows from character offsets which can be stored at index time (so you wouldn't have to re-analyze). The current

RE: Wildcard searches

2014-02-06 Thread Allison, Timothy B.
Ditto Jack on ComplexPhraseQueryParser. See also: https://issues.apache.org/jira/i#browse/LUCENE-5205 -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, February 05, 2014 6:59 PM To: java-user@lucene.apache.org Subject: Re: Wildcard searches Take

RE: Wildcard searches

2014-02-06 Thread Allison, Timothy B.
Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, February 06, 2014 8:02 AM To: java-user@lucene.apache.org Subject: RE: Wildcard searches Ditto Jack on ComplexPhraseQueryParser. See also: https://issues.apache.org/jira/i#browse/LUCENE-5205 -Original Message

RE: Wildcard searches

2014-02-06 Thread Allison, Timothy B.
links for ComplexPhraseQueryParser that you may be aware of? I am looking for some examples. Thanks! Regards, Raghu -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, February 06, 2014 8:02 AM To: java-user@lucene.apache.org Subject: RE: Wildcard

RE: QueryParser

2014-03-21 Thread Allison, Timothy B.
What analyzer are you using? smartcn? From: kalaik [kalaiselva...@zohocorp.com] Sent: Friday, March 21, 2014 5:10 AM To: java-user@lucene.apache.org Subject: QueryParser Dear Team, we are using lucene in our product , it well searching

RE: QueryParser

2014-03-24 Thread Allison, Timothy B.
To expand on Herb's comment, in Lucene, the StandardAnalyzer will break CJK into characters: 1 : 轻 2 : 歌 3 : 曼 4 : 舞 5 : 庆 6 : 元 7 : 旦 If you initialize the classic QueryParser with StandardAnalyzer, the parser will use that Analyzer to break this string into individual characters as above.
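
The per-character output listed above can be mimicked with a few lines of plain Java. This is a rough sketch that only reproduces the visible result (one token per code point); the names CjkUnigramSketch and codePointTokens are hypothetical, and the real analyzer follows its own segmentation rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not StandardAnalyzer.
final class CjkUnigramSketch {
    // Splits text into one token per Unicode code point, so characters outside
    // the Basic Multilingual Plane are kept whole rather than split in two.
    static List<String> codePointTokens(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        // Pass the string to split as the first command-line argument.
        if (args.length > 0) {
            System.out.println(codePointTokens(args[0]));
        }
    }
}
```

Run on the seven-character string from the thread, this yields seven single-character tokens, matching the numbered list in the reply.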

RE: Strange behavior of ShingleFilter in Lucene 4.6

2014-04-02 Thread Allison, Timothy B.
I agree entirely with Robert about not doubling up on the filter, wrapper. To stop unigrams, consider setOutputUnigrams(false). -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Wednesday, April 02, 2014 2:50 PM To: java-user Subject: Re: Strange behavior of

RE: Proximity Search for SENTENCE and PARAGRAPH

2014-04-07 Thread Allison, Timothy B.
One simple hack which may or may not meet your objectives: 1) index each paragraph as if it were a document (this would then not allow Boolean across paragraphs, which could be a problem) 2) set the position increment gap to, say, 100 and then index each sentence within the paragraph as

RE: Question about multi-valued fields

2014-05-20 Thread Allison, Timothy B.
Chris, Good to see you over here. There's probably an easier way... I ran into this with geo queries, and the answer there is to test every value in the multi field for the document that is a hit. For the text search question, though, you could use analysis and then run a SpanQuery against

RE: SpanQuery not working as expected

2014-06-06 Thread Allison, Timothy B.
Hi Darin, Have you thought about using multivalued fields? If you set the positionIncrementGap to something kind of big (well 1, say :) ), and you know that your data is always authorfirst, authorlast, you could just search for darin fulford. The positionincrementgap will prevent matching
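
How a position increment gap keeps a phrase from matching across two values of a multivalued field can be sketched in plain Java. The sketch assumes the usual convention that the gap is added between the last token of one value and the first token of the next; PositionGapSketch and phraseMatches are hypothetical names, and the second author value is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; positions are assigned by hand, not by Lucene.
final class PositionGapSketch {
    // Assigns consecutive positions within each value and skips `gap` extra
    // positions between values, then checks whether `first` is immediately
    // followed by `second` (an exact two-word phrase).
    static boolean phraseMatches(List<List<String>> values, int gap,
                                 String first, String second) {
        List<String> terms = new ArrayList<>();
        List<Integer> positions = new ArrayList<>();
        int next = 0;
        for (List<String> value : values) {
            for (String token : value) {
                terms.add(token);
                positions.add(next++);
            }
            next += gap; // hole between this value and the next one
        }
        for (int i = 0; i < terms.size(); i++) {
            for (int j = 0; j < terms.size(); j++) {
                if (terms.get(i).equals(first) && terms.get(j).equals(second)
                        && positions.get(j) == positions.get(i) + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> authors =
            List.of(List.of("darin", "fulford"), List.of("john", "smith"));
        System.out.println(phraseMatches(authors, 1, "darin", "fulford")); // same value
        System.out.println(phraseMatches(authors, 0, "fulford", "john"));  // no gap
        System.out.println(phraseMatches(authors, 1, "fulford", "john"));  // gap of 1
    }
}
```

With a gap of 0 the last name of one author sits directly next to the first name of the following author, so the cross-value phrase matches; any gap of 1 or more breaks that adjacency while leaving matches inside a single value untouched.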

RE: SpanQuery not working as expected

2014-06-09 Thread Allison, Timothy B.
. I guess I'm curious if what I was doing with the SpanQuery should have worked, whether I misunderstood something, or if this is a bug. Darin. From: Allison, Timothy B. talli...@mitre.org To: java-user@lucene.apache.org java-user@lucene.apache.org; Darin McBeath

RE: Index Not Finding Results some times

2014-06-16 Thread Allison, Timothy B.
The problem is that you are using an analyzer at index time but then not at search time. StandardAnalyzer will convert Name1 to name1 at index time. At search time, because you aren't using a query parser (which would by default lowercase your terms) you are literally searching for Name1 which
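
The mismatch described above can be reproduced in miniature with plain Java. This is a toy model under stated assumptions: a HashSet stands in for the index, lowercasing stands in for the index-time analyzer, and the names CaseMismatchSketch and index are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; a set of strings stands in for the index.
final class CaseMismatchSketch {
    // Stand-in for index-time analysis: every value is lowercased before storage.
    static Set<String> index(String... values) {
        Set<String> terms = new HashSet<>();
        for (String value : values) {
            terms.add(value.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> terms = index("Name1");
        // Searching with the raw, un-analyzed term misses the indexed form.
        System.out.println(terms.contains("Name1"));
        // Applying the same normalization at search time finds it.
        System.out.println(terms.contains("Name1".toLowerCase(Locale.ROOT)));
    }
}
```

The first lookup fails and the second succeeds, which is the same asymmetry as building a term query by hand against a field that was analyzed at index time.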

RE: Finding words not followed by other words

2014-07-15 Thread Allison, Timothy B.
And if you're looking for a parser, take a look at LUCENE-5205. [george washington carver]!~5,5 Find George Washington but not if carver appears 5 words before or 5 words after. -Original Message- From: Michael Ryan [mailto:mr...@moreover.com] Sent: Monday, July 14, 2014 9:58 PM To:
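
The "match the phrase, but not if another word is nearby" logic described above can be approximated in plain Java over a token list. This is a simplified sketch, not the SpanQueryParser or SpanNotQuery implementation; SpanNotSketch and phraseNotNear are hypothetical names, and the window handling is an assumption about the intended semantics.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's span machinery.
final class SpanNotSketch {
    // Returns the start positions where `phrase` occurs and `exclude` does not
    // appear within `pre` tokens before the phrase or `post` tokens after it
    // (or inside the phrase itself).
    static List<Integer> phraseNotNear(List<String> tokens, List<String> phrase,
                                       String exclude, int pre, int post) {
        List<Integer> hits = new ArrayList<>();
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            if (!tokens.subList(start, start + phrase.size()).equals(phrase)) {
                continue;
            }
            int from = Math.max(0, start - pre);
            int to = Math.min(tokens.size(), start + phrase.size() + post);
            if (!tokens.subList(from, to).contains(exclude)) {
                hits.add(start);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> phrase = List.of("george", "washington");
        List<String> doc1 = List.of("george", "washington", "carver", "was", "a", "scientist");
        List<String> doc2 = List.of("george", "washington", "crossed", "the", "delaware");
        System.out.println(phraseNotNear(doc1, phrase, "carver", 5, 5)); // excluded
        System.out.println(phraseNotNear(doc2, phrase, "carver", 5, 5)); // kept
    }
}
```

In the first document the phrase is suppressed because the excluded word follows it within the window; in the second the phrase at position 0 survives.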

RE: How to use 'PhraseQuery' with Fuzzy?!

2014-09-23 Thread Allison, Timothy B.
If you're looking for a parser, take a look at ComplexPhraseQueryParser or LUCENE-5205. From: Uwe Schindler [u...@thetaphi.de] Sent: Tuesday, September 23, 2014 6:32 AM To: java-user@lucene.apache.org Subject: RE: How to use 'PhraseQuery' with Fuzzy?!

RE: multiterm numbers regexp search

2014-12-15 Thread Allison, Timothy B.
If you can't change the analyzer, you can programmatically build a MultiPhraseQuery (you'd have to fill in the alternatives ... not a great option) or a SpanNearQuery composed of span-wrapped RegexpQueries (rewrites are taken care of for you). You might also want to look into using the

RE: Proximity query

2015-02-12 Thread Allison, Timothy B.
Might also look at concordance code on LUCENE-5317 and here: https://github.com/tballison/lucene-addons/tree/master/lucene-5317 Let me know if you have any questions. -Original Message- From: Maisnam Ns [mailto:maisnam...@gmail.com] Sent: Thursday, February 12, 2015 11:57 AM To:

RE: ignore a match in a query

2015-07-24 Thread Allison, Timothy B.
Agree on span query. Might try SpanNotQuery(record, type, 0, 1)... Find record but not if type comes one word after record. <warning type=self_promotion> If you use LUCENE-5205's SpanQueryParser: record type!~0,1 </warning> -Original Message- From: Trejkaz [mailto:trej...@trypticon.org]

RE: extracting charoffsets from SpanWeight's getSpans() in 5.3.1?

2015-11-03 Thread Allison, Timothy B.
passed to SpanCollector.collectLeaf() is the position, rather than an index of any kind, which I think is going to mess things up for you. But other than that, you've got the right idea. :-) Alan Woodward www.flax.co.uk On 3 Nov 2015, at 00:26, Allison, Timothy B. wrote: > All, > > I'm try

extracting charoffsets from SpanWeight's getSpans() in 5.3.1?

2015-11-02 Thread Allison, Timothy B.
All, I'm trying to find all spans in a given String via stored offsets in Lucene 5.3.1. I wanted to use the Highlighter with a NullFragmenter, but that is highlighting only the matching terms, not the full Spans (related to LUCENE-6796?). My current code iterates through the spans,

RE: TermRangeQuery with Proximity

2015-12-08 Thread Allison, Timothy B.
And, if you're looking for a parser, take a look at LUCENE-5205's parser, available as a standalone on github [0]. The syntax for the query mentioned in archived link would be: "microsoft [belgium TO spain]" [0] https://github.com/tballison/lucene-addons -Original Message- From: Uwe

RE: Highlighting deprecation?

2015-12-02 Thread Allison, Timothy B.
Y, to add to Scott's advice, make sure to use the NullFragmenter and make sure to setExpandMultiTermQuery to true on your scorer QueryScorer scorer = new QueryScorer(query, field); scorer.setExpandMultiTermQuery(true); If you need to highlight entire phrases, see Koji

different handling of multiterm within a SpanNot Query in 5.3.1 vs 5.4.0?

2015-12-14 Thread Allison, Timothy B.
Great to see 5.4.0 is out. I tried to update my fork of LUCENE-5205, and found that multiterms within a SpanNotQuery don't seem to be processed correctly. [fever bieb*]!~2,5 Find "fever" but not if a multiterm hit on bieb* appears within 2 words before or 5 words after. In 5.3.1, this worked

RE: Wild card search not working

2015-11-30 Thread Allison, Timothy B.
I'm getting this (with a single document that contains the word 'quartz'): Term freq(indexReader.totalTermFreq(term))=0 Term freq(indexReader.getSumTotalTermFreq("Doc"))=1 totalHits = 1 termStatics=0 Is this what you're getting? So...the search is working, but the term counts aren't returning

RE: Wild card search not working

2015-11-30 Thread Allison, Timothy B.
If you want to find the matching terms, you have to do something like this: Query rewritten = spanTerm.rewrite(indexReader); Weight w = rewritten.createWeight(isearcher, false); Set<Term> terms = new HashSet<>(); w.extractTerms(terms); for

RE: analyzers-common VS analyzers-icu

2016-06-01 Thread Allison, Timothy B.
That package has an ICU tokenizer and the ICUFoldingFilter. The ICUFoldingFilter does advanced (well, Unicode compliant) case folding/lowercasing/normalization and is critical for non-ascii languages. You can use that in place of the AsciiFoldingFilter and the LowerCaseFilter, and it should

RE: SpanQuery - How to wrap a NOT subquery

2016-06-20 Thread Allison, Timothy B.
Bouncing over to user’s list. As you’ve found, spans are different from regular queries. MUST_NOT at the BooleanQuery level means that the term must not appear anywhere in the document; whereas spans focus on terms near each other. Have you tried SpanNotQuery? This would allow you at least

RE: migrating to 6.0 -- how to apply filter to getSpans

2016-05-23 Thread Allison, Timothy B.
ator.empty())) { continue; } boolean cont = visitLeafReader(ctx, spans, filterItr, visitor); ... } -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, April 12, 2016 10:07 AM To: java-user@lucene.apache.org Subject: migrat

RE: New type of proximity/fuzzy search

2016-08-31 Thread Allison, Timothy B.
Unfortunately, that does require a new type of query. As you probably know, you can do the "at least" (minimum number should match) with regular BooleanQueries, but you can't yet do the "at least" with SpanQuery. You might want to look at modifying the SpanOrQuery to get this functionality.

RE: New type of proximity/fuzzy search

2016-08-31 Thread Allison, Timothy B.
Doh, sorry, Uwe, didn't see your response first. Scratch SpanOr, take a look at SpanNear. This would be a great capability to have! -Original Message- From: Allison, Timothy B. Sent: Wednesday, August 31, 2016 3:30 PM To: java-user@lucene.apache.org Subject: RE: New type of proximity

RE: New type of proximity/fuzzy search

2016-09-01 Thread Allison, Timothy B.
https://issues.apache.org/jira/browse/LUCENE-7434 -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, August 31, 2016 3:41 PM To: java-user@lucene.apache.org Subject: RE: New type of proximity/fuzzy search Doh, sorry, Uwe, didn't see your response

RE: Cooccurrence matrices

2016-09-19 Thread Allison, Timothy B.
Take a look at LUCENE-5317 [1] and LUCENE-5318 [2]. They're available on my github site [3], and I've pushed them to maven central [4]. LUCENE-5318 is crazily useful as a term/phrase recommender system. I haven't documented either very well yet. I'll try to add documentation to my github

RE: How to get the terms matching a WildCardQuery in Lucene 6.2?

2016-10-25 Thread Allison, Timothy B.
start; i < end; i++) { Document doc = searcher.doc(hits[i].doc); String path = doc.get("path"); System.out.println((i + 1) + ". " + path); query.rewrite(reader); } } } Evert Wagenaa

RE: How to get the terms matching a WildCardQuery in Lucene 6.2?

2016-10-24 Thread Allison, Timothy B.
Make sure to setRewriteMethod on the MultiTermQuery to: MultiTermQuery.SCORING_BOOLEAN_REWRITE or CONSTANT_SCORE_BOOLEAN_REWRITE Then something like this should work: q = q.rewrite(reader); Set<Term> terms = new HashSet<>(); Weight weight = q.createWeight(searcher, false);

RE: query parser of SpanNearQuery

2016-12-05 Thread Allison, Timothy B.
Not part of Lucene, but take a look at LUCENE-5205 [1], which I actively maintain on github [2]. And, you can integrate via maven [3] See the jira issue for an overview of the query syntax, and let me know if you have any questions. [1] https://issues.apache.org/jira/browse/LUCENE-5205 [2]

RE: calculate term co-occurrence matrix

2017-03-20 Thread Allison, Timothy B.
I have code as part of LUCENE-5318 that counts terms that cooccur within a window of where your query terms appear. This makes a really useful query term recommender, and the math is dirt simple. INPUT Doc1: quick brown fox jumps over the lazy dog Doc2: quick green fox leaps over the lazy dog
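
The windowed counting idea can be sketched in plain Java using the two documents quoted above. This is not the LUCENE-5318 code, whose statistics are not shown in the snippet; CooccurrenceSketch and cooccurrences are hypothetical names, whitespace tokenization is an assumption, and only raw counts are computed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not the LUCENE-5318 implementation.
final class CooccurrenceSketch {
    // Counts every token that appears within `window` positions of `target`,
    // summed over all documents. The target occurrence itself is not counted.
    static Map<String, Integer> cooccurrences(List<String> docs, String target,
                                              int window) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            String[] tokens = doc.trim().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(target)) {
                    continue;
                }
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j != i) {
                        counts.merge(tokens[j], 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "quick brown fox jumps over the lazy dog",
            "quick green fox leaps over the lazy dog");
        System.out.println(cooccurrences(docs, "fox", 2));
    }
}
```

For the target "fox" with a window of two, "quick" and "over" are each counted twice (once per document), while "brown", "green", "jumps" and "leaps" are counted once; terms like these are the raw material for a query term recommender.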

RE: ICUFoldingFilter loading in IDE, but not jar ?!

2017-08-15 Thread Allison, Timothy B.
never mind...overwriting service file... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, August 15, 2017 10:36 PM To: java-user@lucene.apache.org Subject: ICUFoldingFilter loading in IDE, but not jar ?! In Intellij, when I run unit tests in my

ICUFoldingFilter loading in IDE, but not jar ?!

2017-08-15 Thread Allison, Timothy B.
In Intellij, when I run unit tests in my app that uses Lucene (6.6.0) and the ICUFoldingFilterFactory, I see 96 filter factories available via TokenFilterFactory.availableTokenFilters(). When I run the same code from a jar built with the maven shade plugin, and I confirm that the jar actually

RE: Correction: SpanNearQuery Class issue through spans object (Not through Searcher.search() method)

2017-06-20 Thread Allison, Timothy B.
As an example of Mikhail's suggestion: https://github.com/tballison/lucene-addons/blob/master/lucene-5317/src/main/java/org/apache/lucene/search/concordance/charoffsets/SpansCrawler.java If you are trying to build a concordance, see ConcordanceSearcher in that package. See examples on how to

RE: Extending Analyzer at runtime

2017-06-23 Thread Allison, Timothy B.
. No need to write your own one. Uwe - Uwe Schindler Achterdiek 19, D-28357 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Friday, June 23, 2017 3:55 PM > To: java-user@lucene.ap

RE: Extending Analyzer at runtime

2017-06-23 Thread Allison, Timothy B.
I plagiarized Solr's org.apache.solr.analysis.TokenizerChain to read the configuration from a json file: https://github.com/tballison/lucene-addons/blob/6.x/gramreaper/src/main/java/org/tallison/gramreaper/ingest/schema/MyTokenizerChain.java I wouldn't recommend using anything in gramreaper

FW: PointValues ordering

2018-02-26 Thread Allison, Timothy B.
Prob better question for user list. From: Dominik Safaric [mailto:dominiksafa...@gmail.com] Sent: Monday, February 26, 2018 1:20 PM To: d...@lucene.apache.org Subject: PointValues ordering Given a multi-valued and non-indexed point value field, how does Lucene internally store this kind of