Re: Compatibility problems between AnalyzerWrapper api & MultiTerms.getTerms api

2020-04-15 Thread
You're right! I made a mistake in my custom AnalyzerWrapper subclass... :-<

Adrien Grand wrote on Wed, Apr 15, 2020 at 5:37 PM:

> Could you create a test case that is as small as possible and reproduces
> the problem? I don't think that MultiTerms has anything to do with this.
>
> On Tue, Apr 14, 2020 at 9:52 AM 小鱼儿  wrote:
>
> > I'm using AnalyzerWrapper to apply a per-field analyzer for special
> > indexing:
> >
> > PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(..);
> > // PerFieldAnalyzerWrapper is a subclass of Lucene's AnalyzerWrapper
> > IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
> >
> > However, I found later that when I used MultiTerms.getTerms to load the
> > specific field's term dictionary, the terms looked as if they were still
> > analyzed by Lucene's StandardAnalyzer.
> >
> > I had to use another trick to bypass this problem (using a custom
> > IndexableField class to do per-field custom analysis, which need not be
> > detailed here). I guess MultiTerms.getTerms is an experimental API, so
> > it's not consistent with AnalyzerWrapper?
> >
>
>
> --
> Adrien
>


Compatibility problems between AnalyzerWrapper api & MultiTerms.getTerms api

2020-04-14 Thread
I'm using AnalyzerWrapper to apply a per-field analyzer for special indexing:

PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(..);
// PerFieldAnalyzerWrapper is a subclass of Lucene's AnalyzerWrapper
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);

However, I found later that when I used MultiTerms.getTerms to load the
specific field's term dictionary, the terms looked as if they were still
analyzed by Lucene's StandardAnalyzer.

I had to use another trick to bypass this problem (using a custom
IndexableField class to do per-field custom analysis, which need not be
detailed here). I guess MultiTerms.getTerms is an experimental API, so it's
not consistent with AnalyzerWrapper?


Re: Question about PhraseQuery's capacity...

2020-01-12 Thread
Hi, I have filed an issue against lucene-core:
https://issues.apache.org/jira/browse/LUCENE-9130
I just wrote a test case and found that BooleanQuery in MUST (filter) mode
works, but PhraseQuery fails.

小鱼儿 wrote on Fri, Jan 10, 2020 at 7:14 PM:

> The explain API helps! Thanks for the hint!
> I found that one case failed because I carelessly added another
> filter condition, but the other case (which is analyzed into 30 terms)
> still fails, and I don't know why.
> I guess I need to write a unit test that uses the MultiTerms.getTerms API
> to find out whether there is a mismatch in the analyzer's processing or a
> capacity limit in PhraseQuery...
>
> Mikhail Khludnev wrote on Fri, Jan 10, 2020 at 6:21 PM:
>
>> Hello,
>> Sometimes IndexSearcher.explain(Query, int) allows to analyse mismatches.
>>
>> On Fri, Jan 10, 2020 at 1:13 PM 小鱼儿  wrote:
>>
>> > After i directly call Analyzer.tokenStream() method to extract terms
>> from
>> > query, i still cannot get results. Doesn't know the why...
>> >
>> > Code when build index:
>> >IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
>> //new
>> > SmartChineseAnalyzer();
>> >
>> > Code do query:
>> > (1) extract terms from query text:
>> >
>> >  public List analysis(String fieldName, String text) {
>> > List terms = new ArrayList();
>> > TokenStream stream = analyzer.tokenStream(fieldName, text);
>> > try {
>> > stream.reset();
>> > while(stream.incrementToken()) {
>> > CharTermAttribute termAtt =
>> stream.getAttribute(CharTermAttribute.class);
>> > String term = termAtt.toString();
>> > terms.add(term);
>> > }
>> > stream.end();
>> > } catch (IOException e) {
>> > e.printStackTrace();
>> > log.error(e.getMessage(), e);
>> > }
>> > return terms;
>> > }
>> >
>> > (2) Code to construct a PhraseQuery:
>> >
>> > private Query buildPhraseQuery(Analyzer analyzer, String fieldName,
>> String
>> > queryText, int slop) {
>> > PhraseQuery.Builder builder = new PhraseQuery.Builder();
>> > builder.setSlop(2); //? max is 2;
>> > List terms = analyzer.analysis(fieldName, queryText);
>> > for(String termKeyword : terms) {
>> > Term term = new Term(fieldName, termKeyword);
>> > builder.add(term);
>> > }
>> > Query query = builder.build();
>> > return query;
>> > }
>> >
>> > Use BooleanQuery also failed:
>> >
>> > private Query buildBooleanANDQuery(Analyzer analyzer, String fieldName,
>> > String queryText) {
>> > BooleanQuery.Builder builder = new BooleanQuery.Builder();
>> > List terms = analyzer.analysis(fieldName, queryText);
>> > log.info("terms: "+StringUtils.join(terms, ", "));
>> > for(String termKeyword : terms) {
>> > Term term = new Term(fieldName, termKeyword);
>> > builder.add(new TermQuery(term), BooleanClause.Occur.MUST);
>> > }
>> > return builder.build();
>> > }
>> >
>> > Adrien Grand wrote on Fri, Jan 10, 2020 at 4:53 PM:
>> >
>> > > It should match. My guess is that you might not reusing the same
>> > positions
>> > > as set by the analysis chain when creating the phrase query? Can you
>> show
>> > > us how you build the phrase query?
>> > >
>> > > On Fri, Jan 10, 2020 at 9:24 AM 小鱼儿  wrote:
>> > >
>> > > > I use SmartChineseAnalyzer to do the indexing, and add a document
>> with
>> > a
>> > > > TextField whose value is a long sentence, when anaylized, will get
>> 18
>> > > > terms.
>> > > >
>> > > > & then i use the same value to construct a PhraseQuery, setting
>> slop to
>> > > 2,
>> > > > and adding the 18 terms concequently...
>> > > >
>> > > > I expect the search api to find this document, but it returns empty.
>> > > >
>> > > > Where am i wrong?
>> > > >
>> > >
>> > >
>> > > --
>> > > Adrien
>> > >
>> >
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>


Re: Question about PhraseQuery's capacity...

2020-01-10 Thread
The explain API helps! Thanks for the hint!
I found that one case failed because I carelessly added another
filter condition, but the other case (which is analyzed into 30 terms)
still fails, and I don't know why.
I guess I need to write a unit test that uses the MultiTerms.getTerms API to
find out whether there is a mismatch in the analyzer's processing or a
capacity limit in PhraseQuery...

Mikhail Khludnev wrote on Fri, Jan 10, 2020 at 6:21 PM:

> Hello,
> Sometimes IndexSearcher.explain(Query, int) helps to analyse mismatches.
>
> On Fri, Jan 10, 2020 at 1:13 PM 小鱼儿  wrote:
>
> > After i directly call Analyzer.tokenStream() method to extract terms from
> > query, i still cannot get results. Doesn't know the why...
> >
> > Code when build index:
> >IndexWriterConfig iwc = new IndexWriterConfig(analyzer); //new
> > SmartChineseAnalyzer();
> >
> > Code do query:
> > (1) extract terms from query text:
> >
> >  public List analysis(String fieldName, String text) {
> > List terms = new ArrayList();
> > TokenStream stream = analyzer.tokenStream(fieldName, text);
> > try {
> > stream.reset();
> > while(stream.incrementToken()) {
> > CharTermAttribute termAtt = stream.getAttribute(CharTermAttribute.class);
> > String term = termAtt.toString();
> > terms.add(term);
> > }
> > stream.end();
> > } catch (IOException e) {
> > e.printStackTrace();
> > log.error(e.getMessage(), e);
> > }
> > return terms;
> > }
> >
> > (2) Code to construct a PhraseQuery:
> >
> > private Query buildPhraseQuery(Analyzer analyzer, String fieldName,
> String
> > queryText, int slop) {
> > PhraseQuery.Builder builder = new PhraseQuery.Builder();
> > builder.setSlop(2); //? max is 2;
> > List terms = analyzer.analysis(fieldName, queryText);
> > for(String termKeyword : terms) {
> > Term term = new Term(fieldName, termKeyword);
> > builder.add(term);
> > }
> > Query query = builder.build();
> > return query;
> > }
> >
> > Use BooleanQuery also failed:
> >
> > private Query buildBooleanANDQuery(Analyzer analyzer, String fieldName,
> > String queryText) {
> > BooleanQuery.Builder builder = new BooleanQuery.Builder();
> > List terms = analyzer.analysis(fieldName, queryText);
> > log.info("terms: "+StringUtils.join(terms, ", "));
> > for(String termKeyword : terms) {
> > Term term = new Term(fieldName, termKeyword);
> > builder.add(new TermQuery(term), BooleanClause.Occur.MUST);
> > }
> > return builder.build();
> > }
> >
> > Adrien Grand wrote on Fri, Jan 10, 2020 at 4:53 PM:
> >
> > > It should match. My guess is that you might not reusing the same
> > positions
> > > as set by the analysis chain when creating the phrase query? Can you
> show
> > > us how you build the phrase query?
> > >
> > > On Fri, Jan 10, 2020 at 9:24 AM 小鱼儿  wrote:
> > >
> > > > I use SmartChineseAnalyzer to do the indexing, and add a document
> with
> > a
> > > > TextField whose value is a long sentence, when anaylized, will get 18
> > > > terms.
> > > >
> > > > & then i use the same value to construct a PhraseQuery, setting slop
> to
> > > 2,
> > > > and adding the 18 terms concequently...
> > > >
> > > > I expect the search api to find this document, but it returns empty.
> > > >
> > > > Where am i wrong?
> > > >
> > >
> > >
> > > --
> > > Adrien
> > >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Question about PhraseQuery's capacity...

2020-01-10 Thread
After I directly call the Analyzer.tokenStream() method to extract terms from
the query, I still cannot get results. I don't know why...

Code used when building the index:
   IndexWriterConfig iwc = new IndexWriterConfig(analyzer); // new SmartChineseAnalyzer();

Query-side code:
(1) extract terms from the query text:

public List<String> analysis(String fieldName, String text) {
    List<String> terms = new ArrayList<>();
    try (TokenStream stream = analyzer.tokenStream(fieldName, text)) {
        CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            terms.add(termAtt.toString());
        }
        stream.end();
    } catch (IOException e) {
        log.error(e.getMessage(), e);
    }
    return terms;
}

(2) Code to construct a PhraseQuery:

private Query buildPhraseQuery(Analyzer analyzer, String fieldName,
        String queryText, int slop) {
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.setSlop(slop);
    List<String> terms = analysis(fieldName, queryText);
    for (String termKeyword : terms) {
        builder.add(new Term(fieldName, termKeyword));
    }
    return builder.build();
}

Using a BooleanQuery also failed:

private Query buildBooleanANDQuery(Analyzer analyzer, String fieldName,
        String queryText) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    List<String> terms = analysis(fieldName, queryText);
    log.info("terms: " + StringUtils.join(terms, ", "));
    for (String termKeyword : terms) {
        builder.add(new TermQuery(new Term(fieldName, termKeyword)),
                BooleanClause.Occur.MUST);
    }
    return builder.build();
}

Adrien Grand wrote on Fri, Jan 10, 2020 at 4:53 PM:

> It should match. My guess is that you might not be reusing the same
> positions set by the analysis chain when creating the phrase query. Can you
> show us how you build the phrase query?
>
> On Fri, Jan 10, 2020 at 9:24 AM 小鱼儿  wrote:
>
> > I use SmartChineseAnalyzer to do the indexing, and add a document with a
> > TextField whose value is a long sentence, when anaylized, will get 18
> > terms.
> >
> > & then i use the same value to construct a PhraseQuery, setting slop to
> 2,
> > and adding the 18 terms concequently...
> >
> > I expect the search api to find this document, but it returns empty.
> >
> > Where am i wrong?
> >
>
>
> --
> Adrien
>


Re: Question about PhraseQuery's capacity...

2020-01-10 Thread
Hi Adrien,
 I think I made a mistake:
 There are two levels of processing in an Analyzer: one is the Tokenizer,
which here is HMMChineseTokenizer, and the other is the Analyzer itself,
which may apply some filtering...
 I'm using Lucene's default interface to set an Analyzer instance for
indexing, but I was using only the Tokenizer to parse the raw query text when
building the Query.
 The weird thing is, there is a Lucene query-parser module, but it deals
with meta syntax like AND/OR and fieldName:xxx, so I think it cannot directly
handle raw query text?
 But when I try to use the higher-level Analyzer.tokenStream() to parse
separate terms from the raw query text, I find the API very confusing:
TokenStream has no obvious interface for getting the terms (filtered tokens),
only the Attribute concept, which looks as if it is used only in Lucene
internals. Where can I find sample code to extract the filtered tokens from
the TokenStream interface?
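[Editor's note: a minimal sketch of the standard pattern for pulling filtered tokens out of a TokenStream via CharTermAttribute; the analyzer and field name are placeholders, not code from this thread.]

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class TokenExtractor {
    /** Runs the full analysis chain (tokenizer + filters) and collects the emitted terms. */
    public static List<String> extractTerms(Analyzer analyzer, String fieldName, String text)
            throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream(fieldName, text)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                    // mandatory before incrementToken()
            while (stream.incrementToken()) {
                terms.add(termAtt.toString()); // copy; the attribute object is reused per token
            }
            stream.end();                      // mandatory before close()
        }
        return terms;
    }
}
```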

Adrien Grand wrote on Fri, Jan 10, 2020 at 4:53 PM:

> It should match. My guess is that you might not be reusing the same
> positions set by the analysis chain when creating the phrase query. Can you
> show us how you build the phrase query?
>
> On Fri, Jan 10, 2020 at 9:24 AM 小鱼儿  wrote:
>
> > I use SmartChineseAnalyzer to do the indexing, and add a document with a
> > TextField whose value is a long sentence, when anaylized, will get 18
> > terms.
> >
> > & then i use the same value to construct a PhraseQuery, setting slop to
> 2,
> > and adding the 18 terms concequently...
> >
> > I expect the search api to find this document, but it returns empty.
> >
> > Where am i wrong?
> >
>
>
> --
> Adrien
>


Question about PhraseQuery's capacity...

2020-01-10 Thread
I use SmartChineseAnalyzer for the indexing and add a document with a
TextField whose value is a long sentence; when analyzed, it yields 18 terms.

Then I use the same value to construct a PhraseQuery, setting slop to 2 and
adding the 18 terms consecutively...

I expect the search API to find this document, but it returns empty.

Where am I wrong?
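[Editor's note: the fix that emerges later in the thread is to reuse the positions produced by the analysis chain instead of adding the terms at consecutive positions. A sketch of that approach, as an illustration rather than the poster's actual code:]

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public final class PhraseQueries {
    /** Builds a PhraseQuery whose term positions mirror the analysis chain's increments. */
    public static PhraseQuery build(Analyzer analyzer, String field, String text, int slop)
            throws IOException {
        PhraseQuery.Builder builder = new PhraseQuery.Builder();
        builder.setSlop(slop);
        try (TokenStream stream = analyzer.tokenStream(field, text)) {
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posAtt =
                    stream.addAttribute(PositionIncrementAttribute.class);
            stream.reset();
            int position = -1;
            while (stream.incrementToken()) {
                position += posAtt.getPositionIncrement(); // honors stopword holes etc.
                builder.add(new Term(field, termAtt.toString()), position);
            }
            stream.end();
        }
        return builder.build();
    }
}
```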


What's the difference between LatLonPoint and LatLonDocValuesField?

2020-01-10 Thread
From reading the online documentation, my understanding is that LatLonPoint
is used for BKD indexing, while LatLonDocValuesField is used as the input for
the Sort argument.

But does that mean that if a POI has a GeoPoint-type "location" field, I must
add the same location value to the two fields? This confuses me, because the
API exposes internals to API users...

There seem to be two kinds of fields: the normal XxxField, which is for
indexing and is row-stored, and XxxDocValuesField, which is column-stored;
LatLonPoint may be taken as an indexing field...

In my opinion the two fields are both stored, so StoredField's naming is
somewhat misleading: if I add another StoredField with the same field name,
then later, when using List<IndexableField> fields = doc.getFields() to
retrieve the document's fields, I only get the last StoredField value; the
LatLonDocValuesField is not returned.
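[Editor's note: the usual pattern is indeed to add the same point under several field types, optionally under a distinct name for the stored copy to sidestep the retrieval confusion described above. A sketch; the field names are illustrative:]

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LatLonDocValuesField;
import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.document.StoredField;

public final class PoiDocuments {
    /** One logical location, three representations: BKD point for filtering,
     *  doc values for sorting, and a stored copy (distinct name) for retrieval. */
    public static Document newPoiDoc(double lat, double lon) {
        Document doc = new Document();
        doc.add(new LatLonPoint("location", lat, lon));            // box/distance queries
        doc.add(new LatLonDocValuesField("location", lat, lon));   // distance sort
        doc.add(new StoredField("location_raw", lat + "," + lon)); // doc.getFields() retrieval
        return doc;
    }
}
```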


Question about Lucene's IndexSearcher.search(Query query, int n) API's parameter n

2020-01-09 Thread
I'm doing a POI (point-of-interest) search using Lucene; each POI has a
"location" field of GeoPoint/lon-lat type. I need to do a keyword-range
search, but the resulting POIs need to be sorted by distance to a starting
point.

This "distance" is in fact a dynamically computed property that cannot be
used with the SortField API. I wonder whether Lucene could support a
"DynamicSortField"; that would be perfect. Otherwise I would have to use the
IndexSearcher.search(Query query, int n) API to first filter the top-n POIs
and then sort manually after all n documents' StoredFields have been loaded,
which seems inefficient.

The problem is that the parameter n in the IndexSearcher.search API has a
usability problem: it may not be large enough to cover all the candidates.
And the low-level search(Query, Collector) API seems short on documentation.
If n is set to a very large value, the later sort may be very inefficient...

My current idea: use finer-grained near-to-far sub geo ranges to
iteratively/incrementally search/filter -> load documents -> sort manually ->
combine.

Any suggestions?
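[Editor's note: distance sorting is in fact supported directly via LatLonDocValuesField.newDistanceSort, which sorts during collection without loading stored fields. A sketch; the field name is illustrative and assumes the location was indexed as a LatLonDocValuesField:]

```java
import java.io.IOException;
import org.apache.lucene.document.LatLonDocValuesField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

public final class DistanceSearch {
    /** Top-n hits of `query`, ordered by distance from (originLat, originLon). */
    public static TopDocs nearestFirst(IndexSearcher searcher, Query query,
            double originLat, double originLon, int n) throws IOException {
        Sort byDistance = new Sort(
                LatLonDocValuesField.newDistanceSort("location", originLat, originLon));
        return searcher.search(query, n, byDistance);
    }
}
```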


Need advice on an auto-keyword-correction mode custom query

2020-01-05 Thread
Hi everybody,

I want to implement an auto-keyword-correction mode custom query: suppose a
scenario where the user inputs a keyword query A, but due to a typo or other
reasons A should be B; A is not a valid term in Lucene's index while B is.
(I'm not considering NLP in a high-dimensional semantic space, which is out
of scope here.)

I could use two queries to do this, but that's too costly. What I need is an
"early-termination" mode:
 (1) keyword A hits a non-empty DocIDSet, so B is never queried; or
 (2) keyword A's DocIDSet is empty, and B's then matches.

That is, "A OR B" with the short-circuit semantics of C/C++. But I notice
that the SHOULD relationship in Lucene's BooleanQuery is not the solution.
Perhaps I need to implement another custom query class?

BTW, how can Lucene's Query API be composed in a higher-order way? Lucene's
"LeafContext" concept really confuses me...
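[Editor's note: absent a built-in short-circuit OR, a pragmatic two-pass sketch of the mode described above; the fallback cost is only paid when A matches nothing. Names are illustrative:]

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public final class FallbackSearch {
    /** Searches `primary` (keyword A); only if it matches nothing,
     *  searches `corrected` (keyword B). */
    public static TopDocs searchWithFallback(IndexSearcher searcher,
            Query primary, Query corrected, int n) throws IOException {
        TopDocs hits = searcher.search(primary, n);
        if (hits.totalHits.value > 0) {
            return hits;  // A hit a non-empty DocIDSet; B is never evaluated
        }
        return searcher.search(corrected, n);
    }
}
```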


Re: Question about combining InvertedIndex and SortField

2020-01-01 Thread
Hi Mikhail,

Your words are very encouraging. I was thinking I might need to write
another Lucene custom query to apply my business-specific "index-only sort"
and "early termination"... The SortField API says it can use any
numeric/String field, which is perfect. (In this way, Lucene should be able
to do a high-perf top-N query if SortField can support dynamically generated
ranking scores, not only natively indexed numeric/String fields.)

Mikhail Khludnev wrote on Tue, Dec 31, 2019 at 4:41 PM:

> Hello, 小鱼儿.
>
> On Tue, Dec 31, 2019 at 6:32 AM 小鱼儿  wrote:
>
> > Assume i first use keyword search to get a DocIDSet from inverted index,
> > then i want to sort these docIds by some numeric field, like a
> > `updateTime`, does Lucene do this without need of loading the Document
> > objects but only with an sorted index on `updateTime`?
>
> 1. Lucene doesn't load Document objects from stored-fields files while
> sorting, for sure.
> 2. Lucene uses a dedicated columnar data structure (DocValues built at
> index time, or in the worst case a lazily loaded FieldCache) to obtain
> field values while collecting search results from the inverted index.
> 3. One deviation from this generic algorithm is a sorted index with early
> termination; that's probably what you meant by "Index-Only Sort
> Optimization".
>
>
> > Which i call it
> > "Index-Only Sort Optimization" (MUST be some equal concepts in RDBMS?)
> >
> > And since Lucene has a `SortField` API, what does it do the sort? I
> thought
> >
> It brings up TopFieldCollector instead of the default TopScoreDocCollector.
>
> > SortField is just a post-processing...
> >
> Not really. Scoring/sorting is done alongside searching to reduce the
> memory footprint, by keeping only the top candidate results in a binary
> heap.
> IIRC it's described in this classic paper
> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Question about combining InvertedIndex and SortField

2019-12-30 Thread
Assume I first use a keyword search to get a DocIDSet from the inverted
index, and then I want to sort these docIDs by some numeric field such as
`updateTime`. Does Lucene do this without needing to load the Document
objects, using only a sorted index on `updateTime`? I call this "Index-Only
Sort Optimization" (there must be an equivalent concept in RDBMSs?)

And since Lucene has a `SortField` API, how does it do the sort? I thought
SortField was just post-processing...
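[Editor's note: Lucene's mechanism for this is doc values, as the reply below explains. A sketch of the index-time and query-time sides for sorting by `updateTime` without touching stored fields; names are illustrative:]

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public final class SortedSearch {
    /** Index time: add updateTime as a column-oriented doc value
     *  alongside the indexed fields. */
    public static void addUpdateTime(Document doc, long updateTimeMillis) {
        doc.add(new NumericDocValuesField("updateTime", updateTimeMillis));
    }

    /** Query time: newest first, without loading Document objects during collection. */
    public static TopDocs newestFirst(IndexSearcher searcher, Query query, int n)
            throws IOException {
        Sort byUpdateTime = new Sort(
                new SortField("updateTime", SortField.Type.LONG, /*reverse=*/ true));
        return searcher.search(query, n, byUpdateTime);
    }
}
```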


Re: Why Lucene's Suggest API can ONLY load field terms which is Store.YES?

2019-12-27 Thread
But I feel very confused by this design: if I can search by some indexable
field, there must be terms stored somewhere, so I should be able to get
those terms as a dictionary?

The Lucene docs say it uses the same field name for the two kinds of index
data when Store.YES is set, seemingly treating them the same; here I have to
create two field names to work around this confusing and internally
conflicting design...

Mikhail Khludnev wrote on Fri, Dec 27, 2019 at 5:05 PM:

> Hello,
> It's by design: StringFields are searchable and are filled from the
> analysis output; StoredFields return the input values.
> That's it.
>
> On Fri, Dec 27, 2019 at 11:32 AM 小鱼儿  wrote:
>
> > I have a document `category` field, which is a "|,;" separator separated
> > string, in indexing phase, i do manually split the value into atomic
> terms
> > and index as StringField, & i also add a same name StoredField which
> > contains original value form:
> >
> >
> >
> >
> >
> > *List terms = analyzer.analysis((String)fieldValue); for(String
> > term: terms) {  doc.add(new StringField(fieldName, term, Store.NO));
> > }doc.add(new StoredField(fieldName, (String)fieldValue));*
> >
> > Then i use Suggest API to load this field's all terms:
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > *Set terms = new HashSet();
> > DocumentDictionary dict = new DocumentDictionary(this.indexReader,
> > fieldName, null);InputIterator it;try {it =
> > dict.getEntryIterator();//BytesRef byteRef =
> null;
> >   while((byteRef = it.next()) != null){String
> term
> > = byteRef.utf8ToString();terms.add(term);}
> >   } catch (IOException e) {e.printStackTrace();
> > log.error(e.getMessage(), e);}*
> >
> > To my supprise, terms seems only returning the STORED value, which is the
> > original value form, but i expect they should be the terms i put in each
> > StringField!
> >
> > Is this a design miss or impl. limit?
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Why Lucene's Suggest API can ONLY load field terms which is Store.YES?

2019-12-27 Thread
I have a document `category` field, which is a string separated by "|,;"
separators. In the indexing phase I manually split the value into atomic
terms and index each as a StringField, and I also add a StoredField with the
same name that contains the original value:





List<String> terms = analyzer.analysis((String) fieldValue);
for (String term : terms) {
    doc.add(new StringField(fieldName, term, Store.NO));
}
doc.add(new StoredField(fieldName, (String) fieldValue));

Then I use the Suggest API to load all of this field's terms:















Set<String> terms = new HashSet<>();
DocumentDictionary dict = new DocumentDictionary(this.indexReader, fieldName, null);
try {
    InputIterator it = dict.getEntryIterator();
    BytesRef byteRef;
    while ((byteRef = it.next()) != null) {
        terms.add(byteRef.utf8ToString());
    }
} catch (IOException e) {
    log.error(e.getMessage(), e);
}

To my surprise, the iterator only returns the STORED value, i.e. the
original value form, but I expected it to return the terms I put into each
StringField!

Is this a design miss or an implementation limit?
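[Editor's note: DocumentDictionary iterates stored values by design, as the reply explains. To enumerate the analyzed terms that actually sit in the inverted index, one can walk the terms dictionary instead; a sketch:]

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public final class TermDictionary {
    /** Enumerates the indexed (analyzed) terms of a field straight from
     *  the inverted index, regardless of Store.YES/NO. */
    public static List<String> indexedTerms(IndexReader reader, String field)
            throws IOException {
        List<String> result = new ArrayList<>();
        Terms terms = MultiTerms.getTerms(reader, field);
        if (terms != null) {
            TermsEnum it = terms.iterator();
            for (BytesRef term = it.next(); term != null; term = it.next()) {
                result.add(term.utf8ToString());
            }
        }
        return result;
    }
}
```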


How can i specify a custom Analyzer for a Field of Document?

2019-12-09 Thread
Directory indexDataDir = FSDirectory.open(Paths.get("index_data"));
Analyzer analyzer = MyLuceneAnalyzerFactory.newInstance();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setRAMBufferSizeMB(256.0);
IndexWriter indexWriter = new IndexWriter(indexDataDir, iwc);

Like the above code, which I use for building the index: the problem is, why
can I only set one analyzer for the whole index?

I have a document set where most fields to index are plain text, suited to a
StandardAnalyzer or a SmartChineseAnalyzer. But I also have a special field
of a keyword-list type, like "A;B;C", for which I would like full control of
the analysis step.

How can I do this in Lucene?
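[Editor's note: Lucene's stock answer to this is PerFieldAnalyzerWrapper, which delegates to a different analyzer per field name. A sketch; the field name and the chosen analyzers are illustrative:]

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public final class PerFieldConfig {
    /** One analyzer per field: the keyword-list field gets full control,
     *  everything else falls back to StandardAnalyzer. */
    public static IndexWriterConfig newConfig() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("keywordList", new KeywordAnalyzer()); // or a custom ';'-splitting analyzer
        Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
        return new IndexWriterConfig(analyzer);
    }
}
```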


Re: Need suggestions on implementing a custom query (offload R-tree filter to fully in-memory) on Lucene-8.3

2019-12-04 Thread
Hi Adrien,

As to my native implementation, which combines an inverted index and an
R-tree distance query (index data fully loaded into memory): I use a
bounding box for a coarse filter and then a precise "contains" check, so
both sides are "distance queries" (or what I call "point nearby queries").

I have implemented this custom Lucene Query, which filters the POIs within a
10 km distance range and then converts them to a Lucene BitSetIterator, and
tested its performance: back to 20ms/1000 QPS; on retest it improved to
15ms/1400 QPS (I don't know why). The initial Lucene BKD index performance
was only 150ms/130 QPS, so this is a big win!

NOTE: I first subclassed IndexSearcher and overrode the so-called
"low-level" search(Query query, Collector results) method, thinking Lucene
would pass my custom Query object into it after rewriting. I was wrong. But
the custom Query subclass approach finally works!

The remaining question is: why is the performance of the BKD-backed
LatLonPoint.newDistanceQuery so bad? My 18,000 POIs' index data is only ~7MB
on disk, so presumably it is in a single Lucene segment. When it is all
loaded into memory via the mmap codec, is the BKD index scanning all POI
locations? But this is only a guess.

BTW, the text-only query averages 10ms/2000 QPS, at the same level in both
my native in-memory inverted index and Lucene's index.

Adrien Grand wrote on Wed, Dec 4, 2019 at 4:14 PM:

> Are you sure you are comparing apples to apples? The first paragraph
> mentions a range filter, which would be LatLonPoint#newBoxQuery, but
> then you mentioned LatLonPoint#newDistanceQuery, which is
> significantly more costly due to the need to compute distances.
>
> If you plan to combine text queries with your geo queries, I'd also
> advise to index both with LatLonPoint and LatLonDocValuesField, and
> then use IndexOrDocValuesQuery at query time. Typically something like
> this:
>
> ```
> Query textQuery = ...;
> Query latLonPointQuery = LatLonPoint.newBoxQuery("poi", www, xxx, yyy,
> zzz);
> Query latLonDocValuesQuery =
> LatLonDocValuesField.newSlowBoxQuery("poi", www, xxx, yyy, zzz);
> Query poiQuery = new IndexOrDocValuesQuery(latLonPointQuery,
> latLonDocValuesQuery);
> Query query = new BooleanQuery.Builder()
> .add(textQuery, Occur.MUST)
> .add(poiQuery, Occur.FILTER)
> .build();
> ```
>
> On Wed, Dec 4, 2019 at 5:31 AM 小鱼儿  wrote:
> >
> > Background: i need to implement a document indexing and search for
> > POIs(point of interest) under LBS scene. A POI has name, address, and
> > location(LatLonPoint), and i want to combine a text query with a
> > geo-spatial 2d range filter.
> >
> > The problem is, when i first build a native in-memory index which use a
> > simple BitSet as DocIDSet type and STRTree class from the famous JTS
> lib, i
> > get 20ms/1000qps perf metrics with 1w8 POIs on my laptop(Windows 7 x64,
> use
> > mmap codec). But when i use Lucene-8.3 to implement the same
> > functionality(which use LatLonPoint.newDistanceQuery which seems use the
> > default BKD tree index), i only get 150ms/130qps which is a very bad
> > degrade?
> >
> > So my idea is, can i do a custom filter query, which builds a fully
> > in-memory R-tree index to boost the spatial2d range filter performance? I
> > need to access Lucene's internal DocIDSet class so i can do a fast merge
> > with no scoring needed. Hope this will improve the query performance.
> >
> > Any suggestions?
>
>
>
> --
> Adrien
>
>
>


Need suggestions on implementing a custom query (offload R-tree filter to fully in-memory) on Lucene-8.3

2019-12-03 Thread
Background: I need to implement document indexing and search for POIs
(points of interest) in an LBS scene. A POI has a name, an address, and a
location (LatLonPoint), and I want to combine a text query with a
geo-spatial 2D range filter.

The problem is: when I first build a native in-memory index that uses a
simple BitSet as the DocIDSet type and the STRTree class from the well-known
JTS lib, I get 20ms/1000 QPS with 18,000 POIs on my laptop (Windows 7 x64,
mmap codec). But when I use Lucene 8.3 to implement the same functionality
(using LatLonPoint.newDistanceQuery, which seems to use the default BKD tree
index), I only get 150ms/130 QPS, which is a very bad degradation.

So my idea is: can I write a custom filter query that builds a fully
in-memory R-tree index to boost the spatial 2D range filter performance? I
need access to Lucene's internal DocIDSet class so I can do a fast merge
with no scoring. I hope this will improve query performance.

Any suggestions?