[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-14 Thread Ignacio Vera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579547#comment-16579547
 ] 

Ignacio Vera commented on LUCENE-8452:
--

{quote}we could make them independent, eg. by indexing (x1, y1, x2 - x1, y2 - 
y1, x3 - x1, y3 - y1) instead of (x1, y1, x2, y2, x3, y3)
{quote}
+1, it looks promising...

 
{quote}already have a WKT parser for {{LatLonShape}} - lines and polygons - 
that I can commit to luceneutil separately if interested
{quote}
Is there any reason not to add this utility to Lucene? It looks to me it would 
be very useful. (Note: We currently have the class 
{{SimpleGeoJSONPolygonParser}} which it seems to me it was added to support the 
GeoBenchmark for points).

 
{quote}I can extract a smaller set (e.g., 60M shapes to complement the 60M 
points in geobench)
{quote}
 

Awesome, but note that with 60M shapes we might not be able to compare the 
performance with spatial trees because it probably takes too long to index the 
data. 

 

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf, Lake.png, Park.png, River.png
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-14 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579481#comment-16579481
 ] 

Adrien Grand commented on LUCENE-8452:
--

This looks like a nice dataset for benchmarking indeed!

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf, Lake.png, Park.png, River.png
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-13 Thread Nicholas Knize (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578730#comment-16578730
 ] 

Nicholas Knize commented on LUCENE-8452:


+1 [~jpountz] I'm toying around with that approach a bit and can post some 
benchmark numbers when I have them.

As a side note (that may be of interest) I went ahead and extracted all 
linestrings, multilinestrings, and multipolygons from the latest planet OSM 
snapshot to run some local scale benchmarks and general tests with real world 
shape data. I converted the data from .pbf to WKT for easy ingest in luceneutil 
(and already have a WKT parser for {{LatLonShape}} - lines and polygons - that 
I can commit to luceneutil separately if interested). The data is quite large, 
and very good (real world w/ varying spatial extents, vertex counts, etc). If 
there is any interest I can extract a smaller set (e.g., 60M shapes to 
complement the 60M points in geobench) and make available for geo nightly 
benchmarks.

Here are the numbers for the entire corpus of data:
||Type||Count||File Size||
|{{LINESTRING}}|157,075,680|88GB|
|{{MULTILINESTRING}}|532,043|7.1GB|
|{{MULTIPOLYGON}}|351,975,024|164GB|

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-13 Thread Adrien Grand (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578650#comment-16578650
 ] 

Adrien Grand commented on LUCENE-8452:
--

bq. What it actually concerns me is that the dimensions are not independent.
bq. we could also investigate different encoding methods for LatLonShape 's 
tessellated triangles

Maybe we could make them independent, eg. by indexing (x1, y1, x2 - x1, y2 - 
y1, x3 - x1, y3 - y1) instead of (x1, y1, x2, y2, x3, y3)? In addition to 
making dimensions independent, it would also have the nice property that an 
index that only has points should perform almost as well as LatLonPoint since 
it would partition the space exactly the same way.

LUCENE-7862 is also a simple change that I would expect to help a lot with such 
high numbers of dimensions, especially when there is correlation.

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-10 Thread Ignacio Vera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576408#comment-16576408
 ] 

Ignacio Vera commented on LUCENE-8452:
--

Thinking about this, we can make the format configurable with a flag. 

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-10 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576393#comment-16576393
 ] 

David Smiley commented on LUCENE-8452:
--

For WKT, there's already parsing in Spatial4J, a dependency of 
spatial-extras so the jar's should be there for the lucene-util benchmark 
to pick up.

This isn't to say Lucene *itself* should or shouldn't parse WKT but that's a 
new JIRA for sure and needn't delay the benchmarks.

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-10 Thread Ignacio Vera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576386#comment-16576386
 ] 

Ignacio Vera commented on LUCENE-8452:
--

Before moving it into luceneutil I would like to agree in the format for the 
benchmark datasets. I guess we have two options:

1 )  WKT:  {{org.apache.lucene.geo.Polygon}} only supports GeoJSON, therefore 
we would have to add  a {{SimpleWKTPolygonParser}} to {{org.apache.lucene.geo}} 
package similar to {{SimpleGeoJSONPolygonParser}}. WKT seems a more compact 
structure.

2)  GeoJSON: Maybe the price of the verbosity of JSON is not too bad.

 

 

 

 

 

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-10 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576334#comment-16576334
 ] 

Michael McCandless commented on LUCENE-8452:


Maybe we can fold these new shape benchmarks into luceneutil so it runs nightly?

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-09 Thread Nicholas Knize (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575338#comment-16575338
 ] 

Nicholas Knize commented on LUCENE-8452:


bq. the dimensions are not independent.

That's what I mean by "...due to the dimensional expansion issue (especially 
with point shapes)." For point shapes, {{LatLonShape}} does the opposite of 
Principal Component Analysis. Instead of reducing dimensionality it expands it 
from two to six and subsequently perfectly correlates dimensions 1, 3, 5 and 2, 
4, 6. It is not at all optimal and not at all expected to perform anywhere 
close to a structure optimized for true two dimension point space. In fact it 
goes against every concept of a optimal structure.

bq. to see how the BKD tree behaves for different dimensions

As I mentioned, the split/merge algorithms for BKD are naive for ease of 
implementation. Given the lack of heuristics along dimensions I'd expect 
QPS/HPS to drop as more dimensions are added.  Add a high level of correlation 
between dimensions and I'd expect it to get even worse. Certainly, for high 
dimension indexing, work will need to be done to optimize the BKD structure. 
Alternatively, we can work on a separate codec (possibly a R*/X/PR-tree hybrid) 
derived from the current BKD implementation that is designed for higher 
dimension space - since the kd structure itself was not designed for such 
applications. And subsequently restrict BKD to two dimensions only. 

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
> Attachments: BKDperf.pdf
>
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-09 Thread Nicholas Knize (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574980#comment-16574980
 ] 

Nicholas Knize commented on LUCENE-8452:


Thanks for putting this together [~ivera]. Its a great start to benchmarking 
BKD with data higher than two dimensions. In fact, I don't think we've ever 
fully benchmarked past two dimensions, and the current POINTS implementation 
supports up to eight.  For this particular geo usecase it would be nice to put 
together some shape benchmarks with more representative real world data, along 
with different relation queries, and add to the [nightly geo 
benchmarks|https://home.apache.org/~mikemccand/geobench.html]. A good 
representative set I've been looking at for some time is the [osm 
data|http://example.com] which contain a fair number of linestrings, closed 
linestrings, and polygons with number of vertices much higher than 4.

This first benchmarking step certainly highlights the deficiency of the BKD 
split and merge algorithms for high dimension data. Its a side effect of their 
naivety; which was chosen simply for ease of implementation. And indexing 
{{LatLonShape}} with only point and bounding box data is going to highlight 
those deficiencies quite spectacularly - if only due to the dimensional 
expansion issue (especially with point shapes). There are a lot of 
optimizations in the literature that could (and should) be applied for well 
past two dimensions. For example 
[x-tree|https://pdfs.semanticscholar.org/d1fd/9d524ca177e16b0909def97025feaeffac62.pdf]
 has some nice heuristics for avoiding the sliver problem that occurs much more 
frequently in high dimension data. And we could also investigate different 
encoding methods for {{LatLonShape}} 's tessellated triangles. For now, it 
would be good to capture performance across various datasets and carefully note 
in the javadocs what tradeoffs should be considered for certain datasets (i.e., 
don't use {{LatLonShape}} if you're only indexing/searching points). In the 
meantime, we can investigate best approaches for strengthening the BKD 
split/merge algorithms for the general high dimension use-case (similar to what 
was done for the [1d 
usecase|https://issues.apache.org/jira/browse/LUCENE-7396]) and possibly 
revisit the idea of a [new 
codec|https://issues.apache.org/jira/browse/LUCENE-7110] derived from BKD that 
is intended for the higher dimension usecase.

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks

2018-08-09 Thread Ignacio Vera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574539#comment-16574539
 ] 

Ignacio Vera commented on LUCENE-8452:
--

I think one of the keys to understand the performance of BKD-based shape 
indexing is to analyse  how 6-d bounding boxes are transformed into 2-d 
bounding boxes.  Our input is a 6d point representing the minimum value and 6d 
point representing the maximum value of a bounding box and we need to transform 
it into a 2d bounding box to check the relationship with the query shape. The 
logic to get the 2d bounding box goes as follow: 

In a 6-d points we have three dimensions corresponding to latitudes (0, 2, 4) 
and the other three for longitude (1,3,5). To build the 2d minimum point of the 
bounding box from the 6d minimum point, we compute the lat and lon using the 
minimum value of the dimensions for the corresponding lat/lon entries.  To 
build the 2d maximum point of the bounding box from the 6d maximum point, we 
compute the lat and lon using the maximum value of the dimensions for the 
corresponding lat/lon entries.  

For example: We have the following bounding box in 6d and the corresponding 
bounding box in 2d:

Min: (0,0,0,0,0,0) -> (0,0)

Max: (90,90,90,90,90,90)  -> (90,90)

 

Split first dimension in half:

 

Left tree:

Min: (0,0,0,0,0,0) -> (0,0)

Max: (*45*,90,90,90,90,90)  -> (90,90)

Right tree:

Min: (*45*,0,0,0,0,0) -> (0,0)

Max: (90,90,90,90,90,90)  -> (90,90)

 

Using left tree, split  third dimension in half :

 

Left-Left tree:

Min: (0,0,0,0,0,0) -> (0,0)

Max: (*45*,90,*45*,90,90,90)  -> (90,90)

Left-Right tree:

Min: (*45*,0,*45*,0,0,0) -> (0,0)

Max: (90,90,90,90,90,90)  -> (90,90)

 

Using left tree, split fifth dimension in half :

 

Left-Left-Left tree:

Min: (0,0,0,0,0,0) -> (0,0)

Max: (*45*,90,*45*,90,*45*,90)  -> (*45*,90)

Left-Left-Right tree:

Min: (*45*,0,*45*,0,*45*,0) -> (*45*,0)

Max: (90,90,90,90,90,90)  -> (90,90)

 

 We have to split all three dimension for latitude to actually have an effect 
on the 2-d bounding box. Which means that we need to visit many nodes of the 
tree before we actually filter out some of the branches.

[~nknize], does it make sense to you?

> BKD-based shape indexing benchmarks
> ---
>
> Key: LUCENE-8452
> URL: https://issues.apache.org/jira/browse/LUCENE-8452
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Ignacio Vera
>Priority: Major
>
> Initial benchmarking of the new BKD-based shape indexing suggest that 
> searches can be somewhat under-performing.   I open this ticket to share the 
> findings and to open a discussion how to speed up the solution.
>  
> The first benchmark is done by using the current benchmark in luceneutils for 
> indexing points and search by bounding box. We would expect {{LatLonShape}} 
> to be slower that {{LatLonPoint}} but still having a good performance. The 
> results of running such benchmark in my computer looks like:
>  
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311 
>  
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>  
> A second benchmark has been performed indexing around 10 million 4-side 
> polygons and around 3 million points. Searches are performed using bounding 
> boxes. The results are compared with spatial trees alternatives. Spatial 
> trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25:
>  
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>  
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.523800031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>  
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org