[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579547#comment-16579547 ] Ignacio Vera commented on LUCENE-8452: -- {quote}we could make them independent, eg. by indexing (x1, y1, x2 - x1, y2 - y1, x3 - x1, y3 - y1) instead of (x1, y1, x2, y2, x3, y3) {quote} +1, it looks promising... {quote}already have a WKT parser for {{LatLonShape}} - lines and polygons - that I can commit to luceneutil separately if interested {quote} Is there any reason not to add this utility to Lucene? It looks to me it would be very useful. (Note: We currently have the class {{SimpleGeoJSONPolygonParser}} which it seems to me it was added to support the GeoBenchmark for points). {quote}I can extract a smaller set (e.g., 60M shapes to complement the 60M points in geobench) {quote} Awesome, but note that with 60M shapes we might not be able to compare the performance with spatial trees because it probably takes too long to index the data. > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf, Lake.png, Park.png, River.png > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579481#comment-16579481 ] Adrien Grand commented on LUCENE-8452: -- This looks like a nice dataset for benchmarking indeed! > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf, Lake.png, Park.png, River.png > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578730#comment-16578730 ] Nicholas Knize commented on LUCENE-8452: +1 [~jpountz] I'm toying around with that approach a bit and can post some benchmark numbers when I have them. As a side note (that may be of interest) I went ahead and extracted all linestrings, multilinestrings, and multipolygons from the latest planet OSM snapshot to run some local scale benchmarks and general tests with real world shape data. I converted the data from .pbf to WKT for easy ingest in luceneutil (and already have a WKT parser for {{LatLonShape}} - lines and polygons - that I can commit to luceneutil separately if interested). The data is quite large, and very good (real world w/ varying spatial extents, vertex counts, etc). If there is any interest I can extract a smaller set (e.g., 60M shapes to complement the 60M points in geobench) and make available for geo nightly benchmarks. Here are the numbers for the entire corpus of data: ||Type||Count||File Size|| |{{LINESTRING}}|157,075,680|88GB| |{{MULTILINESTRING}}|532,043|7.1GB| |{{MULTIPOLYGON}}|351,975,024|164GB| > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578650#comment-16578650 ] Adrien Grand commented on LUCENE-8452: -- bq. What it actually concerns me is that the dimensions are not independent. bq. we could also investigate different encoding methods for LatLonShape 's tessellated triangles Maybe we could make them independent, eg. by indexing (x1, y1, x2 - x1, y2 - y1, x3 - x1, y3 - y1) instead of (x1, y1, x2, y2, x3, y3)? In addition to making dimensions independent, it would also have the nice property that an index that only has points should perform almost as well as LatLonPoint since it would partition the space exactly the same way. LUCENE-7862 is also a simple change that I would expect to help a lot with such high numbers of dimensions, especially when there is correlation. > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576408#comment-16576408 ] Ignacio Vera commented on LUCENE-8452: -- Thinking about this, we can make the format configurable with a flag. > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576393#comment-16576393 ] David Smiley commented on LUCENE-8452: -- For WKT, there's already parsing in Spatial4J, a dependency of spatial-extras so the jar's should be there for the lucene-util benchmark to pick up. This isn't to say Lucene *itself* should or shouldn't parse WKT but that's a new JIRA for sure and needn't delay the benchmarks. > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576386#comment-16576386 ] Ignacio Vera commented on LUCENE-8452: -- Before moving it into luceneutil I would like to agree in the format for the benchmark datasets. I guess we have two options: 1 ) WKT: {{org.apache.lucene.geo.Polygon}} only supports GeoJSON, therefore we would have to add a {{SimpleWKTPolygonParser}} to {{org.apache.lucene.geo}} package similar to {{SimpleGeoJSONPolygonParser}}. WKT seems a more compact structure. 2) GeoJSON: Maybe the price of the verbosity of JSON is not too bad. > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576334#comment-16576334 ] Michael McCandless commented on LUCENE-8452: Maybe we can fold these new shape benchmarks into luceneutil so it runs nightly? > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575338#comment-16575338 ] Nicholas Knize commented on LUCENE-8452: bq. the dimensions are not independent. That's what I mean by "...due to the dimensional expansion issue (especially with point shapes)." For point shapes, {{LatLonShape}} does the opposite of Principal Component Analysis. Instead of reducing dimensionality it expands it from two to six and subsequently perfectly correlates dimensions 1, 3, 5 and 2, 4, 6. It is not at all optimal and not at all expected to perform anywhere close to a structure optimized for true two dimension point space. In fact it goes against every concept of a optimal structure. bq. to see how the BKD tree behaves for different dimensions As I mentioned, the split/merge algorithms for BKD are naive for ease of implementation. Given the lack of heuristics along dimensions I'd expect QPS/HPS to drop as more dimensions are added. Add a high level of correlation between dimensions and I'd expect it to get even worse. Certainly, for high dimension indexing, work will need to be done to optimize the BKD structure. Alternatively, we can work on a separate codec (possibly a R*/X/PR-tree hybrid) derived from the current BKD implementation that is designed for higher dimension space - since the kd structure itself was not designed for such applications. And subsequently restrict BKD to two dimensions only. > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > Attachments: BKDperf.pdf > > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574980#comment-16574980 ] Nicholas Knize commented on LUCENE-8452: Thanks for putting this together [~ivera]. Its a great start to benchmarking BKD with data higher than two dimensions. In fact, I don't think we've ever fully benchmarked past two dimensions, and the current POINTS implementation supports up to eight. For this particular geo usecase it would be nice to put together some shape benchmarks with more representative real world data, along with different relation queries, and add to the [nightly geo benchmarks|https://home.apache.org/~mikemccand/geobench.html]. A good representative set I've been looking at for some time is the [osm data|http://example.com] which contain a fair number of linestrings, closed linestrings, and polygons with number of vertices much higher than 4. This first benchmarking step certainly highlights the deficiency of the BKD split and merge algorithms for high dimension data. Its a side effect of their naivety; which was chosen simply for ease of implementation. And indexing {{LatLonShape}} with only point and bounding box data is going to highlight those deficiencies quite spectacularly - if only due to the dimensional expansion issue (especially with point shapes). There are a lot of optimizations in the literature that could (and should) be applied for well past two dimensions. For example [x-tree|https://pdfs.semanticscholar.org/d1fd/9d524ca177e16b0909def97025feaeffac62.pdf] has some nice heuristics for avoiding the sliver problem that occurs much more frequently in high dimension data. And we could also investigate different encoding methods for {{LatLonShape}} 's tessellated triangles. For now, it would be good to capture performance across various datasets and carefully note in the javadocs what tradeoffs should be considered for certain datasets (i.e., don't use {{LatLonShape}} if you're only indexing/searching points). In the meantime, we can investigate best approaches for strengthening the BKD split/merge algorithms for the general high dimension use-case (similar to what was done for the [1d usecase|https://issues.apache.org/jira/browse/LUCENE-7396]) and possibly revisit the idea of a [new codec|https://issues.apache.org/jira/browse/LUCENE-7110] derived from BKD that is intended for the higher dimension usecase. > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail:
[jira] [Commented] (LUCENE-8452) BKD-based shape indexing benchmarks
[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574539#comment-16574539 ] Ignacio Vera commented on LUCENE-8452: -- I think one of the keys to understand the performance of BKD-based shape indexing is to analyse how 6-d bounding boxes are transformed into 2-d bounding boxes. Our input is a 6d point representing the minimum value and 6d point representing the maximum value of a bounding box and we need to transform it into a 2d bounding box to check the relationship with the query shape. The logic to get the 2d bounding box goes as follow: In a 6-d points we have three dimensions corresponding to latitudes (0, 2, 4) and the other three for longitude (1,3,5). To build the 2d minimum point of the bounding box from the 6d minimum point, we compute the lat and lon using the minimum value of the dimensions for the corresponding lat/lon entries. To build the 2d maximum point of the bounding box from the 6d maximum point, we compute the lat and lon using the maximum value of the dimensions for the corresponding lat/lon entries. For example: We have the following bounding box in 6d and the corresponding bounding box in 2d: Min: (0,0,0,0,0,0) -> (0,0) Max: (90,90,90,90,90,90) -> (90,90) Split first dimension in half: Left tree: Min: (0,0,0,0,0,0) -> (0,0) Max: (*45*,90,90,90,90,90) -> (90,90) Right tree: Min: (*45*,0,0,0,0,0) -> (0,0) Max: (90,90,90,90,90,90) -> (90,90) Using left tree, split third dimension in half : Left-Left tree: Min: (0,0,0,0,0,0) -> (0,0) Max: (*45*,90,*45*,90,90,90) -> (90,90) Left-Right tree: Min: (*45*,0,*45*,0,0,0) -> (0,0) Max: (90,90,90,90,90,90) -> (90,90) Using left tree, split fifth dimension in half : Left-Left-Left tree: Min: (0,0,0,0,0,0) -> (0,0) Max: (*45*,90,*45*,90,*45*,90) -> (*45*,90) Left-Left-Right tree: Min: (*45*,0,*45*,0,*45*,0) -> (*45*,0) Max: (90,90,90,90,90,90) -> (90,90) We have to split all three dimension for latitude to actually have an effect on the 2-d bounding box. Which means that we need to visit many nodes of the tree before we actually filter out some of the branches. [~nknize], does it make sense to you? > BKD-based shape indexing benchmarks > --- > > Key: LUCENE-8452 > URL: https://issues.apache.org/jira/browse/LUCENE-8452 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Ignacio Vera >Priority: Major > > Initial benchmarking of the new BKD-based shape indexing suggest that > searches can be somewhat under-performing. I open this ticket to share the > findings and to open a discussion how to speed up the solution. > > The first benchmark is done by using the current benchmark in luceneutils for > indexing points and search by bounding box. We would expect {{LatLonShape}} > to be slower that {{LatLonPoint}} but still having a good performance. The > results of running such benchmark in my computer looks like: > > LatLonPoint: > 89.717239531 sec to index > INDEX SIZE: 0.5087761553004384 GB > READER MB: 0.6098232269287109 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 72.91056132596746 > BEST QPS: 74.19031323419311 > > LatLonShape: > 89.388678805 sec to index > INDEX SIZE: 1.3028179928660393 GB > READER MB: 0.8827085494995117 > maxDoc=60844404 > totHits=221118844 > BEST M hits/sec: 1.0053836784184809 > BEST QPS: 1.0230305276205143 > > A second benchmark has been performed indexing around 10 million 4-side > polygons and around 3 million points. Searches are performed using bounding > boxes. The results are compared with spatial trees alternatives. Spatial > trees use a composite strategy, precision=0.001 degrees and distErrPct=0.25: > > s2 (Geo3d): > 1191.732124301 sec to index part 0 > INDEX SIZE: 3.2086284114047885 GB > READER MB: 19.453557014465332 > maxDoc=12949519 > totHits=705758537 > BEST M hits/sec: 13.311369588840462 > BEST QPS: 4.243743434150063 > > quad (JTS): > 3252.62925159 sec to index part 0 > INDEX SIZE: 4.523800031355 GB > READER MB: 41.15725612640381 > maxDoc=12949519 > totHits=705758357 > BEST M hits/sec: 35.54591930673003 > BEST QPS: 11.332252412866938 > > LatLonShape: > 30.32712009 sec to index part 0 > INDEX SIZE: 0.5627057952806354 GB > READER MB: 0.29498958587646484 > maxDoc=12949519 > totHits=705758228 > BEST M hits/sec: 3.4130465326433357 > BEST QPS: 1.0880999177593018 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org