[ https://issues.apache.org/jira/browse/LUCENE-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578730#comment-16578730 ]
Nicholas Knize edited comment on LUCENE-8452 at 8/13/18 6:19 PM:
-----------------------------------------------------------------

+1 [~jpountz] I'm toying around with that approach a bit and can post some benchmark numbers when I have them.

As a side note (that may be of interest), I went ahead and extracted all linestrings, multilinestrings, and multipolygons from the latest planet OSM snapshot to run some local scale benchmarks and general tests with real-world shape data. I converted the data from .pbf to WKT for easy ingest in luceneutil (and already have a WKT parser for {{LatLonShape}} - lines and polygons - that I can commit to luceneutil separately if interested). The data is quite large and very good (real world, with varying spatial extents, vertex counts, etc.). If there is any interest I can extract a smaller set (e.g., 60M shapes to complement the 60M points in geobench) and make it available for geo nightly benchmarks.

Here are the numbers for the entire corpus of data:

||Type||Count||File Size||
|{{LINESTRING}}|157,075,680|88GB|
|{{MULTILINESTRING}}|532,043|7.1GB|
|{{MULTIPOLYGON}}|351,975,024|164GB|

Here are three simple examples of the type of shape data contained in the planet OSM corpus (river, lake, and park polygons):

!River.png!
!Lake.png!
!Park.png!

> BKD-based shape indexing benchmarks
> -----------------------------------
>
>                 Key: LUCENE-8452
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8452
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/sandbox
>            Reporter: Ignacio Vera
>            Priority: Major
>         Attachments: BKDperf.pdf, Lake.png, Park.png, River.png
>
>
> Initial benchmarking of the new BKD-based shape indexing suggests that
> searches can be somewhat under-performing. I am opening this ticket to share
> the findings and to start a discussion on how to speed up the solution.
>
> The first benchmark uses the current luceneutil benchmark for indexing
> points and searching by bounding box. We would expect {{LatLonShape}} to be
> slower than {{LatLonPoint}} but still to perform well. The results of
> running this benchmark on my computer look like:
>
> LatLonPoint:
> 89.717239531 sec to index
> INDEX SIZE: 0.5087761553004384 GB
> READER MB: 0.6098232269287109
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 72.91056132596746
> BEST QPS: 74.19031323419311
>
> LatLonShape:
> 89.388678805 sec to index
> INDEX SIZE: 1.3028179928660393 GB
> READER MB: 0.8827085494995117
> maxDoc=60844404
> totHits=221118844
> BEST M hits/sec: 1.0053836784184809
> BEST QPS: 1.0230305276205143
>
> A second benchmark has been performed indexing around 10 million 4-sided
> polygons and around 3 million points. Searches are performed using bounding
> boxes. The results are compared with spatial tree alternatives.
> Spatial trees use a composite strategy, precision=0.001 degrees and
> distErrPct=0.25:
>
> s2 (Geo3d):
> 1191.732124301 sec to index part 0
> INDEX SIZE: 3.2086284114047885 GB
> READER MB: 19.453557014465332
> maxDoc=12949519
> totHits=705758537
> BEST M hits/sec: 13.311369588840462
> BEST QPS: 4.243743434150063
>
> quad (JTS):
> 3252.62925159 sec to index part 0
> INDEX SIZE: 4.5238002222031355 GB
> READER MB: 41.15725612640381
> maxDoc=12949519
> totHits=705758357
> BEST M hits/sec: 35.54591930673003
> BEST QPS: 11.332252412866938
>
> LatLonShape:
> 30.32712009 sec to index part 0
> INDEX SIZE: 0.5627057952806354 GB
> READER MB: 0.29498958587646484
> maxDoc=12949519
> totHits=705758228
> BEST M hits/sec: 3.4130465326433357
> BEST QPS: 1.0880999177593018
>
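
For a rough sense of scale, the figures reported for the second benchmark can be turned into ratios relative to {{LatLonShape}}. The Java sketch below is only illustrative arithmetic on the numbers quoted in the issue description. The names {{BenchmarkRatios}}, {{Result}}, {{reported}} and {{ratio}} are made up for this example and are not part of luceneutil or Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: not part of luceneutil. It only restates the numbers
// reported in the issue description and divides them.
final class BenchmarkRatios {

    /** One result block from the second benchmark (values copied from the issue). */
    static final class Result {
        final double indexSec;    // "sec to index part 0"
        final double indexSizeGb; // "INDEX SIZE ... GB"
        final double bestQps;     // "BEST QPS"

        Result(double indexSec, double indexSizeGb, double bestQps) {
            this.indexSec = indexSec;
            this.indexSizeGb = indexSizeGb;
            this.bestQps = bestQps;
        }
    }

    /** Returns value / baseline, i.e. how many times larger value is than baseline. */
    static double ratio(double value, double baseline) {
        return value / baseline;
    }

    /** The three result blocks of the second benchmark, in the order they were reported. */
    static Map<String, Result> reported() {
        Map<String, Result> results = new LinkedHashMap<>();
        results.put("s2 (Geo3d)", new Result(1191.732124301, 3.2086284114047885, 4.243743434150063));
        results.put("quad (JTS)", new Result(3252.62925159, 4.5238002222031355, 11.332252412866938));
        results.put("LatLonShape", new Result(30.32712009, 0.5627057952806354, 1.0880999177593018));
        return results;
    }

    public static void main(String[] args) {
        Map<String, Result> results = reported();
        Result baseline = results.get("LatLonShape");
        for (Map.Entry<String, Result> entry : results.entrySet()) {
            Result r = entry.getValue();
            // Each column is the strategy's value divided by the LatLonShape value.
            System.out.printf("%-12s index time x%.1f, index size x%.1f, QPS x%.1f%n",
                    entry.getKey(),
                    ratio(r.indexSec, baseline.indexSec),
                    ratio(r.indexSizeGb, baseline.indexSizeGb),
                    ratio(r.bestQps, baseline.bestQps));
        }
    }
}
```

On these numbers, s2 and quad take roughly 39x and 107x longer to index than {{LatLonShape}} and produce indexes about 5.7x and 8.0x larger, while serving roughly 3.9x and 10.4x more queries per second.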