the first version of SpatialIndex is ready!

Where to get it?

See attachment. It's still a very rough alpha version, don't use in production. The attached test case works only in my own testing environment, but you can see how the spatial index can be used.

What can it do?

You can insert your objects into the index along with a set of coordinates. These coordinates can be 2d, 3d and even n-d (I think). The coordinates can be points or bounding boxes. The data is paged to the database as needed, so you can create much bigger indices than would fit in memory. Later you can then query the index for objects in a given area or for the n-nearest objects to a certain point.

The underlying tree of the index is quite flexible, for example there are special variants of the tree which are "very well suited suited for mobile systems where queries on moving objects are common" [1]. The individual variants can also be customized extensively, e.g. you can change the leaf size, split factor, buffering and many more parameters. That's useful to minimize ConflictErrors.

For detailed information, please see http://pypi.python.org/pypi/Rtree , http://trac.gispython.org/spatialindex, http://anandmu.googlepages.com/comparison3finalreport.pdf and the spatialindex sources.

How does it work?

I've gone and extended the rtree c api a bit so it supports custom storages. This means you need the latest svn version if you want to test. I have windows binaries ready for those interested.

Then I wrote the code you see in spatialIndex.py. When you add an object to a SpatialIndex it generates a new entry with a random 64-bit integer and stores the mapping id->object into an LOBTree. The individual tree leaves are paged and stored via a custom rtree storage as binary data in an LOBTree which maps pageId->pageData.

A custom data manager makes sure the tree's buffers are flushed at transaction boundaries (commit/abort/savepoint/rollback). This ensures the spatial index is fully transactional.

All the heavy lifting is done by the rtree and spatialindex libraries. They are very well written in C++, so performance for the queries themselves is quite high. The slow part is fetching pages from ZODB (ZEO), but since objects which are located close to each other are likely to end up in the same leafs and pages, you take the "fetch from db" hit only for the first access and subsequent accesses will be cached (unless you are also writing a lot).

Known issues?

cleanness: The code is littered with ugly log statements. I've had a few problems with the data manager (see below).

ConflictErrors: If you have a tree with very few leaves (e.g. huge leave size or small number of objects) conflict errors will occur rather often, because the objects are saved to the same page. You can fine tune this by adjusting the rtree parameters (e.g. leaf_capacity, index_capacity, pagesize and near_minimum_overlap_factor).

data manager: I had to hook the transaction's join() method in a very ugly way, see the other post I made today for more about this.

entry ids: Entry ids are 64-bit. For very large trees this might be insufficient to prevent collisions. This is checked however and an error is raised when this happens.

performance: In some quick tests I did the index performed very well for my needs. You might want to write your own benchmark if you plan to really hammer the index.

packaging: I don't plan to create a package for this as I don't see much point in adding yet another package to the clutter of packages surrounding zodb.


[1] http://anandmu.googlepages.com/comparison3finalreport.pdf


Attachment: spatialIndex.py
Description: Binary data

Attachment: spatial_test.py
Description: Binary data

For more information about ZODB, see the ZODB Wiki:

ZODB-Dev mailing list  -  ZODB-Dev@zope.org

Reply via email to