8GB is workable. Make sure you use prepared statements to avoid recompiling your
insert 500 million times. Also, with this much data, it would probably be a very
good idea to give SQLite a much larger memory cache (either compiled in, or via
PRAGMA cache_size). Don't expect a miracle either: 500 million is a very large
number any way you look at it. SQLite is fast, but your queries will still take time.
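As a minimal sketch (assuming the points(x, y) table defined just below; error
handling omitted), the load loop might look like this in C. PRAGMA cache_size
is the runtime way to enlarge the page cache if you don't want to rebuild
SQLite with a bigger compile-time default:

#include <sqlite3.h>

void load_points(sqlite3 *db, const double *xs, const double *ys, long n)
{
    sqlite3_stmt *stmt;
    long i;

    /* Enlarge the page cache; 100000 pages is an arbitrary example value. */
    sqlite3_exec(db, "PRAGMA cache_size=100000;", 0, 0, 0);

    /* Compile the INSERT once; bind and step it 500 million times. */
    sqlite3_prepare_v2(db, "INSERT INTO points(x, y) VALUES (?, ?)",
                       -1, &stmt, 0);

    sqlite3_exec(db, "BEGIN;", 0, 0, 0);
    for (i = 0; i < n; i++) {
        sqlite3_bind_double(stmt, 1, xs[i]);
        sqlite3_bind_double(stmt, 2, ys[i]);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_exec(db, "COMMIT;", 0, 0, 0);
    sqlite3_finalize(stmt);
}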
With that in mind, I THINK this should work for you:
CREATE TABLE points
(
x double,
y double
);
CREATE INDEX idx_points_x_by_y ON points (x, y);
Using this, you can do a fully indexed query for any bounding rect:
SELECT x, y FROM points WHERE (x BETWEEN ? AND ?) AND (y BETWEEN ? AND ?)
(I recommend using EXPLAIN QUERY PLAN to verify that this DOES use the index,
but it should.)
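For example, from C (a hypothetical helper; the plan text varies by SQLite
version, so treat the expected output as illustrative):

#include <stdio.h>
#include <sqlite3.h>

void show_plan(sqlite3 *db)
{
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db,
        "EXPLAIN QUERY PLAN "
        "SELECT x, y FROM points "
        "WHERE (x BETWEEN ?1 AND ?2) AND (y BETWEEN ?3 AND ?4)",
        -1, &stmt, 0);
    /* The detail text is in the last column; expect it to mention
       idx_points_x_by_y. */
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        int last = sqlite3_column_count(stmt) - 1;
        printf("%s\n", (const char *)sqlite3_column_text(stmt, last));
    }
    sqlite3_finalize(stmt);
}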
If you need more advanced spatial calculations, you can do something like this
to first take advantage of the index and then do any additional calculations
to filter the results:
SELECT x, y, MYSPATIALFUNC(x, y) AS foo FROM points WHERE (x BETWEEN ? AND ?)
AND (y BETWEEN ? AND ?) AND foo < ?
Then just define a custom function for MYSPATIALFUNC.
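A minimal sketch of how that registration might look; the function body here
(squared distance from the origin) is purely illustrative:

#include <sqlite3.h>

static void my_spatial_func(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
    double x = sqlite3_value_double(argv[0]);
    double y = sqlite3_value_double(argv[1]);
    (void)argc;
    /* Substitute whatever spatial calculation you actually need. */
    sqlite3_result_double(ctx, x * x + y * y);
}

/* During setup, register it on the connection: */
void register_spatial(sqlite3 *db)
{
    sqlite3_create_function(db, "MYSPATIALFUNC", 2, SQLITE_UTF8, 0,
                            my_spatial_func, 0, 0);
}

Keep the function cheap; it runs once per row that survives the indexed
bounding-box filter.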
If you need to select the nearest point, you can query a small region of space
around that point and compute the nearest point from the small set returned
(use a custom function for maximum speed). If the region is empty, enlarge it
and try again. If the "nearest" point in the region is farther away than the
nearest edge of the region, enlarge the region enough to verify that it really
IS the nearest point. Not a perfect process, but simple enough, and it works
without writing a whole lot of extra code. You will need to evaluate the data
to determine the ideal "region" size for finding nearest points: too small and
you'll have to re-query a lot; too large and you'll have to sift through a lot
of data points.
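A sketch of that search loop; find_in_box() is a hypothetical helper that runs
the bounded query and reports the best candidate and its distance:

#include <sqlite3.h>

typedef struct { double x, y; } point;

/* Hypothetical: returns 1 and fills *best and *dist with the point nearest
   to (qx,qy) inside the box [x0,x1] x [y0,y1], or returns 0 if the box is
   empty. */
int find_in_box(sqlite3 *db, double x0, double x1, double y0, double y1,
                double qx, double qy, point *best, double *dist);

int nearest_point(sqlite3 *db, double qx, double qy, double start_r, point *out)
{
    double r = start_r;
    for (;;) {
        double d;
        if (find_in_box(db, qx - r, qx + r, qy - r, qy + r, qx, qy, out, &d)) {
            if (d <= r)
                return 1;   /* nothing outside the box can be closer */
            r = d;          /* candidate beyond the nearest edge: grow just enough */
        } else {
            r *= 2.0;       /* empty region: enlarge and try again */
        }
    }
}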
Hope that helps,
John
-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Igor Conom
Sent: Saturday, October 17, 2009 4:56 PM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] Creating a spatial index for a large number of
points - sorry for the text formatting
John, thank you for the comments!
Maybe I wasn't clear - the 10TB of data is separate. It contains a lot of other
data that I don't dream of storing in a database. But this bulk data is
structured in fixed-length records, each record containing a vector of floating
point values and some associated meta-information. Part of this meta-info is a
pair of points, (x1,y1) and (x2,y2). The number of records is in the range of a
few hundred million (say 500 million), which is the number of coordinate pairs
I need to handle. Each coordinate is represented as a 4-byte IEEE floating
point value, so 500 million records take 500 million x 4 values x 4 bytes ~ 8GB.
So I'll have to process about 8GB of points.
These points have common coordinates for (x1,y1). The number of groups
(distinct (x1,y1) pairs) is in the range of a few hundred thousand (say 1
million). Now, in this "index file" that I'm trying to create, I'll have about
1 million entries, each entry somehow containing the (x1,y1) pair and a list of
integers, which are really record numbers into the original data set. So if
nothing else is stored, the entire index file should be about 8GB in our case.
From what I read, sqlite3 should handle fairly well a db a few GB in size
containing a few million records. The reality is that my "groups" table will
store big rectangles, each containing many groups (I'm thinking about 1000), so
I expect that while the total size of the database stays the same (a few GB,
hopefully), the number of records in the r-tree and the associated data table
will be in the few-thousands range.
There are a few reasons SQLite is a candidate for this job:
1. This "index" that I'm trying to build needs to be stored, since it will be
used multiple times for further processing, so any in-memory structure I may
use needs to be written to disk at some point in some form.
2. The process of indexing may take some time (a few hours to one day, as you
read through 10TB of data). If the process gets interrupted somehow, I need to
be able to restart it gracefully. Here the journaling feature of sqlite comes
in very handy: e.g. I process the input in chunks of a few tens of MB of
points, and each chunk begins a transaction; if something happens, all the
tables involved remain in a consistent state, and I can just restart with the
last chunk processed (see the sketch after this list).
3. The database is a single self-contained file, which makes it easy to
integrate into a bigger project.
4. I used it before for something else :)
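A minimal sketch of the chunked load from point 2; the chunk type,
read_next_chunk() and insert_point() are hypothetical stand-ins for the real
input reader and insert logic, and the progress table is just one way to
record the last committed chunk:

#include <sqlite3.h>

typedef struct { double x1, y1; long recno; } pt;
typedef struct { pt *pts; long count; } chunk;

/* Hypothetical helpers supplied by the application: */
int read_next_chunk(chunk *c);               /* resumes after the last committed chunk */
void insert_point(sqlite3 *db, const pt *p);

void build_index(sqlite3 *db)
{
    chunk c;
    long i;
    while (read_next_chunk(&c)) {
        sqlite3_exec(db, "BEGIN;", 0, 0, 0);
        for (i = 0; i < c.count; i++)
            insert_point(db, &c.pts[i]);
        sqlite3_exec(db, "UPDATE progress SET last_chunk = last_chunk + 1;", 0, 0, 0);
        /* A crash before this COMMIT rolls back the whole chunk, so the run
           can restart cleanly from the same chunk. */
        sqlite3_exec(db, "COMMIT;", 0, 0, 0);
    }
}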
Points number 1 and 2 are important. So, unless I implement my own storage
system (think journal, think page management, etc.), I need to find something
to build my app on top of. The other candidates I considered were HDF5, and ...
well, I heard Firebird can be used in serverless (embedded) mode. Any other
ideas are welcome :)
-IC
----- Original Message ----
From: John Crenshaw <[email protected]>
To: General Discussion of SQLite Database <[email protected]>
Sent: Sat, October 17, 2009 2:33:30 PM
Subject: Re: [sqlite] Creating a spatial index for a large number of points -
sorry for the text formatting
I doubt SQLite is the right tool for this job, for a number of reasons.
First, if the data is as simple as you say, you are probably better off writing
your logic in straight C rather than SQL. SQLite is VERY fast, but there is
still an incredible amount of overhead in executing a query, in any database.
It will take many times more operations to do this with SQL than without. The
strength of SQL is in abstracting complex access to complex data. Simple lists
of data points that only need a few simple transformations should probably be
handled differently.
Second, SQLite has in-memory overhead proportional to the size of the database.
10TB worth of database would require a LOT of memory. I tried to find the
numbers on this but couldn't; I think it was something like 1 byte per page,
with each page 1KB by default. That works out to about 10GB of RAM for your
data (10TB / 1KB pages x 1 byte per page), and that assumes SQLite stores it
as efficiently as it is stored now (unlikely).
Third, SQLite has a 2TB limit using the default page size. By increasing the
page size you can raise that limit, but you are entering wild territory.
I'm a huge fan of SQLite, but I wouldn't use it for the job you described.
John
-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Igor Conom
Sent: Saturday, October 17, 2009 10:10 AM
To: General Discussion of SQLite Database
Subject: Re: [sqlite] Creating a spatial index for a large number of points -
sorry for the text formatting
Sorry for the formatting - it looked better when I sent it from Yahoo's web
interface.
----- Original Message ----
From: Igor Conom <[email protected]>
To: [email protected]
Sent: Sat, October 17, 2009 9:03:54 AM
Subject: [sqlite] Creating a spatial index for a large number of points
Hello everybody!
I need to create a spatial index for a large number of points and I would
really like to hear your comments. Thank you in advance for your time.
There are some very large files containing scientific data, stored in a very
simple format (by very large I mean things in the range of 10 TB). One of
these files is made of equal-length records, each record containing a
fixed-length header and a vector of floating point numbers. Each record has
two spatial points associated with it: (x1,y1) and (x2,y2). There are a few
hundred million such records.
What I need is to create an index that allows me to group all records with the
same (x1,y1) and do various operations (by "same (x1,y1)" I mean points within
some predefined range, i.e. two points are considered the same if the distance
between them is less than some specified constant).
The operations are:
1. for a given (x,y), quickly find the record group with the closest (x1,y1);
2. given a rectangle, find the list of (x1,y1) groups within that rectangle.
The index needs to be built in a reasonable amount of time (close to the time
it takes to read through the original data). Preferably a single pass through
the data should be required - but this is second to building an efficient
index. Once created, the index will be used read-only, so one can take some
extra time building it.
I expect the number of distinct (x1,y1) groups to be in the range of a few
million, with maybe a few hundred records associated with each group. A record
can be uniquely identified by a 32-bit integer (its position in the file(s)),
so a group will be defined by (x1,y1) and a list of integers.
My initial plan is to create two tables: one storing big chunks of groups and
one r-tree. Each "big chunk" will be a list of many groups contained in a
larger rectangular area, which I can further process in memory. I start with a
big rectangle representing a boundary large enough to contain every point. As
I add a point, I use the r-tree to find a suitable rectangle for it, then add
the point to that rectangle's list. If the list grows over a given number of
points (or size in bytes), I split the rectangle into 4 parts from the median
point, so each part will contain a balanced number of points.
Another thing: I can use integer coordinates. I know that, relative to a good
choice of origin and cell size, the areal extent will not contain more than
2^32 cells on each axis (from INT_MIN to INT_MAX). I presume the operations
are faster with integers.
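A possible mapping, assuming origin and cell_size are chosen so that every
coordinate lands in [INT_MIN, INT_MAX]:

#include <math.h>

/* Map a world coordinate to a 32-bit cell index. */
static int to_cell(double v, double origin, double cell_size)
{
    return (int)floor((v - origin) / cell_size);
}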
CREATE TABLE groups (data BLOB);
CREATE VIRTUAL TABLE groups_rt USING rtree_i32
  (Id INTEGER, minX INTEGER, maxX INTEGER, minY INTEGER, maxY INTEGER);
The groups_rt.Id will be groups.ROWID.
The algorithm to create the index is described next:

for each input point:
    if it is the first point:
        insert the first entry into the groups table and
        insert a big rectangle into the r-tree (i.e. INT_MIN, INT_MAX, INT_MIN, INT_MAX)
    else:
        transform x1,y1 to integers: ix1, iy1
        SELECT groups.ROWID, data, minX, maxX, minY, maxY
          FROM groups, groups_rt
         WHERE groups.ROWID = groups_rt.Id
           AND maxX >= ix1 AND minX <= (ix1+1)
           AND maxY >= iy1 AND minY <= (iy1+1)
        There will always be at least one row and at most 4 rows in the result.
        Pick the closest rectangle to the point (closest in real coordinates).
        If the groups.data blob already contains too many points
        (or is bigger than some number of bytes, e.g. 4K):
            remove it from the database, split the rectangle into 4 parts,
            then add the parts to the database.
        Add the point to the right blob of data.
    endif
end for
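The lookup from the pseudocode, written against the SQLite C API (in the real
loop the statement would be prepared once and rebound per point; error checks
omitted):

#include <sqlite3.h>

void lookup_rect(sqlite3 *db, int ix1, int iy1)
{
    sqlite3_stmt *lookup;
    sqlite3_prepare_v2(db,
        "SELECT groups.ROWID, data, minX, maxX, minY, maxY"
        "  FROM groups, groups_rt"
        " WHERE groups.ROWID = groups_rt.Id"
        "   AND maxX >= ?1 AND minX <= ?2"
        "   AND maxY >= ?3 AND minY <= ?4",
        -1, &lookup, 0);
    sqlite3_bind_int(lookup, 1, ix1);
    sqlite3_bind_int(lookup, 2, ix1 + 1);
    sqlite3_bind_int(lookup, 3, iy1);
    sqlite3_bind_int(lookup, 4, iy1 + 1);
    while (sqlite3_step(lookup) == SQLITE_ROW) {
        /* at most 4 candidate rectangles; pick the closest in real coords */
    }
    sqlite3_finalize(lookup);
}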
There is some additional optimization: based on the fact that I process a
bunch of points at a time, and they are most likely to be close to each other,
I keep the current blob/rect in memory and keep adding points to it until a
different one is needed. This saves some updates to the database.
Right now I'm thinking about what structure to use for the blob: for example,
knowing that more points will be added later, is it worth allocating a bigger
chunk initially so more points can be added in place?
I'd really like to hear some comments or new ideas. It seems to be a little
complicated, so maybe I'm missing something. The rectangles stored in the
r-tree are always non-overlapping: can this property be used for some
optimization? Is SQLite the right choice for the job? I'm reluctant to build
my own disk format - dealing with all the cache management, journaling, and
fragmentation alone would add significant development time ... and bugs.
Thank you for your time again, if you read this far :)
Cheers,
-IC
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users