Re: [sqlite] create unique index quickly

2006-05-22 Thread Jay Sprenkle

On 5/22/06, Micha Bieber <[EMAIL PROTECTED]> wrote:

> Interesting, but it doesn't have any ad hoc query capability:

Take into consideration that this is not the whole truth: simply
selecting and getting what the HDF5 authors call a 'hyperslab' is a
great feature and an example of a simple query task. This special task
alone might become an essential part of your problem.


I wouldn't classify ad hoc queries as 'simple'. If all you need is simple
queries this might be a much better choice. Not knowing any details of
how the data will be used makes it hard to judge which would be better
for the application. Thanks for pointing it out! :)


RE: [sqlite] create unique index quickly

2006-05-22 Thread Brannon King
Thank you for answering the question! Like I said in my other post, the
index creation is about 20% faster than the unique index creation in my
latest tests of an index of two INTEGERs. 20% out of several hours is
significant.

So index names are unique for the whole database? I was thinking they just
had to be unique for the table. I'll double check my names...

> > Is there some way I can create that unique index without the overhead
> > of uniqueness checking? I know all my triples are unique when used
> > together: I generated them that way. Or is there some other way I can
> > create that index faster?
> > 
> 
> The following technique is unsupported.  It might change or 
> go away in a subsequent release.  It might not work.
> It might corrupt your database.  Use with extreme caution.
> 
>   (1)  CREATE INDEX primkey ON data(x,y,z)
>   (2)  PRAGMA writable_schema=ON
>   (3)  UPDATE sqlite_master 
>   SET sql='CREATE UNIQUE INDEX primkey ON data(x,y,z)'
> WHERE name='primkey'
>   (4)  close and reopen the database connection
> 
> The above technique is unsupported.  It might not work.
> It might corrupt your database.  The overhead of the 
> uniqueness check is not that much so it is unlikely to help 
> much anyway.  You have been warned.
> 
> --
> D. Richard Hipp
 



RE: [sqlite] create unique index quickly

2006-05-22 Thread Brannon King
> I am simply curious.  This sounds like an amazing 
> engineering challenge.  If it is not a secret, can you 
> describe what this data represents and how it will be used? 
Genomics. Look up "Smith-Waterman" or "Needleman-Wunsch-Sellers" on the web.

> What is the ultimate source of this data? 
Custom hardware
 
> How many days/weeks/eons of it do you plan to accumulate? 
Less than a week.

>   How much
> raw disk space is that?
It varies by thresholding. 1GB to several TB.
 
> If backups and journaling are not important, then is it 
> safe to assume that you can always regenerate that data on 
> demand?  Is each "set" of data identical, or only 
> statistically similar to prior sets?
I can always regenerate if I want to wait a few days. It's not exactly an
exact science anyway.

> Your project sounds like fun though, from what little 
> I've read of this thread.  Sure beats writing boring 
> financial software ;)
I admit I'm quite enjoying it as well.



Re: [sqlite] create unique index quickly

2006-05-22 Thread Jay Sprenkle

On 5/22/06, Brannon King <[EMAIL PROTECTED]> wrote:

> _Personally_, I think this sounds like a task not quite
> fitting into sqlite's (or probably any 'standard' database's) realm.
> This is a bit off-topic in this group, but because you
> mention sub-boxes - did you ever look into more specialized
> file formats like HDF5:
>
> http://hdf.ncsa.uiuc.edu/HDF5/


What a fantastic link and suggestion! I just may end up using it.


Interesting, but it doesn't have any ad hoc query capability:


"Performance-wise, how does HDF5 compare to a relational database?

 It really depends on your application. HDF5 is tuned to do
efficient I/O and storage for "big" data (hundreds of megabytes and
more). It will not work well for small reads/writes.

 It doesn't have indexing capabilities, though we are working on
some limited features. See the HDF5_Prototype_Indexing_Requirements
for details.

 HDF5 was designed to complement DBs and not to compete with them. "


RE: [sqlite] create unique index quickly

2006-05-22 Thread Brannon King
> _Personally_, I think this sounds like a task not quite
> fitting into sqlite's (or probably any 'standard' database's) realm.
> This is a bit off-topic in this group, but because you
> mention sub-boxes - did you ever look into more specialized
> file formats like HDF5:
> 
> http://hdf.ncsa.uiuc.edu/HDF5/
> 
> with support for certain (of course limited) queries?
> 
> Micha
> -- 
> 


What a fantastic link and suggestion! I just may end up using it.



Re: [sqlite] create unique index quickly

2006-05-22 Thread Jay Sprenkle

On 5/22/06, Dennis Jenkins <[EMAIL PROTECTED]> wrote:

Jay Sprenkle wrote:
>
> Me too!
>
> The two largest database setups I've worked with:
> the total of all the call records for a major phone company,
> and the cumulative records of most of the drugs bought in
> the United States, don't add up to as much as this.
> What are you sampling at 1/200th of a microsecond?


Jay, do you work for the NSA or the DEA or both?


LOL!

Worked as a consultant for Sprint and Argus Health systems.
They both do huge data warehouses.

Of course, if I told you the truth I'd have to kill you. ;)


Re: [sqlite] create unique index quickly

2006-05-22 Thread Micha Bieber
Monday, May 22, 2006, 15:17:21, Jay Sprenkle wrote:

> On 5/22/06, Dennis Jenkins <[EMAIL PROTECTED]> wrote:
>> Brannon King wrote:
>> > The benefits I'm trying to get out of sqlite are the data queries. I
>> > collect a large, sparse 2D array from hardware. The hardware device is
>> > giving me a few GB of data at 200MB/s. Future hardware versions
>> > will be four times that fast and give me terabytes of data. After I
>> > have the data, I then have to go through and make calculations on
>> > sub-boxes of that data. (I'll post some more about that in a different
>> > response.) I was trying to avoid coding my own
>> > sparse-matrix-file-stream-mess that I would have to do if I didn't
>> > have a nice DB engine. I think sqlite will work. I think it will be
>> > fast enough. I'll have some nice RAID controllers on the production
>> > machines with 48-256MB caches.
>>
>> Hello Brannon,
>>
>> I am simply curious.  This sounds like an amazing engineering
>> challenge.  If it is not a secret, can you describe what this data
>> represents and how it will be used?

> Me too!

Me too too :)

_Personally_, I think this sounds like a task not quite fitting into
sqlite's (or probably any 'standard' database's) realm. This is a bit
off-topic in this group, but because you mention sub-boxes - did you
ever look into more specialized file formats like HDF5:

http://hdf.ncsa.uiuc.edu/HDF5/

with support for certain (of course limited) queries?

Micha  
-- 



Re: [sqlite] create unique index quickly

2006-05-22 Thread Dennis Jenkins
Jay Sprenkle wrote:
>
> Me too!
>
> The two largest database setups I've worked with:
> the total of all the call records for a major phone company,
> and the cumulative records of most of the drugs bought in
> the United States, don't add up to as much as this.
> What are you sampling at 1/200th of a microsecond?


Jay, do you work for the NSA or the DEA or both?



Re: [sqlite] create unique index quickly

2006-05-22 Thread Jay Sprenkle

On 5/22/06, Dennis Jenkins <[EMAIL PROTECTED]> wrote:

Brannon King wrote:
> The benefits I'm trying to get out of sqlite are the data queries. I
> collect a large, sparse 2D array from hardware. The hardware device is
> giving me a few GB of data at 200MB/s. Future hardware versions
> will be four times that fast and give me terabytes of data. After I
> have the data, I then have to go through and make calculations on
> sub-boxes of that data. (I'll post some more about that in a different
> response.) I was trying to avoid coding my own
> sparse-matrix-file-stream-mess that I would have to do if I didn't
> have a nice DB engine. I think sqlite will work. I think it will be
> fast enough. I'll have some nice RAID controllers on the production
> machines with 48-256MB caches.

Hello Brannon,

I am simply curious.  This sounds like an amazing engineering
challenge.  If it is not a secret, can you describe what this data
represents and how it will be used?


Me too!

The two largest database setups I've worked with:
the total of all the call records for a major phone company,
and the cumulative records of most of the drugs bought in
the United States, don't add up to as much as this.
What are you sampling at 1/200th of a microsecond?


Re: [sqlite] create unique index quickly

2006-05-22 Thread Dennis Jenkins
Brannon King wrote:
> The benefits I'm trying to get out of sqlite are the data queries. I
> collect a large, sparse 2D array from hardware. The hardware device is
> giving me a few GB of data at 200MB/s. Future hardware versions
> will be four times that fast and give me terabytes of data. After I
> have the data, I then have to go through and make calculations on
> sub-boxes of that data. (I'll post some more about that in a different
> response.) I was trying to avoid coding my own
> sparse-matrix-file-stream-mess that I would have to do if I didn't
> have a nice DB engine. I think sqlite will work. I think it will be
> fast enough. I'll have some nice RAID controllers on the production
> machines with 48-256MB caches.

Hello Brannon,

I am simply curious.  This sounds like an amazing engineering
challenge.  If it is not a secret, can you describe what this data
represents and how it will be used? 

What is the ultimate source of this data? 

How many days/weeks/eons of it do you plan to accumulate?   How much
raw disk space is that?

If backups and journaling are not important, then is it safe to
assume that you can always regenerate that data on demand?  Is each
"set" of data identical, or only statistically similar to prior sets?

Your project sounds like fun though, from what little I've read of
this thread.  Sure beats writing boring financial software ;)



Re: [sqlite] create unique index quickly

2006-05-22 Thread drh
"Brannon King" <[EMAIL PROTECTED]> wrote:
> The statement
> CREATE UNIQUE INDEX primkey ON 4dData (x,y,z)
> is too slow for me.
> 
> Is there some way I can create that unique index without the overhead of
> uniqueness checking? I know all my triples are unique when used together:
> I generated them that way. Or is there some other way I can create that
> index faster?
> 

The following technique is unsupported.  It might change
or go away in a subsequent release.  It might not work.
It might corrupt your database.  Use with extreme caution.

  (1)  CREATE INDEX primkey ON data(x,y,z)
  (2)  PRAGMA writable_schema=ON
  (3)  UPDATE sqlite_master 
  SET sql='CREATE UNIQUE INDEX primkey ON data(x,y,z)'
WHERE name='primkey'
  (4)  close and reopen the database connection

The above technique is unsupported.  It might not work.
It might corrupt your database.  The overhead of the
uniqueness check is not that much so it is unlikely
to help much anyway.  You have been warned.
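To see the whole unsupported sequence end to end, here is a minimal sketch using Python's stdlib sqlite3 module. The table and index names are the ones from the example above; the same warnings apply, so treat this as a demonstration of the steps, not a recommendation:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

con = sqlite3.connect(path)
con.execute("CREATE TABLE data(x INTEGER, y INTEGER, z INTEGER)")
# Rows are known to be unique triples, as in the original question.
con.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(i, i + 1, i + 2) for i in range(1000)])

# (1) Build a plain (non-unique) index -- no uniqueness check is done.
con.execute("CREATE INDEX primkey ON data(x, y, z)")
con.commit()

# (2)+(3) Rewrite the schema text so the index *claims* to be unique.
con.execute("PRAGMA writable_schema=ON")
con.execute("UPDATE sqlite_master "
            "SET sql='CREATE UNIQUE INDEX primkey ON data(x,y,z)' "
            "WHERE name='primkey'")
con.commit()

# (4) Close and reopen so the modified schema text is re-parsed.
con.close()
con = sqlite3.connect(path)
print(con.execute("SELECT sql FROM sqlite_master "
                  "WHERE name='primkey'").fetchone()[0])
# -> CREATE UNIQUE INDEX primkey ON data(x,y,z)
con.close()
```

After the reopen, the schema claims the index is UNIQUE even though no uniqueness check was ever run, which is exactly why the trick is dangerous if the data is not actually unique.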

--
D. Richard Hipp   <[EMAIL PROTECTED]>



Re: [sqlite] create unique index quickly

2006-05-21 Thread Jay Sprenkle

On 5/20/06, Brannon King <[EMAIL PROTECTED]> wrote:

The benefits I'm trying to get out of sqlite are the data queries. I
collect a large, sparse 2D array from hardware. The hardware device is
giving me a few GB of data at 200MB/s. Future hardware versions
will be four times that fast and give me terabytes of data. After I have
the data, I then have to go through and make calculations on sub-boxes
of that data.


Have you thought about constructing a data warehouse for this? Use something
close to the metal (hash + binary file I/O) to store the initial data.
Then write
something to load what data you need into a database. You can then do your
queries against that. I think your overall bottleneck is indexing, and indexing
800MB/s sustained will take big iron or a beowulf cluster.


Re: [sqlite] create unique index quickly

2006-05-21 Thread Christian Smith

On Sat, 20 May 2006, Brannon King wrote:


John Stanton wrote:
You don't seem to need a data manipulation system like Sqlite so much as a
form of high volume storage.  Do you really need elaborate SQL, journalling,
ROLLBACK and assured disk storage?


Did you consider some form of hashed storage, perhaps linear hashing, to
build a compact and high performance associative array for your sparsely
keyed data?


Do you really need the overhead of B-trees if you are just storing a sparse
array?

JS
I don't need journaling or rollback. I'd love a way to shut them off. But
elaborate SQL, that sure is handy. I'm not just storing, I'm viewing stored,
compressed data. I definitely need some way of querying sparse matrix data
that is larger than my DRAM. Sqlite sure seems like the quickest route to a
workable product for that to happen. It has all the streaming/caching built
in. Because of that, I assume it is faster than random file access. It
supports complex data queries and indexes, both things I would need anyway.
In the world of programming, I think many will agree you should get a working
product, then make it faster. I'm just trying to get the most speed out of
the easiest tool. If I need to rewrite the file storage for the next version,
we can consider the cost to benefit for that separately.





Is that 800MB/s sustained? Is it constant, 24/7? Or is it bursty data?

If it's bursty, you could buffer the inserts in an append-only log.
Insertion into the SQL(ite) database can then be done asynchronously.
Writing to unstructured, append-only log files can be done at pretty much
disk I/O speed.
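That buffering scheme can be sketched as follows. This is an illustration with invented names (`append_burst`, `drain_log`) and an assumed fixed record layout of three 64-bit integers, not code from the thread:

```python
import sqlite3
import struct

REC = struct.Struct("<qqq")  # one (x, y, z) triple of 64-bit ints, 24 bytes

def append_burst(log_path, triples):
    # During a burst: append-only sequential writes -- no parsing,
    # no indexing, close to raw disk I/O speed.
    with open(log_path, "ab") as log:
        for t in triples:
            log.write(REC.pack(*t))

def drain_log(log_path, db_path):
    # Later (or from another thread/process): load the log into SQLite
    # inside a single transaction so the inserts are batched.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS data"
                "(x INTEGER, y INTEGER, z INTEGER)")
    with open(log_path, "rb") as log:
        blob = log.read()
    rows = (REC.unpack_from(blob, off)
            for off in range(0, len(blob), REC.size))
    with con:  # one transaction for the whole batch
        con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
    con.close()
```

The append path never touches the database, so the 200MB/s burst is limited only by sequential disk bandwidth; the drain path pays the indexing cost off the critical path.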


If it's not bursty, but sustained, then I fear SQLite or any other 
database will not be able to match that rate. You'll probably run out of 
CPU before running out of disk IO.


Christian

PS. On a side note, is it still wise to buy SGI kit?


Re: [sqlite] create unique index quickly

2006-05-20 Thread John Stanton

Brannon King wrote:

John Stanton wrote:

You don't seem to need a data manipulation system like Sqlite so much as a
form of high volume storage.  Do you really need elaborate SQL,
journalling, ROLLBACK and assured disk storage?


Did you consider some form of hashed storage, perhaps linear hashing,
to build a compact and high performance associative array for your
sparsely keyed data?


Do you really need the overhead of B-trees if you are just storing a
sparse array?

JS


I don't need journaling or rollback. I'd love a way to shut them off.
But elaborate SQL, that sure is handy. I'm not just storing, I'm
viewing stored, compressed data. I definitely need some way of querying
sparse matrix data that is larger than my DRAM. Sqlite sure seems like
the quickest route to a workable product for that to happen. It has all
the streaming/caching built in. Because of that, I assume it is faster
than random file access. It supports complex data queries and indexes,
both things I would need anyway. In the world of programming, I think
many will agree you should get a working product, then make it faster.
I'm just trying to get the most speed out of the easiest tool. If I need
to rewrite the file storage for the next version, we can consider the
cost to benefit for that separately.


I saw your performance requirements and data rate, which look difficult
to achieve when you are writing journals and ensuring the integrity of
disk records.


You will find that Sqlite is much slower than random file access, 
because Sqlite is built on top of random file access.  You get random 
file access speed less all the overhead of Sqlite's journals and B-trees.


We have an application using storage something like yours and we use 
memory mapped areas with AVL trees for indexing.  If it needed to run as 
fast as yours we would probably use hashing rather than the binary 
trees.  A sparse index is realized by concatenated keys.  This method 
dynamically uses memory for caching, but is not limited to physical 
memory size, only virtual memory.  It assumes POSIX capabilities from 
the OS.


With hashing you avoid the overhead of B-tree balancing with insertions,
but pay for it by not having the keys accessible in an ordered sequence.


Ask yourself and your users/customers whether the easiest solution or 
the best solution is the most satisfactory.


Re: [sqlite] create unique index quickly

2006-05-20 Thread Brannon King

John Stanton wrote:
You don't seem to need a data manipulation system like Sqlite so much as a
form of high volume storage.  Do you really need elaborate SQL,
journalling, ROLLBACK and assured disk storage?


Did you consider some form of hashed storage, perhaps linear hashing,
to build a compact and high performance associative array for your
sparsely keyed data?


Do you really need the overhead of B-trees if you are just storing a
sparse array?

JS
I don't need journaling or rollback. I'd love a way to shut them off.
But elaborate SQL, that sure is handy. I'm not just storing, I'm
viewing stored, compressed data. I definitely need some way of querying
sparse matrix data that is larger than my DRAM. Sqlite sure seems like
the quickest route to a workable product for that to happen. It has all
the streaming/caching built in. Because of that, I assume it is faster
than random file access. It supports complex data queries and indexes,
both things I would need anyway. In the world of programming, I think
many will agree you should get a working product, then make it faster.
I'm just trying to get the most speed out of the easiest tool. If I need
to rewrite the file storage for the next version, we can consider the
cost to benefit for that separately.


Re: [sqlite] create unique index quickly

2006-05-20 Thread John Stanton
You don't seem to need a data manipulation system like Sqlite so much as a
form of high volume storage.  Do you really need elaborate SQL,
journalling, ROLLBACK and assured disk storage?


Did you consider some form of hashed storage, perhaps linear hashing, to
build a compact and high performance associative array for your sparsely
keyed data?


Do you really need the overhead of B-trees if you are just storing a
sparse array?

JS

Brannon King wrote:
The benefits I'm trying to get out of sqlite are the data queries. I 
collect a large, sparse 2D array from hardware. The hardware device is 
giving me a few GB of data at 200MB/s. Future hardware versions
will be four times that fast and give me terabytes of data. After I have 
the data, I then have to go through and make calculations on sub-boxes 
of that data. (I'll post some more about that in a different response.) 
I was trying to avoid coding my own sparse-matrix-file-stream-mess that
I would have to do if I didn't have a nice DB engine. I think sqlite 
will work. I think it will be fast enough. I'll have some nice RAID 
controllers on the production machines with 48-256MB caches.


 From experimentation, the "UNIQUE INDEX" command is only 20% slower
than the "INDEX" command. That doesn't make sense to me. I would think
the UNIQUE INDEX creation would take longer because it has to do
redundancy checks. And I still feel there should be a way to create a
UNIQUE INDEX without waiting for the redundancy check.


Dennis Cote wrote

A more general question is why are you trying to use sqlite? If you need
the maximum possible speed you may be better off using in-memory vectors
and maps and highly tuned routines like the C++ STL algorithms instead of
sqlite. Then you can go back to using binary file I/O. SQLite adds
overhead above and beyond binary file I/O; it will always be slower. If
the benefits of sqlite don't outweigh the costs, you should stick with
binary I/O. What benefit are you hoping to get out of using sqlite?




Re: [sqlite] create unique index quickly

2006-05-20 Thread Brannon King


For your mail I think you have everything planned, but just out of
curiosity, how do you plan to pass data from the hardware (800 MB/s +
database data) to a RAID? A GigEthernet has 1000 Mb/s (so, 128 MB/s),
a PCI-X based solution has 1.3 to 2.6 Gb/s (so 150-300 MB/s), and
similar for SATA and PCIe (x16). Even a direct HyperTransport to the
RAID has 3.2 Gb/s. Note that I'm not asking about the RAID itself, just
about the SQLite-hardware-to-RAID connection.


O ye of little faith. First, the hardware device. PCI64 based solutions 
run at 64bit x 66MHz with 50-70% efficiency. You can increase the clock 
to 133MHz using PCI-X. At 50% efficiency that is still 500MB/s. (My 
current solution runs the former and uses 50MB/s input.) Each channel in 
PCIe is 2.5Gb, apx 200MB/s. I assume I can get 60% efficiency on that. 
16 channels of that is more than enough for 800MB/s. Hypertransport is 
way faster than 3.2Gb/s these days. We've looked at HTX (Hypertransport) 
connections, which are near 1000MB/s 
(http://www.pathscale.com/infinipath.php). In addition, I actually have 
an SGI Altix 350 box in the office that uses NUMALink connections 
running at 6.5GB/s: 200MHz x 128bit at near 100% efficiency. You can buy 
one of those for a scant $13k ;-)


For the hard disk, you can buy SGI 
(http://www.sgi.com/products/storage/tech/6700.html) drive arrays that 
run at 3GB/s. How do they do that? Lots of hard drives connected to 
their NUMA controllers. At the Super Computer show I've seen 1U 8-disk 
arrays running at a steady 400MB/s using QLogic controllers and optical 
links for a scant $3k. Two of those and you've got it.


Re: [sqlite] create unique index quickly

2006-05-20 Thread Eduardo

At 21:28 20/05/2006, you wrote:
The benefits I'm trying to get out of sqlite are the data queries. I 
collect a large, sparse 2D array from hardware. The hardware device 
is giving me a few GB of data at 200MB/s. Future hardware 
versions will be four times that fast and give me terabytes of data. 
After I have the data, I then have to go through and make 
calculations on sub-boxes of that data. (I'll post some more about 
that in a different response.) I was trying to avoid coding my own 
sparse-matrix-file-stream-mess that I would have to do if I didn't 
have a nice DB engine. I think sqlite will work. I think it will be 
fast enough. I'll have some nice RAID controllers on the production 
machines with 48-256MB caches.


From experimentation, the "UNIQUE INDEX" command is only 20% slower 
than the "INDEX" command. That doesn't make sense to me. I would 
think the UNIQUE INDEX creation would take longer because it has 
to do redundancy checks. And I still feel there should be a way to 
create a UNIQUE INDEX without waiting for the redundancy check.


Do the inserts into a memory-based temp table inside a transaction,
create the index on this table, and then pass the table to file. Also,
you may check/tweak SQLite settings, such as page size (use 32768 bytes;
the default is 1024), page cache (the default is 2000 pages; each page
takes about 1.5 * page_size bytes, so do the math on how many you can
have), and temp store (use memory). If that is not fast enough, you must
either duplicate hardware (two SQLite instances work faster than one) or
think about changing to another database; I suggest DB2, Valentina (a
fast, low-priced database), or any other commercial database.
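As a hedged sketch of those tweaks using Python's stdlib sqlite3 bindings (the function name and the specific values are illustrative, and `page_size` only takes effect on a fresh database file):

```python
import sqlite3

def bulk_load(db_path, triples):
    # Autocommit mode, so the BEGIN/COMMIT below are explicit.
    con = sqlite3.connect(db_path, isolation_level=None)

    # Tweaks suggested above: bigger pages, bigger page cache, temp in RAM.
    con.execute("PRAGMA page_size=32768")   # must run before the db file is populated
    con.execute("PRAGMA cache_size=4000")   # pages; each is roughly 1.5 * page_size bytes
    con.execute("PRAGMA temp_store=MEMORY")

    con.execute("CREATE TABLE IF NOT EXISTS data"
                "(x INTEGER, y INTEGER, z INTEGER)")
    con.execute("CREATE TEMP TABLE staging(x INTEGER, y INTEGER, z INTEGER)")

    # Inserts go to the in-memory temp table inside one transaction...
    con.execute("BEGIN")
    con.executemany("INSERT INTO staging VALUES (?, ?, ?)", triples)
    # ...then the whole table is passed to the file in a single statement.
    con.execute("INSERT INTO data SELECT * FROM staging")
    con.execute("COMMIT")
    con.close()
```

With `temp_store=MEMORY` the staging table never hits disk, so the file only sees one large sequential write at the end.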


For your mail I think you have everything planned, but just out of
curiosity, how do you plan to pass data from the hardware (800 MB/s +
database data) to a RAID? A GigEthernet has 1000 Mb/s (so, 128 MB/s),
a PCI-X based solution has 1.3 to 2.6 Gb/s (so 150-300 MB/s), and
similar for SATA and PCIe (x16). Even a direct HyperTransport to the
RAID has 3.2 Gb/s. Note that I'm not asking about the RAID itself,
just about the SQLite-hardware-to-RAID connection.


A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail? 



Re: [sqlite] create unique index quickly

2006-05-20 Thread Dennis Cote

On 5/18/06, Brannon King <[EMAIL PROTECTED]> wrote:


The statement
CREATE UNIQUE INDEX primkey ON 4dData (x,y,z)
is too slow for me. It takes 13 sec on 700k records on my Athlon XP 2GHz.
That's about four seconds longer than it took me to put the data in there.
The two times together are still faster than inserting into a table with a
primary key already declared.

Is there some way I can create that unique index without the overhead of
uniqueness checking? I know all my triples are unique when used together: I
generated them that way. Or is there some other way I can create that index
faster?

Thanks.
__
Brannon King
¯¯





Brannon,

You are getting almost 54K index entries per second using sqlite. How fast
do you need to go?

In another post you said:

I need to do billions (no joke) of consecutive inserts into this table.

Writing binary out with std::ofstream was using 26% of my code execution
time according to the profiler. I changed to sqlite3, and now the
sqlite3_step function uses 50% of my execution time. After reading the
news groups, I removed the primary key. That dropped it down to about
41%. That was significant. However, I was still significantly slower
than binary writes with ofstream.



A more general question is why are you trying to use sqlite? If you need the
maximum possible speed you may be better off using in-memory vectors and
maps and highly tuned routines like the C++ STL algorithms instead of
sqlite. Then you can go back to using binary file I/O. SQLite adds
overhead above and beyond binary file I/O; it will always be slower. If the
benefits of sqlite don't outweigh the costs, you should stick with binary
I/O. What benefit are you hoping to get out of using sqlite?
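To make that trade-off concrete, here is a rough sketch of the binary-I/O baseline being described, in Python rather than C++ for brevity (the names and the fixed record layout are invented for illustration):

```python
import struct

REC = struct.Struct("<qqq")  # fixed-size (x, y, z) record, 24 bytes

def write_raw(path, triples):
    # Plain sequential binary writes: the fastest baseline SQLite is
    # being compared against. No indexing, no uniqueness checks.
    with open(path, "wb") as f:
        for t in triples:
            f.write(REC.pack(*t))

def read_raw(path):
    # Reading it back requires knowing the layout -- there is no schema
    # and no ad hoc query capability, which is the cost of this approach.
    with open(path, "rb") as f:
        blob = f.read()
    return [REC.unpack_from(blob, off)
            for off in range(0, len(blob), REC.size)]
```

Everything SQLite adds on top of this (B-trees, journals, SQL parsing) is the overhead being weighed against its query capabilities.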

Dennis Cote


[sqlite] create unique index quickly

2006-05-18 Thread Brannon King
The statement
CREATE UNIQUE INDEX primkey ON 4dData (x,y,z)
is too slow for me. It takes 13 sec on 700k records on my Athlon XP 2GHz.
That's about four seconds longer than it took me to put the data in there.
The two times together are still faster than inserting into a table with a
primary key already declared.

Is there some way I can create that unique index without the overhead of
uniqueness checking? I know all my triples are unique when used together: I
generated them that way. Or is there some other way I can create that index
faster?

Thanks.
__
Brannon King
¯¯