Re: [sqlite] Curious performance issue with large number of inserts

2007-05-08 Thread Tim Bradshaw

On 5 May 2007, at 17:11, Joe Wilson wrote:


> What timings do you get when you run the perl script in my last email?


Your script, inevitably, was fine.  What seems to cause the trouble is
something nasty in the pathnames.  I've modified my perl script to
sanitize them by removing any non-ASCII characters, and this makes it
all just work.  Unfortunately I can't find the offending pathname.
It would be interesting to know what went wrong - I'd imagine
seriously bad things in strings would make sqlite3 die.  It's possible
that it missed a closing quote character and ended up eating a basically
infinite string (though I'd expect memory use to grow much faster than
it did in that case, if it grew at all).
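
(Concretely, the sanitizing step is along these lines - a sketch rather
than the exact code from my script, and the quote-doubling is an extra
precaution I'm assuming here; $uid, $gid, $mtime and $vol are whatever
the script already has in hand:)

# Sketch of the sanitizing/quoting step; $path is a raw pathname from
# the filesystem listing, the other variables are assumed to be set.
(my $clean = $path) =~ s/[^\x20-\x7e]//g;   # drop anything outside printable ASCII
$clean =~ s/'/''/g;                         # double any single quotes for SQL
print "insert into filemap values(null, $uid, $gid, $mtime, $vol, '$clean');\n";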


So I guess this was basically an SQL injection attack in the form of  
carefully chosen Unix pathnames :-)


Thanks for the help!

--tim




Re: [sqlite] Curious performance issue with large number of inserts

2007-05-05 Thread Tim Bradshaw

On 5 May 2007, at 17:11, Joe Wilson wrote:


> What timings do you get when you run the perl script in my last email?


I'll try it on Tuesday; I'm not back at work till then.




Re: [sqlite] Curious performance issue with large number of inserts

2007-05-05 Thread Joe Wilson
--- Tim Bradshaw <[EMAIL PROTECTED]> wrote:
> > I can't reproduce your problem. I can insert 16M records into your table
> > schema in 25 minutes on a 5 year old Windows machine. The sqlite3 process
> > had peak RAM usage of less than 20M.
> 
> Rats, I suspect it must be some compiler/architecture specific thing.  

I run GCC-compiled sqlite3 on sparc/solaris without such performance issues.

> I can rebuild it for x86, since I don't really care where it runs.  It
> definitely isn't running out of memory, though (at least, not unless
> it's throttling itself somehow; the machine has loads free when it
> gets sick).

What timings do you get when you run the perl script in my last email?




Re: [sqlite] Curious performance issue with large number of inserts

2007-05-05 Thread Tim Bradshaw

On 5 May 2007, at 15:59, Joe Wilson wrote:

> Ignore the idea above - an insert of NULL or an incrementing integer for
> the INTEGER PRIMARY KEY yields the same timings.


I might try taking it out anyway - I don't really need it other than  
to be able to say "look at row n" to someone.


I did originally have indices before doing the inserts, but I took  
them away.




> I can't reproduce your problem. I can insert 16M records into your table
> schema in 25 minutes on a 5 year old Windows machine. The sqlite3 process
> had peak RAM usage of less than 20M.


Rats, I suspect it must be some compiler/architecture specific thing.
I can rebuild it for x86, since I don't really care where it runs.  It
definitely isn't running out of memory, though (at least, not unless
it's throttling itself somehow; the machine has loads free when it
gets sick).



Thanks

--tim




Re: [sqlite] Curious performance issue with large number of inserts

2007-05-05 Thread Joe Wilson
--- Joe Wilson <[EMAIL PROTECTED]> wrote:
> > create table filemap(id integer primary key,
> >uid integer, gid integer, mtime integer,
> >vol integer,
> >path varchar(1024));
> > 
> > It has no indices built yet.
> > 
> > I'm adding quite a lot of records to it using a perl script which  
> > generates SQL like this:
> > 
> > begin;
> >   insert into filemap values(null, 1, 1, , 0, "/path/to/file");
> >   ...  more like this ...
> > commit;
> > ... repeat above ...
> > 
> > The uid, gid and mtime fields vary obviously, but there are very many  
> > paths for each uid/gid pair.  The idea is that I need to be able to  
> > say `show me all the files owned by UID x on volume y?', and we have  
> > enough data that awk can't hack it.
> > 
> > before doing this I've done a
> > 
> > pragma synchronous=off;
> > 
> > All this is just being piped into sqlite3 (I know I should use the  
> > proper interface, but it's a very quick & dirty hack).
> > 
> > I have about 16,000,000 records.  When adding them it goes really  
> > quickly for about the first 1,000,000 (using the big transaction  
> > trick made it much faster, as did the synchronous=off thing).  But  
> > then it slows down dramatically, perhaps by a factor of 100 or 1000  
> > or something.  I've actually left it running, but I'm not convinced  
> > it will have done all 16 million by Monday.
> > 
> > I have not looked at what the program is doing in the sense of system  
> > calls or any more detailed profiling.  It is clear that disk activity  
> > falls right off when it becomes slow.
> > 
> > Am I doing anything obviously stupid here?  I suspect I must be.
> 
> The batch insert you describe ought to be pretty fast since you're only
> appending data.
> 
> This is a guess - instead of inserting NULL for the INTEGER PRIMARY KEY 
> column try assigning an increasing number for each new row.
> This would avoid an OP_NewRowid per insert, which I would not think to
> be slow, but it's worth trying.

Ignore the idea above - an insert of NULL or an incrementing integer for the
INTEGER PRIMARY KEY yields the same timings.

I can't reproduce your problem. I can insert 16M records into your table
schema in 25 minutes on a 5 year old Windows machine. The sqlite3 process
had peak RAM usage of less than 20M.

If you had additional indexes on the table prior to insert (other than the 
INTEGER PRIMARY KEY), that could explain the slow results you are seeing.
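
A quick way to check is to ask sqlite_master directly, e.g.:

  select name, sql from sqlite_master
   where type = 'index' and tbl_name = 'filemap';

which should return no rows if the only "index" is the implicit rowid of
the INTEGER PRIMARY KEY.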


$ cat ins.pl

print "
pragma page_size=8192;
pragma temp_store=memory;
pragma synchronous=off;
create table filemap(id integer primary key, uid integer, gid integer,
 mtime integer, vol integer, path varchar(1024));
begin;
";
for (my $i = 1; $i <= 16000000; ++$i) {
  if ($i % 1000 == 0) { print "commit;\nbegin;\n"; }
  # the mtime value was lost in the list archive; 0 stands in for it here
  print "insert into filemap values(null,1,1,0,0,'/path/to/file');\n";
}
print "commit;\n";

$ rm -f ins.db ; perl ins.pl | time ./sqlite3 ins.db
1389.89user 61.98system 25:13.57elapsed 95%CPU (0avgtext+0avgdata 218880maxresident)k
0inputs+0outputs (4938major+0minor)pagefaults 0swaps

$ ls -l ins.db
-rw-r--r-- 1 User Nobody 512417792 May  5 10:47 ins.db

$ ./sqlite3 ins.db "select count(*) from filemap"
16000000





Re: [sqlite] Curious performance issue with large number of inserts

2007-05-05 Thread Joe Wilson
--- Tim Bradshaw <[EMAIL PROTECTED]> wrote:
> create table filemap(id integer primary key,
>uid integer, gid integer, mtime integer,
>vol integer,
>path varchar(1024));
> 
> It has no indices built yet.
> 
> I'm adding quite a lot of records to it using a perl script which  
> generates SQL like this:
> 
> begin;
>   insert into filemap values(null, 1, 1, , 0, "/path/to/file");
>   ...  more like this ...
> commit;
> ... repeat above ...
> 
> The uid, gid and mtime fields vary obviously, but there are very many  
> paths for each uid/gid pair.  The idea is that I need to be able to  
> say `show me all the files owned by UID x on volume y?', and we have  
> enough data that awk can't hack it.
> 
> before doing this I've done a
> 
> pragma synchronous=off;
> 
> All this is just being piped into sqlite3 (I know I should use the  
> proper interface, but it's a very quick & dirty hack).
> 
> I have about 16,000,000 records.  When adding them it goes really  
> quickly for about the first 1,000,000 (using the big transaction  
> trick made it much faster, as did the synchronous=off thing).  But  
> then it slows down dramatically, perhaps by a factor of 100 or 1000  
> or something.  I've actually left it running, but I'm not convinced  
> it will have done all 16 million by Monday.
> 
> I have not looked at what the program is doing in the sense of system  
> calls or any more detailed profiling.  It is clear that disk activity  
> falls right off when it becomes slow.
> 
> Am I doing anything obviously stupid here?  I suspect I must be.

The batch insert you describe ought to be pretty fast since you're only
appending data.

This is a guess - instead of inserting NULL for the INTEGER PRIMARY KEY 
column try assigning an increasing number for each new row.
This would avoid an OP_NewRowid per insert, which I would not think to
be slow, but it's worth trying.
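
For example, in the generating script (just a sketch - the uid/gid/mtime/vol
values and the @paths array are placeholders):

# supply the rowid explicitly instead of NULL
my $rowid = 0;
foreach my $path (@paths) {
  ++$rowid;
  print "insert into filemap values($rowid, 1, 1, 0, 0, '$path');\n";
}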

You might also experiment with the sqlite3 import facility which should be
slightly faster.
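
For example (an untested sketch - it assumes a pipe-separated input file,
since '|' is the shell's default column separator for .import, so paths
containing '|' or newlines would need special handling, and gen_rows.pl is
hypothetical):

$ perl gen_rows.pl > filemap.txt     # lines like: 1|1|1|0|0|/path/to/file
$ ./sqlite3 ins.db
sqlite> create table filemap(id integer primary key, uid integer, gid integer,
   ...>   mtime integer, vol integer, path varchar(1024));
sqlite> .import filemap.txt filemap
sqlite> select count(*) from filemap;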




[sqlite] Curious performance issue with large number of inserts

2007-05-04 Thread Tim Bradshaw
Apologies if I should have found an answer to this by searching or  
being less naive about SQL!


I'm using SQLite 3.3.17, built by me with gcc (3.4.3?  I can't check just
now as the machine is at work - it's the one Sun ships, anyway) on a SPARC
Solaris 10u3 box.


I don't have any significant experience of SQL databases so I may be  
doing something trivially wrong.


I have a table defined by

create table filemap(id integer primary key,
  uid integer, gid integer, mtime integer,
  vol integer,
  path varchar(1024));

It has no indices built yet.

I'm adding quite a lot of records to it using a perl script which  
generates SQL like this:


begin;
 insert into filemap values(null, 1, 1, , 0, "/path/to/file");
 ...  more like this ...
commit;
... repeat above ...

The uid, gid and mtime fields vary obviously, but there are very many  
paths for each uid/gid pair.  The idea is that I need to be able to  
say `show me all the files owned by UID x on volume y?', and we have  
enough data that awk can't hack it.
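
(For concreteness, the sort of query I have in mind is roughly the following,
with an index to be built once the data is loaded - the index name and the
literal uid/vol values are just examples:)

create index filemap_uid_vol on filemap(uid, vol);
select path from filemap where uid = 1234 and vol = 3;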


before doing this I've done a

pragma synchronous=off;

All this is just being piped into sqlite3 (I know I should use the  
proper interface, but it's a very quick & dirty hack).


I have about 16,000,000 records.  When adding them it goes really  
quickly for about the first 1,000,000 (using the big transaction  
trick made it much faster, as did the synchronous=off thing).  But  
then it slows down dramatically, perhaps by a factor of 100 or 1000  
or something.  I've actually left it running, but I'm not convinced  
it will have done all 16 million by Monday.


I have not looked at what the program is doing in the sense of system  
calls or any more detailed profiling.  It is clear that disk activity  
falls right off when it becomes slow.


Am I doing anything obviously stupid here?  I suspect I must be.

Thanks

--tim



