Re: [sqlite] unexpected large journal file

2010-06-22 Thread Jim Wilcoxson
Eric Smith  writes:
> 
> Jay A.  Kreibich wrote: 
> 
> > Try getting rid of the PK definition and see how much that buys you.  
> > It might be worth it, or it might not.  
> 
> and Simon Slavin wrote: 
> 
> > We know you are doing a huge amount of writing to this database.  Are 
> > you also reading it frequently ?  If not, then it might be worth making an 
> > index on that primary key only when you're about to need it.  
...
> I tried removing the PK definition as you both suggested, and the 
> journal stays fixed at less than 20Kb, even against a db size of (at 
> the moment) 37Gb.  My insert batch run times are improved by a factor of 
> ~2.5 and seem to be O(1).  
> 
> So, bingo.  :) 


Here's a trick that might help you:

Since your primary key is two integers, you can combine them into one integer
and use them as the primary key, without requiring a separate index:

create table x (
  pkey integer primary key,
  val  text);

a = (value 0-2M)
b = (32-bit integer)
pkey = a<<32 | b
insert into x values (pkey, data)

The shift of 32 assumes that b fits in a 32-bit integer.  Since a is limited
to 2M, which requires only 21 bits, you could shift a left by 43 instead and
let b be up to 43 bits, i.e. up to 8796093022207.
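
A hedged C sketch of the packed insert, using the table and column names from
the example above (whether you keep the shift at 32 or widen it to 43 depends
on your actual ranges; the function names here are only illustrative):

  #include <stdint.h>
  #include <sqlite3.h>

  /* Pack (a, b) into one 64-bit key: a in the high bits, b in the low 32.
   * With a < 2^21 and b < 2^32 the result fits in SQLite's signed 64-bit
   * INTEGER PRIMARY KEY. */
  static sqlite3_int64 pack_key(uint32_t a, uint32_t b)
  {
      return ((sqlite3_int64)a << 32) | (sqlite3_int64)b;
  }

  static int insert_row(sqlite3 *db, uint32_t a, uint32_t b, const char *val)
  {
      sqlite3_stmt *stmt;
      int rc = sqlite3_prepare_v2(db,
          "INSERT INTO x(pkey, val) VALUES(?1, ?2)", -1, &stmt, 0);
      if (rc != SQLITE_OK) return rc;
      sqlite3_bind_int64(stmt, 1, pack_key(a, b));
      sqlite3_bind_text(stmt, 2, val, -1, SQLITE_TRANSIENT);
      rc = sqlite3_step(stmt);          /* SQLITE_DONE on success */
      sqlite3_finalize(stmt);
      return rc == SQLITE_DONE ? SQLITE_OK : rc;
  }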

Also, I think you mentioned that you are inserting records in this order:

a=0 b=0
a=1 b=0
...
a=2M b=0

then

a=0 b=1
a=1 b=1
...
a=2M b=1

To insert records in ascending pkey order (with the combined key above), you should insert them as:

a=0 b=0
a=0 b=1
a=1 b=0
a=1 b=1

Jim
---
HashBackup: easy onsite and offsite Unix backup
http://sites.google.com/site/hashbackup






Re: [sqlite] unexpected large journal file

2010-06-20 Thread Eric Smith
Black, Michael (IS) wrote: 

> Hmmm... a 6.5X speed diff between RAM and disk?  Sounds pretty good to 
> me.  Not sure why you expect better.  

I don't expect better.  As I said, I'm "not really sure what I should 
expect here, or how to find out what I should expect".  

> 10,000/sec is crappy?  And you think this because...  

Because I think anything besides instantaneous results is crappy.  :) 

But I'm willing for now to live with the max I should reasonably expect 
from modern hardware.  I'd like to know e.g.  how much raid5 is hurting, 
so I can know whether to recommend that the user move to a single disk.  

Jay, btw, I saw your email on stripe sizes -- thanks so much for the 
suggestion -- I'll look into it.  But see below for a recent update to 
app behavior.  

> #1 What OS are you on?  There are numerous disk speed testing programs 
> depending on your OS.  

64-bit RHEL5.4, Linux 2.6.18 for the moment.  I could probably convince 
the user to switch to FreeBSD6 if there are big wins there.  My app 
already works in both OSs.  

> #2 Are you multi-threading?  A separate reader process could help.  

Just one process, one thread.  

I'm sure you're right -- adding a reader process would certainly help 
run time by a few percentage points, but at the (unjustified at the 
moment) expense of increased maintenance cost & dev time.  This is an 
extremely cheap project so far and I'm trying to keep it that way.  

> #3 How many records total?  

Probably a couple billion, where each record is about 300 bytes across 
about 15 columns.  

> #4 Final size of database?  

After indices I'm guessing we'll be at ~500Gb.  When the db has 0.5b 
records and all indices are defined, we're at around 130Gb.  

> #5 How fast can you read your input file?  

I can generate inputs to SQLite at a rate of at least 65k records/sec.  
I haven't measured the input generation separately from the sqlite 
calls.  

> #6 What happens if you just insert the same records the same # of times 

Haven't tested it.  

> #7 What does your CPU usage show?  

After dropping the indices (per Jay et al's suggestion) I think SQLite 
is actually spending more than half its time on the CPU under large 
record sets.  Obviously I need to measure that more carefully under the 
new DB schema.  

My initial guess that I was inserting in O(1) was wrong -- time to 
insert 2m records went up by about a minute per 100m existing records.  

And this part is interesting: I finished the initial seed and created 
my user indices.  Then I added some more records, and found that insert 
times went down from 9 minutes to 2 minutes per 2 million records.  The 
plot is *striking*.  (I'd like to show it to the forum -- is it possible 
to send emails with attachments here?  It's a 60kb jpg file.)  I'm back 
up to inserting 17k records/second and am almost entirely CPU-bound.  I 
think I'm back in RAM!  

What is going on here?  Is SQLite using index data to do inserts more 
quickly?  Do you think that's causing it to need to read fewer pages on 
a particular insert?  

I'm very interested to see how this would look if I defined the user 
indices before the initial seed.  

> I assume you're multicore (as most are now I think).  

Yes, I have multiple CPUs, but I'm in one process in one thread so I'm 
only using one CPU at a time.  

Now that I'm seeing this CPU-bound behavior after adding indices, I'm 
reconsidering the whole multi-process thing.  Still, at least 75% of CPU 
usage is in SQLite.  More testing needed.  

> Depending on what you're doing with this data are you sure you need a 
> database solution?  

No.  But let's assume for now it's the best thing available to solve 
my problem under tight time constraints -- because in any case they're 
interesting questions, right?  :) 

> I don't recall you really explaining your ultimate goal...  

I tried to state the question as generally as possible while capturing 
the relevant specifics of my problem, so that gurus' answers will be 
useful to more people (including future Eric who is writing another 
application).  

I'll try starting off with those user indices and see how we do.

Thanks again!

Eric 

-- 
Eric A. Smith

I have always wished that my computer would be as easy to use as 
my telephone. My wish has come true. I no longer know how to use 
my telephone.
-- Bjarne Stroustrup


Re: [sqlite] unexpected large journal file

2010-06-20 Thread Eric Smith
Eric Smith wrote: 

> I'd like to show it to the forum -- is it possible to send emails with 
> attachments here?  It's a 60kb jpg file.  

God bless the gnuplot developers, who provided an ascii output option:

  time (minutes) to insert 2m records
10 ++--+--+---+--+---+-++
   +   +  +   +  A   +   +  +
 9 ++..++
   |  AAA A |
 8 ++..AAA..A..++
   | A      |
 7 ++.A.A..AAA.++
   |    |
 6 ++..AA...A.A++
   |AAA |
 5 ++A...AA++
 4 ++.AA.AA..A.A...++
   |   A AAA  AA|
 3 ++..++
   |   AAA A  A |
 2 ++.AA...++
   | AA |
 1 ++...AAA++
   AAA +  +   +  +   +  +
 0 ++--+--+---+--+---+-++
   0     100    200    300    400    500    600
   millions of existing records


You can see how we ran out of RAM at 100m records.  From 100m to 200m
there was a cron job running that was causing the disk cache to get
thrown out; I killed it at around 200m records.  

You can see the linear progression quite clearly.

At ~480m records I halted the process, built user indices, and restarted.
And voila, we're back down in happy-land.

Eric

-- 
Eric A. Smith

The people can always be brought to the bidding of the leaders. That 
is easy. All you have to do is tell them they are being attacked and 
denounce the pacifists for lack of patriotism and exposing the 
country to danger. It works the same way in any country. 
-- Herman Goering, at the Nuremberg trials


Re: [sqlite] unexpected large journal file

2010-06-20 Thread Jay A. Kreibich
On Sat, Jun 19, 2010 at 12:58:45PM -0400, Eric Smith scratched on the wall:
> Jay A.  Kreibich wrote: 
> 
> I think the use case will usually be (only) writes followed by (only) 
> reads.  There may be incremental writes later, but they will hopefully 
> be small compared to the initial seed.  I intend to create a set of 
> maybe 7 indices to serve the particular queries I think the user intends 
> to run.  Let's all hope we can update the db with those indices at a 
> higher rate than the user can generate data.  :p 

  Indexes do, unfortunately, take some time.  You were inserting the PK
  values in order, which helps, but if you need that many indexes
  you're going to need to shuffle a lot of data.

> Insert rates are still a rather crappy 10k records/second (when we were 
> in RAM we were doing ~65k recs/sec, which I think was bound by my app's 
> speed). 

  As others have said, a 6.5x difference between disk and memory is
  darn good, even with the most expensive arrays.

> I think I'm at the stupid raid5's limit -- not really sure what 
> I should expect there, or how to find out what I should expect.  

  Try adjusting the page size.  If possible, match the page size to
  the RAID stripe size.  RAID 4 and 5 suffer from a "read on write"
  condition to gather the parity data.  If you can write chunks of data
  that match the stripe size, the read can be skipped, which boosts
  performance.  This is especially true for writing data that was not
  previously read (e.g. growing a file) and might still be in the
  RAID cache.

  The max page size is 32K, which might not be big enough to cover the
  stripe size.  Depends a lot on how many disks you have and how the
  RAID is configured.
  
  Regardless, even if you can't match the stripe size, I would try
  increasing the page size.  With a database like yours, you can likely
  take the possible storage hit if it improves performance by reducing
  the number of I/O transactions.

  Just be sure you adjust your page cache.  The cache is defined in
  pages, so if you bump up the page size you might need to reduce the
  cache size to avoid using too much memory.
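
  A rough sketch of those two PRAGMAs via the C API -- the 32K page size
  and the cache figure are placeholders rather than recommendations, and
  page_size only takes effect before the first table is created (or via
  VACUUM):

    #include <sqlite3.h>

    /* Illustrative bulk-load tuning.  cache_size is counted in pages,
     * so a bigger page size usually means a smaller cache_size value
     * for the same RAM budget. */
    static void tune_pages(sqlite3 *db)
    {
        /* Must run on a fresh database (or be followed by VACUUM). */
        sqlite3_exec(db, "PRAGMA page_size = 32768;", 0, 0, 0);

        /* 30000 pages * 32KB = roughly 1GB of page cache. */
        sqlite3_exec(db, "PRAGMA cache_size = 30000;", 0, 0, 0);
    }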

> Anyway, I'll build the indices after the seed.  I'll go out on a limb 
> and assume that'll be a lot faster than it would've been under my 
> initial approach.  

  Not always.  Building after insert is usually a bit faster, but only
  slightly.  Such is life.  If you want the performance, you need to
  pay the price.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson


Re: [sqlite] unexpected large journal file

2010-06-20 Thread Jay A. Kreibich
On Fri, Jun 18, 2010 at 07:01:25PM -0700, Scott Hess scratched on the wall:
> On Fri, Jun 18, 2010 at 5:24 PM, Eric Smith  wrote:
> > Jay A.  Kreibich wrote:
> >> Yes.  Hence the "and this is the important part" comment.  Most of
> >> the time when people are building billion-row files, they're building
> >> a new DB by importing a static source of data.  If things go wrong,
> >> you just throw out the database and try again.
> >
> > That's kinda like doing it all in a single big transaction, which I
> > wanted to avoid.  :)
> >
> > But my take-away from this conversation is generally "you're SOL":
> > we are backed by a tree, so we have to update existing nodes to point
> > to the new ones, so we have to touch existing pages on INSERT.  Is that
> > about right?
> 
> The problem is that if your incoming data is uniformly unsorted
> (essentially random), then for each transaction you'll find yourself
> updating approximately every page of the database.

  In a linear format, half on average.  You only move every element if
  you insert at the very beginning (e.g. a perfect reverse-sort).  The
  number of existing elements also grows as you add them, averaging N/2
  over the load.  So when inserting N elements, each insert requires,
  on average, about 25% of N elements to be updated.

  But databases don't work that way.  The B-Trees are specifically
  designed to address the needs of a paged database, and require minimal
  updates to keep the tree balanced.

  In this case, the original primary key was being inserted in-order,
  and the ROWID values, being generated, are clearly in-order.  It was
  already best case.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson


Re: [sqlite] unexpected large journal file

2010-06-20 Thread Black, Michael (IS)
Hmmm... a 6.5X speed diff between RAM and disk?  Sounds pretty good to me.  Not 
sure why you expect better. 10,000/sec is crappy?  And you think this 
because...
 
Several things you need to provide.
 
#1 What OS are you on?  There are numerous disk speed testing programs 
depending on your OS.
#2 Are you multi-threading?  A separate reader process could help.
#3 How many records total?
#4 Final size of database?
#5 How fast can you read your input file?
#6 What happens if you just insert the same records the same # of times
#7 What does your CPU usage show?  I assume you're multicore (as most are now I 
think).
 
Depending on what you're doing with this data are you sure you need a database 
solution?
 
I don't recall you really explaining your ultimate goal...
 
Michael D. Black
Senior Scientist
Northrop Grumman Mission Systems
 



From: sqlite-users-boun...@sqlite.org on behalf of Eric Smith
Sent: Sat 6/19/2010 11:58 AM
To: Jay A. Kreibich
Cc: sqlite-users@sqlite.org
Subject: Re: [sqlite] unexpected large journal file



Jay A.  Kreibich wrote:

> Try getting rid of the PK definition and see how much that buys you. 
> It might be worth it, or it might not. 

and Simon Slavin wrote:

> We know you are doing a huge amount of writing to this database.  Are
> you also reading it frequently ?  If not, then it might be worth making an
> index on that primary key only when you're about to need it. 

I think the use case will usually be (only) writes followed by (only)
reads.  There may be incremental writes later, but they will hopefully
be small compared to the initial seed.  I intend to create a set of
maybe 7 indices to serve the particular queries I think the user intends
to run.  Let's all hope we can update the db with those indices at a
higher rate than the user can generate data.  :p

I tried removing the PK definition as you both suggested, and the
journal stays fixed at less than 20Kb, even against a db size of (at
the moment) 37Gb.  My insert batch run times are improved by a factor of
~2.5 and seem to be O(1). 

So, bingo.  :)

Insert rates are still a rather crappy 10k records/second (when we were
in RAM we were doing ~65k recs/sec, which I think was bound by my app's
speed).  I think I'm at the stupid raid5's limit -- not really sure what
I should expect there, or how to find out what I should expect. 

Anyway, I'll build the indices after the seed.  I'll go out on a limb
and assume that'll be a lot faster than it would've been under my
initial approach. 

You guys were incredibly helpful -- thanks very much! 

Eric

--
Eric A. Smith

Carperpetuation (kar' pur pet u a shun), n.:
The act, when vacuuming, of running over a string at least a
dozen times, reaching over and picking it up, examining it, then
putting it back down to give the vacuum one more chance.
-- Rich Hall, "Sniglets"


Re: [sqlite] unexpected large journal file

2010-06-19 Thread Gerry Snyder
On 6/18/2010 7:01 PM, Scott Hess wrote:
>
> The old-school solution to this problem is an external sort (*).  The
> basic idea is that you process the incoming data in memory-sized
> chunks (ie, fast), then write out sorted subfiles.  Then you process
> the sorted files in parallel, merging to the output.
>

The other old-school solution involving preprocessing is block sorting 
or binning. You go through the data once and create a bunch of files of 
unsorted but grouped data. Then insert the data from the files in order. 
If the files are transaction-sized, each one will be adding to a fairly 
small range of values.
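
A minimal sketch of the binning pass, assuming a 64-bit key and 256 bins
keyed on its top bits (all names and sizes here are made up for
illustration):

  #include <stdio.h>
  #include <stdint.h>

  #define NBINS 256

  /* Append one record to the bin covering its key range.  A second pass
   * then replays each bin file into the database in its own transaction,
   * so every batch lands in a narrow slice of the key space. */
  static void bin_record(FILE *bins[NBINS], uint64_t key,
                         const void *rec, size_t reclen)
  {
      FILE *f = bins[key >> 56];       /* top 8 bits pick the bin */
      fwrite(&key, sizeof key, 1, f);
      fwrite(rec, reclen, 1, f);
  }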

Gerry




Re: [sqlite] unexpected large journal file

2010-06-19 Thread Eric Smith
Jay A.  Kreibich wrote: 

> Try getting rid of the PK definition and see how much that buys you.  
> It might be worth it, or it might not.  

and Simon Slavin wrote: 

> We know you are doing a huge amount of writing to this database.  Are 
> you also reading it frequently ?  If not, then it might be worth making an 
> index on that primary key only when you're about to need it.  

I think the use case will usually be (only) writes followed by (only) 
reads.  There may be incremental writes later, but they will hopefully 
be small compared to the initial seed.  I intend to create a set of 
maybe 7 indices to serve the particular queries I think the user intends 
to run.  Let's all hope we can update the db with those indices at a 
higher rate than the user can generate data.  :p 

I tried removing the PK definition as you both suggested, and the 
journal stays fixed at less than 20Kb, even against a db size of (at 
the moment) 37Gb.  My insert batch run times are improved by a factor of 
~2.5 and seem to be O(1).  

So, bingo.  :) 

Insert rates are still a rather crappy 10k records/second (when we were 
in RAM we were doing ~65k recs/sec, which I think was bound by my app's 
speed).  I think I'm at the stupid raid5's limit -- not really sure what 
I should expect there, or how to find out what I should expect.  

Anyway, I'll build the indices after the seed.  I'll go out on a limb 
and assume that'll be a lot faster than it would've been under my 
initial approach.  

You guys were incredibly helpful -- thanks very much!  

Eric 

-- 
Eric A. Smith

Carperpetuation (kar' pur pet u a shun), n.:
The act, when vacuuming, of running over a string at least a
dozen times, reaching over and picking it up, examining it, then
putting it back down to give the vacuum one more chance.
-- Rich Hall, "Sniglets"


Re: [sqlite] unexpected large journal file

2010-06-19 Thread Simon Slavin

On 19 Jun 2010, at 1:24am, Eric Smith wrote:

> It's not in any way a result of my schema?  My primary key is a pair of
> integers A,B.  The first column in this particular use case is in the 
> range A = [0, 2million) and the second is in the range B = [0, infinity).
> We
> 
> insert records A=0->2million, B=0, then
> insert records A=0->2million, B=1,

We know you are doing a huge amount of writing to this database.  Are you also 
reading it frequently ? If not, then it might be worth making an index on that 
primary key only when you're about to need it.

Simon.


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Scott Hess
On Fri, Jun 18, 2010 at 5:24 PM, Eric Smith  wrote:
> Jay A.  Kreibich wrote:
>> Yes.  Hence the "and this is the important part" comment.  Most of
>> the time when people are building billion-row files, they're building
>> a new DB by importing a static source of data.  If things go wrong,
>> you just throw out the database and try again.
>
> That's kinda like doing it all in a single big transaction, which I
> wanted to avoid.  :)
>
> But my take-away from this conversation is generally "you're SOL":
> we are backed by a tree, so we have to update existing nodes to point
> to the new ones, so we have to touch existing pages on INSERT.  Is that
> about right?

The problem is that if your incoming data is uniformly unsorted
(essentially random), then for each transaction you'll find yourself
updating approximately every page of the database.  When things are
small and fit in the cache, that's fine, but when they're large, it's
not great because you become entirely seek dominated.

The old-school solution to this problem is an external sort (*).  The
basic idea is that you process the incoming data in memory-sized
chunks (ie, fast), then write out sorted subfiles.  Then you process
the sorted files in parallel, merging to the output.  Since the merge
step only needs to see the next record for each input, this can make a
TREMENDOUS amount of difference for large datasets.

For certain datasets, you can just use GNU sort to accomplish this.
You can also code it up on top of SQLite by loading multiple
sub-tables in a temporary database, and then scanning them in sorted
order merging to an output table in your final database.
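
A hedged sketch of the SQLite-only variant, assuming the memory-sized
chunks were already loaded into tables chunk0..chunkN in a scratch
database (the file, table, and column names are made up here).  It leans
on SQLite's own sorter for the ORDER BY rather than a hand-rolled k-way
merge, but the destination table still receives its rows in key order:

  #include <sqlite3.h>

  /* Merge pre-loaded chunk tables into the final table in sorted order,
   * so the destination b-tree is appended to instead of churned. */
  static int merge_chunks(sqlite3 *db)
  {
      return sqlite3_exec(db,
          "ATTACH 'scratch.db' AS scratch;"
          "INSERT INTO main.x(a, b, val) "
          "  SELECT a, b, val FROM ("
          "    SELECT * FROM scratch.chunk0 "
          "    UNION ALL "
          "    SELECT * FROM scratch.chunk1"
          "  ) ORDER BY a, b;"
          "DETACH scratch;",
          0, 0, 0);
  }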

-scott

(*) http://en.wikipedia.org/wiki/External_sorting


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Jay A. Kreibich
On Fri, Jun 18, 2010 at 08:24:47PM -0400, Eric Smith scratched on the wall:
> Jay A.  Kreibich wrote: 
> 
> > Yes.  Hence the "and this is the important part" comment.  Most of 
> > the time when people are building billion-row files, they're building 
> > a new DB by importing a static source of data.  If things go wrong, 
> > you just throw out the database and try again.  
> 
> That's kinda like doing it all in a single big transaction, which I 
> wanted to avoid.  :) 

  Sorta, kinda, only you don't have the option of rolling it back.

> But my take-away from this conversation is generally "you're SOL": 
> we are backed by a tree, so we have to update existing nodes to point 
> to the new ones, so we have to touch existing pages on INSERT.  Is that 
> about right?  

  If the goal is to eliminate the journal file, then yes.

> It's not in any way a result of my schema? 

  It is directly dependent on the schema.  I don't remember seeing a
  schema, other than you stating you didn't have any user indexes.

> My primary key is a pair of integers A,B.

  Then you have an automatic unique index across these two columns.
  The additional index will cause more page churn.

> The first column in this particular use case is in the 
> range A = [0, 2million) and the second is in the range B = [0, infinity).
> We
> 
> insert records A=0->2million, B=0, then
> insert records A=0->2million, B=1,
> 
> etc.
> 
> Could this have an impact on how many pages need to be touched on
> INSERT?

  Yes, but my guess is that this is the ideal order.

> > It would also help to bump the cache up...  
> 
> That works great until the db size blows through the total RAM on the 
> system, at which point we're of course disk-bound again.

  Yes, but that's not the point.  The goal with making the cache large
  is that you can keep the majority of the BTree nodes in memory.  This
  makes balancing the tree much faster.  You're still going to be
  disk-bound, but the larger cache should help lower the total number
  of I/O operations.  
  
  If the trees don't quite fit into the cache things get *really* slow.

>  At the moment I'm only inserting about 4k rows/second. :/

  Try getting rid of the PK definition and see how much that buys you.
  It might be worth it, or it might not.

-j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Eric Smith
Jay A.  Kreibich wrote: 

> Yes.  Hence the "and this is the important part" comment.  Most of 
> the time when people are building billion-row files, they're building 
> a new DB by importing a static source of data.  If things go wrong, 
> you just throw out the database and try again.  

That's kinda like doing it all in a single big transaction, which I 
wanted to avoid.  :) 

But my take-away from this conversation is generally "you're SOL": 
we are backed by a tree, so we have to update existing nodes to point 
to the new ones, so we have to touch existing pages on INSERT.  Is that 
about right?  

It's not in any way a result of my schema?  My primary key is a pair of
integers A,B.  The first column in this particular use case is in the 
range A = [0, 2million) and the second is in the range B = [0, infinity).
We

insert records A=0->2million, B=0, then
insert records A=0->2million, B=1,

etc.

Could this have an impact on how many pages need to be touched on
INSERT?

> It would also help to bump the cache up...  

That works great until the db size blows through the total RAM on the 
system, at which point we're of course disk-bound again.  At the moment
I'm only inserting about 4k rows/second. :/

Eric

-- 
Eric A. Smith

I have always wished that my computer would be as easy to use as 
my telephone. My wish has come true. I no longer know how to use 
my telephone.
-- Bjarne Stroustrup


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Eric Smith
Simon Slavin wrote: 

> How do you pass the handle from your C code to your Tcl code ?  

I don't.  I pass it from Tcl to C.  

The handle in Tcl is a command name registered with the interpreter.  
SQLite attaches a structure to the registration that contains the 
underlying handle as used by the C API.  

I'm writing a Tcl extension in C and I don't want to touch a bunch of 
Tcl code.  The Tcl caller gives me its command-name handle to the DB.  
I ask the Tcl core for the registration data for that command name, and 
use it to get the underlying SQLite handle for my C calls.  

(Checking whether I've got a real Tcl db handle is not 100% foolproof, 
but takes care of basic caller stupidity and is good enough for my 
app.)  

With error checks removed:

Tcl_CmdInfo cmdInfo;
const char* zDbHandle = Tcl_GetStringFromObj(objv[2], 0);
Tcl_GetCommandInfo(pInterp, zDbHandle, &cmdInfo);
sqlite3 *pDb = *((sqlite3**)cmdInfo.objClientData);
/* Carry on, using pDb in C API calls.
*/

I was considering asking about this a while ago on this group, but opted 
not to for fear of being publicly flogged by drh. :)

Actually, I was hoping the SQLite devs would promise not to change the
way they do Tcl command registration in future releases, so this kind 
of thing will continue to work.

> You are sharing the same connection to the database between the two 
> languages, right ?  

Yep.  

> You're not doing your INSERTs from one connection and your COMMITs from 
> another ?  

Right, the one connection is shared.  

Eric 

-- 
Eric A. Smith

Substitute "damn" every time you're inclined to write "very"; your
editor will delete it and the writing will be just as it should be.
-- Mark Twain


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Simon Slavin

On 18 Jun 2010, at 8:47pm, Eric Smith wrote:

> Don't know what I was thinking when I typed that.  I'm sharing a
> connection in a single thread, mixing C API calls and Tcl API calls.  
> The C API calls drive the INSERTs; Tcl API calls drive BEGIN/COMMIT

How do you pass the handle from your C code to your Tcl code ?  You are sharing 
the same connection to the database between the two languages, right ?  You're 
not doing your INSERTs from one connection and your COMMITs from another ?

Simon.


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Jay A. Kreibich
On Fri, Jun 18, 2010 at 04:07:53PM -0400, Eric Smith scratched on the wall:
> Jay A. Kreibich wrote:
> 
> > > I'd really love to avoid writing a big journal file.  And I'd love to 
> > > avoid doing a billion-row insert in one transaction.
> > 
> >   So turn journaling off.
> 
> ... which implies possible corruption on app failure, right?

  Yes.  Hence the "and this is the important part" comment.  Most of
  the time when people are building billion-row files, they're building
  a new DB by importing a static source of data.  If things go wrong,
  you just throw out the database and try again.

  If that's not your situation, then turning journaling off may not be
  an option.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Eric Smith
Jay A. Kreibich wrote:

> > I'd really love to avoid writing a big journal file.  And I'd love to 
> > avoid doing a billion-row insert in one transaction.
> 
>   So turn journaling off.

... which implies possible corruption on app failure, right?

I want progress to be saved every once in a while so I don't have to
re-build the db from scratch when my app comes back to life.

--
Eric A. Smith

Don Daniels: How stable are tense systems in languages?
Tim Pulju:   I'd say about a 42 on a scale from 1 to 212.
-- Winter 2005


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Jay A. Kreibich
On Fri, Jun 18, 2010 at 03:37:19PM -0400, Eric Smith scratched on the wall:
> I have no user-defined indices in my db, and want to do a largish number 
> of inserts (a few billion).  I COMMIT every 10 million INSERTs or so -- 
> so if my app dies (or I want to kill it) I don't have to start over.  
> 
> Row sizes are small, a couple hundred bytes across 15ish columns.  
> The primary key on the table is a pair of integers.
> 
> After a few BEGIN/INSERT/COMMIT cycles the journal file grows 
> mid-transaction to a pretty big size, e.g.  around 1Gb against a 14Gb db 
> file, meaning (right?) that sqlite wrote to ~1Gb of the existing 
> pages during that round of INSERTs.  

  I'd guess this is B-Tree re-balancing.  It happens with both indexes
  and the tables themselves.  It will be worse if the file has an
  INTEGER PRIMARY KEY that you're providing, and isn't pre-sorted.

> This means the time spent doing a batch of INSERTs goes up as the number 
> of existing rows, which is a big frowny-face.* 

  Perhaps, but that's not unexpected.

> I'd really love to avoid writing a big journal file.  And I'd love to 
> avoid doing a billion-row insert in one transaction.

  So turn journaling off.  If you're building a database from an
  external source and (this is the important part) can re-build the
  database if something goes wrong, just turn off journaling and
  syncing for the duration of the data import.  It would also help to
  bump the cache up... if you're on a nice desktop with a few gigs of
  RAM, bump it up 10x to 100x.  There are PRAGMAs to do all this.
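
  For example (a sketch, not a recommendation -- with journal_mode OFF a
  crash mid-import can corrupt the file, which is exactly the trade-off
  above; the cache figure is just the default 2000 pages times 100):

    #include <sqlite3.h>

    /* One-shot import mode: no crash protection, big page cache.
     * Only safe if the database can be rebuilt from scratch. */
    static void import_mode(sqlite3 *db)
    {
        sqlite3_exec(db, "PRAGMA journal_mode = OFF;", 0, 0, 0);
        sqlite3_exec(db, "PRAGMA synchronous = OFF;", 0, 0, 0);
        sqlite3_exec(db, "PRAGMA cache_size = 200000;", 0, 0, 0);
    }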

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson


Re: [sqlite] unexpected large journal file

2010-06-18 Thread Eric Smith

Terribly sorry to self-reply, but I have a correction: 

> I'm using the Tcl API, which probably doesn't matter for this question.  

Don't know what I was thinking when I typed that.  I'm sharing a
connection in a single thread, mixing C API calls and Tcl API calls.  
The C API calls drive the INSERTs; Tcl API calls drive BEGIN/COMMIT.  

That all works fine, so don't worry about it.

Thanks again!  

Eric 

-- 
Eric A. Smith

Absurdity, n.:
A statement or belief manifestly inconsistent with one's 
own opinion.
-- Ambrose Bierce, "The Devil's Dictionary"


[sqlite] unexpected large journal file

2010-06-18 Thread Eric Smith
I have no user-defined indices in my db, and want to do a largish number 
of inserts (a few billion).  I COMMIT every 10 million INSERTs or so -- 
so if my app dies (or I want to kill it) I don't have to start over.  
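
For reference, the commit-batching pattern just described looks roughly like
this through the C API; BATCH is the 10-million figure above, and insert_one
stands in for the app's real insert:

  #include <sqlite3.h>

  #define BATCH 10000000L

  /* Commit every BATCH inserts so progress survives a crash without
   * holding one giant transaction open. */
  static void bulk_insert(sqlite3 *db, long nrecords,
                          void (*insert_one)(sqlite3 *, long))
  {
      sqlite3_exec(db, "BEGIN", 0, 0, 0);
      for (long i = 0; i < nrecords; i++) {
          insert_one(db, i);
          if ((i + 1) % BATCH == 0) {
              sqlite3_exec(db, "COMMIT", 0, 0, 0);
              sqlite3_exec(db, "BEGIN", 0, 0, 0);
          }
      }
      sqlite3_exec(db, "COMMIT", 0, 0, 0);
  }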

Row sizes are small, a couple hundred bytes across 15ish columns.  
The primary key on the table is a pair of integers.

After a few BEGIN/INSERT/COMMIT cycles the journal file grows 
mid-transaction to a pretty big size, e.g.  around 1Gb against a 14Gb db 
file, meaning (right?) that sqlite wrote to ~1Gb of the existing 
pages during that round of INSERTs.  

This means the time spent doing a batch of INSERTs goes up as the number 
of existing rows, which is a big frowny-face.* 

I'd really love to avoid writing a big journal file.  And I'd love to 
avoid doing a billion-row insert in one transaction.

So, if I'm only doing INSERTs and they're always on brand new rows and 
there are no indices, why does SQLite need to update existing pages in 
the DB?  (Is it updating an auto-index based on the primary key?)  Is 
there a way to avoid it?  

Thanks!

Eric 

PS I'm using 3.6.23.1 with defaults, except PRAGMA synchronous=OFF and 
foreign_keys=1 (my only foreign key refs are to tiny tables).  I'm using 
the Tcl API, which probably doesn't matter for this question.  

% db eval {pragma compile_options} 
ENABLE_FTS3 ENABLE_LOCKING_STYLE=0 TEMP_STORE=1 THREADSAFE=1 

* I found out today my users are using a raid5 array on the deployment 
box, so it's an even bigger frowny face than it would have been by 
default.  

-- 
Eric A. Smith

We don't like their sound, and guitar music is on the way out.
-- Decca Recording Co. rejecting the Beatles, 1962.