Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-07 Thread Kevin O'Gorman
On Sat, Aug 6, 2016 at 2:49 PM, Kevin O'Gorman 
wrote:

> On Sat, Aug 6, 2016 at 2:09 AM, Dan Kennedy  wrote:
>
>> On 08/06/2016 09:52 AM, Kevin O'Gorman wrote:
>>
>>> On Fri, Aug 5, 2016 at 2:03 PM, Dan Kennedy 
>>> wrote:
>>>
>>> On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:

 On Fri, Aug 5, 2016 at 1:08 PM, David Raymond  >
> wrote:
>
> ..
>
> Apart from the default location of the files, it reads like your next
> main
>
>> concern is how many temp files get opened up. My bet is that it'll be
>> a
>> very small number, just potentially huge in file size while it's doing
>> its
>> thing. But again, try that pragma and take a look.
>>
>> My best bet is the contrary:  it starts with small files and makes
>>
> increasingly larger ones, like the sort utility does.  The problem is
> that
> there are too many of them at the beginning for it to work with
> anonymous
> files (which sort does not use).  This at least offers a possible
> explanation of its getting wedged on large indexes: an unexpected and
> untested error, handled poorly.
>
> You could verify this by checking the number of open handles in
 "/proc//fd" after your process is wedged.

 Excellent idea.  I did not know about that possibility.  And sure
 enough,

>>> I'm wrong.  It's using anonymous files, all right, but only one or two
>>> at a
>>> time.  I assume they're big.  I'm in the process of bracketing where size
>>> begins to matter.  So far, 1/10 of the data loads and indexes just fine,
>>> albeit somewhat more slowly than the smaller samples predicted.  The
>>> database load took 6.5 minutes, the troublesome index 10 minutes.  At
>>> smaller sizes, indexing is faster than the database load.
>>>
>>> I'm trying 1/3 now (500 million lines)
>>>
>>
>> What does [top] tell you once the process becomes wedged? What percentage
>> is the CPU running at? Or is it completely bogged down waiting for IO?
>>
>> Dan
>
>
> I'm waiting for a good time to get the answer to this.  It takes a good
long while to get wedged, so I'll probably do it overnight tonight.
>

RETRACTION: it doesn't get wedged after all, it just takes about 4 times
longer than I expected.  On small inputs, it builds the indexes faster than
it loads the data table.  When I test on the whole data file, it takes 4 times
longer to build each index than it took to make the table.  I guess that's
what it looks like when building the table is O(n) (linear) and building the
index is O(n * log(n)), with log(n) getting bigger.

Sorry for the kerfuffle.  This is my first time working with datasets this
big.  Since it takes almost 6 hours to build each index (there are 2 at
this point), and I didn't know to expect that, I drew the wrong
conclusion.  My bad.

I'm just glad Mr. Hipp got into this thread and advised building the
indexes after the data is loaded.  It would have been _lots_ slower
inserting into each index individually, and in fact my first attempt -- the
one that gave me the initial idea that this had wedged -- took that approach,
and I decided it was wedged after a few _days_.  I suspect it wasn't wedged
either, but it might as well have been.

So far, this has been a feasibility exercise.  Building a 500 GB database
overnight is definitely feasible.  Now I can go on to see if it helps me
solve my problem.


-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-06 Thread Kevin O'Gorman
On Sat, Aug 6, 2016 at 2:09 AM, Dan Kennedy  wrote:

> On 08/06/2016 09:52 AM, Kevin O'Gorman wrote:
>
>> On Fri, Aug 5, 2016 at 2:03 PM, Dan Kennedy 
>> wrote:
>>
>> On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:
>>>
>>> On Fri, Aug 5, 2016 at 1:08 PM, David Raymond 
 wrote:

 ..

 Apart from the default location of the files, it reads like your next
 main

> concern is how many temp files get opened up. My bet is that it'll be a
> very small number, just potentially huge in file size while it's doing
> its
> thing. But again, try that pragma and take a look.
>
> My best bet is the contrary:  it starts with small files and makes
>
 increasingly larger ones, like the sort utility does.  The problem is
 that
 there are too many of them at the beginning for it to work with
 anonymous
 files (which sort does not use).  This at least offers a possible
 explanation of its getting wedged on large indexes: an unexpected and
 untested error, handled poorly.

 You could verify this by checking the number of open handles in
>>> "/proc//fd" after your process is wedged.
>>>
>>> Excellent idea.  I did not know about that possibility.  And sure enough,
>>>
>> I'm wrong.  It's using anonymous files, all right, but only one or two at
>> a
>> time.  I assume they're big.  I'm in the process of bracketing where size
>> begins to matter.  So far, 1/10 of the data loads and indexes just fine,
>> albeit somewhat more slowly than the smaller samples predicted.  The
>> database load took 6.5 minutes, the troublesome index 10 minutes.  At
>> smaller sizes, indexing is faster than the database load.
>>
>> I'm trying 1/3 now (500 million lines)
>>
>
> What does [top] tell you once the process becomes wedged? What percentage
> is the CPU running at? Or is it completely bogged down waiting for IO?
>
> Dan


I'm waiting for a good time to get the answer to this.  It takes a good
long while to get wedged, so I'll probably do it overnight tonight.


-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-06 Thread Dan Kennedy

On 08/06/2016 09:52 AM, Kevin O'Gorman wrote:

On Fri, Aug 5, 2016 at 2:03 PM, Dan Kennedy  wrote:


On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:


On Fri, Aug 5, 2016 at 1:08 PM, David Raymond 
wrote:

..

Apart from the default location of the files, it reads like your next main

concern is how many temp files get opened up. My bet is that it'll be a
very small number, just potentially huge in file size while it's doing
its
thing. But again, try that pragma and take a look.

My best bet is the contrary:  it starts with small files and makes

increasingly larger ones, like the sort utility does.  The problem is that
there are too many of them at the beginning for it to work with anonymous
files (which sort does not use).  This at least offers a possible
explanation of its getting wedged on large indexes: an unexpected and
untested error, handled poorly.


You could verify this by checking the number of open handles in
"/proc//fd" after your process is wedged.

Excellent idea.  I did not know about that possibility.  And sure enough,

I'm wrong.  It's using anonymous files, all right, but only one or two at a
time.  I assume they're big.  I'm in the process of bracketing where size
begins to matter.  So far, 1/10 of the data loads and indexes just fine,
albeit somewhat more slowly than the smaller samples predicted.  The
database load took 6.5 minutes, the troublesome index 10 minutes.  At
smaller sizes, indexing is faster than the database load.

I'm trying 1/3 now (500 million lines)


What does [top] tell you once the process becomes wedged? What 
percentage is the CPU running at? Or is it completely bogged down 
waiting for IO?


Dan.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread Kevin O'Gorman
On Fri, Aug 5, 2016 at 2:03 PM, Dan Kennedy  wrote:

> On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:
>
>> On Fri, Aug 5, 2016 at 1:08 PM, David Raymond 
>> wrote:
>>
>> ..
>>
>> Apart from the default location of the files, it reads like your next main
>>> concern is how many temp files get opened up. My bet is that it'll be a
>>> very small number, just potentially huge in file size while it's doing
>>> its
>>> thing. But again, try that pragma and take a look.
>>>
>>> My best bet is the contrary:  it starts with small files and makes
>> increasingly larger ones, like the sort utility does.  The problem is that
>> there are too many of them at the beginning for it to work with anonymous
>> files (which sort does not use).  This at least offers a possible
>> explanation of its getting wedged on large indexes: an unexpected and
>> untested error, handled poorly.
>>
>
> You could verify this by checking the number of open handles in
> "/proc//fd" after your process is wedged.
>
> Excellent idea.  I did not know about that possibility.  And sure enough,
I'm wrong.  It's using anonymous files, all right, but only one or two at a
time.  I assume they're big.  I'm in the process of bracketing where size
begins to matter.  So far, 1/10 of the data loads and indexes just fine,
>> albeit somewhat more slowly than the smaller samples predicted.  The
database load took 6.5 minutes, the troublesome index 10 minutes.  At
smaller sizes, indexing is faster than the database load.

I'm trying 1/3 now (500 million lines)

-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread Kevin O'Gorman
On Fri, Aug 5, 2016 at 3:03 PM, Darren Duncan 
wrote:

> On 2016-08-04 7:27 AM, Jim Callahan wrote:
>
>> Steps
>> Agree with Darren Duncan and Dr. Hipp you may want to have at least 3
>> separate steps
>> (each step should be a separate transaction):
>>
>> 1. Simple load
>> 2. Create additional column
>> 3. Create index
>>
>> Have you pre-defined the table you are loading data into? (step 0 CREATE
>> TABLE)
>>
>> If "Step 1 Simple Load" does not complete; then may want to load a fixed
>> number of rows into separate tables (per Darren Duncan)  and then combine
>> using an APPEND
>> or a UNION query (doing so before steps 2 and 3).
>>
>
> To be clear, my proposal of only loading a subset of rows was just
> intended for debugging the performance issues the OP was having and try and
> figure out what is causing the problem without having to wait too long for
> larger sets to complete.  I wasn't proposing splitting the load into
> separate databases and unioning later, rather the attempt database would be
> thrown away after each trial. -- Darren Duncan
>

Thanks, I understood that and it was a good set of suggestions.  Using
separate transactions led to focusing on the CREATE INDEX part of the
process.  That's where it gets wedged.  Working up to the full database, by
first processing smaller extracts, convinced me that size matters.  Smaller
(even millions of row) extracts do not wedge.

-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread Darren Duncan

On 2016-08-04 7:27 AM, Jim Callahan wrote:

Steps
Agree with Darren Duncan and Dr. Hipp you may want to have at least 3
separate steps
(each step should be a separate transaction):

1. Simple load
2. Create additional column
3. Create index

Have you pre-defined the table you are loading data into? (step 0 CREATE
TABLE)

If "Step 1 Simple Load" does not complete; then may want to load a fixed
number of rows into separate tables (per Darren Duncan)  and then combine
using an APPEND
or a UNION query (doing so before steps 2 and 3).


To be clear, my proposal of only loading a subset of rows was just intended for 
debugging the performance issues the OP was having and try and figure out what 
is causing the problem without having to wait too long for larger sets to 
complete.  I wasn't proposing splitting the load into separate databases and 
unioning later, rather the attempt database would be thrown away after each 
trial. -- Darren Duncan

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread Dan Kennedy

On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:

On Fri, Aug 5, 2016 at 1:08 PM, David Raymond <david.raym...@tomtom.com>
wrote:


There's a deprecated pragma, PRAGMA temp_store_directory =
'directory-name'; that appears to still work ok for now at least.
http://www.sqlite.org/pragma.html#pragma_temp_store_directory

I've used it for the same reasons you've both mentioned about space,
though for me it was vacuuming a huge db, and the full-size db copy that
makes, which ate up my entire C drive. But with that pragma you can specify
exactly which folder you want to use for your temp files. I'd suggest
trying that, then monitoring the folder you give it to see what shows up.


Since it's deprecated, I'll stick with TMPDIR because it's pretty standard
on Linux (it also works with the sort utility for instance).



Apart from the default location of the files, it reads like your next main
concern is how many temp files get opened up. My bet is that it'll be a
very small number, just potentially huge in file size while it's doing its
thing. But again, try that pragma and take a look.


My best bet is the contrary:  it starts with small files and makes
increasingly larger ones, like the sort utility does.  The problem is that
there are too many of them at the beginning for it to work with anonymous
files (which sort does not use).  This at least offers a possible
explanation of its getting wedged on large indexes: an unexpected and
untested error, handled poorly.


You could verify this by checking the number of open handles in 
"/proc//fd" after your process is wedged.











-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
On Behalf Of Kevin O'Gorman
Sent: Friday, August 05, 2016 3:41 PM
To: SQLite mailing list
Subject: Re: [sqlite] newbie has waited days for a DB build to complete.
what's up with this.

On Fri, Aug 5, 2016 at 12:30 PM, Igor Korot <ikoro...@gmail.com> wrote:


Hi, Kevin,

On Fri, Aug 5, 2016 at 3:18 PM, Kevin O'Gorman <kevinogorm...@gmail.com>
wrote:

Okay, I followed some of the advice y'all gave and got some results.

1. The original problem was compromised by malformed input.  However,

it

appears that did not cause the wedging of the process.  See (3) below.

Where will the data come from?
From the user? The Internet?

What I'm getting at is - you need to look for some malformed data in
the future as well.


I generate it.  I goofed, and I'll try not to goof in the future.



2. I separated the steps, and started small.  Time increased slightly
sub-linearly with dataset size, so I jumped to doing the whole thing.

With

proper input, the data was loaded in 68 minutes.

3. The CREATE INDEX steps failed quickly (2 minutes), reporting

"database

or disk is full" which seemed odd since most of my partitions have much
more free space than the entire database.  It turns out that whatever

does

the creation was using space on my root partition (this is Linux, so

that

means "/").  That's the only partition in my setup without a huge

amount

of

> free space.  One would expect temporary stuff to go to /tmp (which has

3TB

> free), but it doesn't go there.  It would go there if the system's

native

"sort" program were used.  Fortunately, it turns out that the TMPDIR
environment variable is honored, but while I could see space was being
used, there were no files visible.  I take that to mean that the

tmpfile()

function (or equivalent) was used.  This could be a bad idea for large
indexes because anonymous files have to be kept open, and there's a

limit

on the number of files that can be open at a time, around 1,000.  Sure
enough, the index creation appears to be wedged like the original run,

and

after a few hours I killed it manually.  This is a deal-killer.

The failure you saw - is it on the table with the complete data set?
Or you got it during the experimenting?

Only on the complete data set.



So the questions are: Where do bug reports go?  I seem to be running

3.8.2;

is this fixed in any later version?

You can try the "3.14" pre-released one right now. ;-)


Meh.  I submitted a bug report to this list.  I'll see what happens.



Thank you.



On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman <

kevinogorm...@gmail.com>

wrote:


The metric for feasibility is coding ease, not runtime.  I'm the
bottleneck, not the machine, at least at this point.

As for adding rows, it will be about like this time: a billion or so

at

a

time.  But there's no need to save the old data.  Each round can be
separate except for a persistent "solutions" table of much more modest
size.  I've been doing this for a while now, and the solutions file

has

only 10 million or so lines, each representing a game position for

which

optimum moves are known.  Getting this file to include the starting
position is the point of the exercise.

If I ever get to anything like "product

Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread Kevin O'Gorman
On Fri, Aug 5, 2016 at 1:08 PM, David Raymond <david.raym...@tomtom.com>
wrote:

> There's a deprecated pragma, PRAGMA temp_store_directory =
> 'directory-name'; that appears to still work ok for now at least.
> http://www.sqlite.org/pragma.html#pragma_temp_store_directory
>
> I've used it for the same reasons you've both mentioned about space,
> though for me it was vacuuming a huge db, and the full-size db copy that
> makes, which ate up my entire C drive. But with that pragma you can specify
> exactly which folder you want to use for your temp files. I'd suggest
> trying that, then monitoring the folder you give it to see what shows up.
>

Since it's deprecated, I'll stick with TMPDIR because it's pretty standard
on Linux (it also works with the sort utility for instance).


> Apart from the default location of the files, it reads like your next main
> concern is how many temp files get opened up. My bet is that it'll be a
> very small number, just potentially huge in file size while it's doing its
> thing. But again, try that pragma and take a look.
>

My best bet is the contrary:  it starts with small files and makes
increasingly larger ones, like the sort utility does.  The problem is that
there are too many of them at the beginning for it to work with anonymous
files (which sort does not use).  This at least offers a possible
explanation of its getting wedged on large indexes: an unexpected and
untested error, handled poorly.


>
> -Original Message-
> From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
> On Behalf Of Kevin O'Gorman
> Sent: Friday, August 05, 2016 3:41 PM
> To: SQLite mailing list
> Subject: Re: [sqlite] newbie has waited days for a DB build to complete.
> what's up with this.
>
> On Fri, Aug 5, 2016 at 12:30 PM, Igor Korot <ikoro...@gmail.com> wrote:
>
> > Hi, Kevin,
> >
> > On Fri, Aug 5, 2016 at 3:18 PM, Kevin O'Gorman <kevinogorm...@gmail.com>
> > wrote:
> > > Okay, I followed some of the advice y'all gave and got some results.
> > >
> > > 1. The original problem was compromised by malformed input.  However,
> it
> > > appears that did not cause the wedging of the process.  See (3) below.
> >
> > Where will the data come from?
> > From the user? The Internet?
> >
> > What I'm getting at is - you need to look for some malformed data in
> > the future as well.
> >
>
> I generate it.  I goofed, and I'll try not to goof in the future.
>
>
> >
> > >
> > > 2. I separated the steps, and started small.  Time increased slightly
> > > sub-linearly with dataset size, so I jumped to doing the whole thing.
> > With
> > > proper input, the data was loaded in 68 minutes.
> > >
> > > 3. The CREATE INDEX steps failed quickly (2 minutes), reporting
> "database
> > > or disk is full" which seemed odd since most of my partitions have much
> > > more free space than the entire database.  It turns out that whatever
> > does
> > > the creation was using space on my root partition (this is Linux, so
> that
> > > means "/").  That's the only partition in my setup without a huge
> amount
> > of
> > > free space.  One would expect temporary stuff to go to /tmp (which has
> 3TB
> > > free), but it doesn't go there.  It would go there if the system's
> native
> > > "sort" program were used.  Fortunately, it turns out that the TMPDIR
> > > environment variable is honored, but while I could see space was being
> > > used, there were no files visible.  I take that to mean that the
> > tmpfile()
> > > function (or equivalent) was used.  This could be a bad idea for large
> > > indexes because anonymous files have to be kept open, and there's a
> limit
> > > on the number of files that can be open at a time, around 1,000.  Sure
> > > enough, the index creation appears to be wedged like the original run,
> > and
> > > after a few hours I killed it manually.  This is a deal-killer.
> >
> > The failure you saw - is it on the table with the complete data set?
> > Or you got it during the experimenting?
> >
> > Only on the complete data set.
>
>
> > >
> > > So the questions are: Where do bug reports go?  I seem to be running
> > 3.8.2;
> > > is this fixed in any later version?
> >
> > You can try the "3.14" pre-released one right now. ;-)
> >
>
> Meh.  I submitted a bug report to this list.  I'll see what happens.
>
>
> > Thank you.
> >
> > >
> > >
> > > On Thu, Aug 4, 2016 at 9:27 AM, 

Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread David Raymond
There's a deprecated pragma, PRAGMA temp_store_directory = 'directory-name';
that appears to still work ok for now at least.
http://www.sqlite.org/pragma.html#pragma_temp_store_directory

I've used it for the same reasons you've both mentioned about space, though for
me it was vacuuming a huge db, and the full-size db copy that makes, which ate up
my entire C drive. But with that pragma you can specify exactly which folder you
want to use for your temp files. I'd suggest trying that, then monitoring the
folder you give it to see what shows up.

Apart from the default location of the files, it reads like your next main 
concern is how many temp files get opened up. My bet is that it'll be a very 
small number, just potentially huge in file size while it's doing its thing. 
But again, try that pragma and take a look.
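
For what it's worth, a minimal C sketch of that approach, assuming a database
named qmoves.db, an index on a position column, and /big/scratch as the roomy
temp directory (all placeholder names); the pragma is deprecated, as noted
above, but still honored:

    /* Point SQLite's temporary files at a directory with plenty of space
       before running the big CREATE INDEX.  All names are illustrative. */
    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("qmoves.db", &db) != SQLITE_OK) return 1;

        if (sqlite3_exec(db, "PRAGMA temp_store_directory = '/big/scratch';",
                         NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "pragma failed: %s\n", err);
            sqlite3_free(err);
        }
        if (sqlite3_exec(db, "CREATE INDEX IF NOT EXISTS qmoves_pos "
                             "ON qmoves(position);",
                         NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "index build failed: %s\n", err);
            sqlite3_free(err);
        }
        sqlite3_close(db);
        return 0;
    }

Compile with something like cc temp.c -lsqlite3; watching the chosen folder
while the index builds shows where the sort spills actually go.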



-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Kevin O'Gorman
Sent: Friday, August 05, 2016 3:41 PM
To: SQLite mailing list
Subject: Re: [sqlite] newbie has waited days for a DB build to complete. what's 
up with this.

On Fri, Aug 5, 2016 at 12:30 PM, Igor Korot <ikoro...@gmail.com> wrote:

> Hi, Kevin,
>
> On Fri, Aug 5, 2016 at 3:18 PM, Kevin O'Gorman <kevinogorm...@gmail.com>
> wrote:
> > Okay, I followed some of the advice y'all gave and got some results.
> >
> > 1. The original problem was compromised by malformed input.  However, it
> > appears that did not cause the wedging of the process.  See (3) below.
>
> Where will the data come from?
> From the user? The Internet?
>
> What I'm getting at is - you need to look for some malformed data in
> the future as well.
>

I generate it.  I goofed, and I'll try not to goof in the future.


>
> >
> > 2. I separated the steps, and started small.  Time increased slightly
> > sub-linearly with dataset size, so I jumped to doing the whole thing.
> With
> > proper input, the data was loaded in 68 minutes.
> >
> > 3. The CREATE INDEX steps failed quickly (2 minutes), reporting "database
> > or disk is full" which seemed odd since most of my partitions have much
> > more free space than the entire database.  It turns out that whatever
> does
> > the creation was using space on my root partition (this is Linux, so that
> > means "/").  That's the only partition in my setup without a huge amount
> of
> > free space.  One would expect temporary stuff to go to /tmp (which has 3TB
> > free), but it doesn't go there.  It would go there if the system's native
> > "sort" program were used.  Fortunately, it turns out that the TMPDIR
> > environment variable is honored, but while I could see space was being
> > used, there were no files visible.  I take that to mean that the
> tmpfile()
> > function (or equivalent) was used.  This could be a bad idea for large
> > indexes because anonymous files have to be kept open, and there's a limit
> > on the number of files that can be open at a time, around 1,000.  Sure
> > enough, the index creation appears to be wedged like the original run,
> and
> > after a few hours I killed it manually.  This is a deal-killer.
>
> The failure you saw - is it on the table with the complete data set?
> Or you got it during the experimenting?
>
> Only on the complete data set.


> >
> > So the questions are: Where do bug reports go?  I seem to be running
> 3.8.2;
> > is this fixed in any later version?
>
> You can try the "3.14" pre-released one right now. ;-)
>

Meh.  I submitted a bug report to this list.  I'll see what happens.


> Thank you.
>
> >
> >
> > On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman <kevinogorm...@gmail.com>
> > wrote:
> >
> >> The metric for feasibility is coding ease, not runtime.  I'm the
> >> bottleneck, not the machine, at least at this point.
> >>
> >> As for adding rows, it will be about like this time: a billion or so at
> a
> >> time.  But there's no need to save the old data.  Each round can be
> >> separate except for a persistent "solutions" table of much more modest
> >> size.  I've been doing this for a while now, and the solutions file has
> >> only 10 million or so lines, each representing a game position for which
> >> optimum moves are known.  Getting this file to include the starting
> >> position is the point of the exercise.
> >>
> >> If I ever get to anything like "production" in this project, I expect it
> >> to run for maybe three years...  That's after I tweak it for speed.
> >>
> >> Background: in production, this will be running on a dual-Xe

Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread Kevin O'Gorman
On Fri, Aug 5, 2016 at 12:30 PM, Igor Korot  wrote:

> Hi, Kevin,
>
> On Fri, Aug 5, 2016 at 3:18 PM, Kevin O'Gorman 
> wrote:
> > Okay, I followed some of the advice y'all gave and got some results.
> >
> > 1. The original problem was compromised by malformed input.  However, it
> > appears that did not cause the wedging of the process.  See (3) below.
>
> Where will the data come from?
> From the user? The Internet?
>
> What I'm getting at is - you need to look for some malformed data in
> the future as well.
>

I generate it.  I goofed, and I'll try not to goof in the future.


>
> >
> > 2. I separated the steps, and started small.  Time increased slightly
> > sub-linearly with dataset size, so I jumped to doing the whole thing.
> With
> > proper input, the data was loaded in 68 minutes.
> >
> > 3. The CREATE INDEX steps failed quickly (2 minutes), reporting "database
> > or disk is full" which seemed odd since most of my partitions have much
> > more free space than the entire database.  It turns out that whatever
> does
> > the creation was using space on my root partition (this is Linux, so that
> > means "/").  That's the only partition in my setup without a huge amount
> of
> > free space.  One would expect temporary stuff to go to /tmp (which has 3TB
> > free), but it doesn't go there.  It would go there if the system's native
> > "sort" program were used.  Fortunately, it turns out that the TMPDIR
> > environment variable is honored, but while I could see space was being
> > used, there were no files visible.  I take that to mean that the
> tmpfile()
> > function (or equivalent) was used.  This could be a bad idea for large
> > indexes because anonymous files have to be kept open, and there's a limit
> > on the number of files that can be open at a time, around 1,000.  Sure
> > enough, the index creation appears to be wedged like the original run,
> and
> > after a few hours I killed it manually.  This is a deal-killer.
>
> The failure you saw - is it on the table with the complete data set?
> Or you got it during the experimenting?
>
> Only on the complete data set.


> >
> > So the questions are: Where do bug reports go?  I seem to be running
> 3.8.2;
> > is this fixed in any later version?
>
> You can try the "3.14" pre-released one right now. ;-)
>

Meh.  I submitted a bug report to this list.  I'll see what happens.


> Thank you.
>
> >
> >
> > On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman 
> > wrote:
> >
> >> The metric for feasibility is coding ease, not runtime.  I'm the
> >> bottleneck, not the machine, at least at this point.
> >>
> >> As for adding rows, it will be about like this time: a billion or so at
> a
> >> time.  But there's no need to save the old data.  Each round can be
> >> separate except for a persistent "solutions" table of much more modest
> >> size.  I've been doing this for a while now, and the solutions file has
> >> only 10 million or so lines, each representing a game position for which
> >> optimum moves are known.  Getting this file to include the starting
> >> position is the point of the exercise.
> >>
> >> If I ever get to anything like "production" in this project, I expect it
> >> to run for maybe three years...  That's after I tweak it for speed.
> >>
> >> Background: in production, this will be running on a dual-Xeon with 16
> >> cores (32 hyperthreads) and 1/4 TiB RAM.  It has sequential file update
> >> through Linux flock() calls at the moment.  The code is bash gluing
> >> together a collection of UNIX utilities and some custom C code.  The C
> is
> >> kept as simple as possible, to minimize errors.
> >>
> >> As you may surmise, this "hobby" is important to me.
> >>
> >>
> >> On Thu, Aug 4, 2016 at 9:09 AM, R Smith  wrote:
> >>
> >>>
> >>>
> >>> On 2016/08/04 5:56 PM, Kevin O'Gorman wrote:
> >>>
>  On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne <
> ddevie...@gmail.com>
>  wrote:
> 
> 
>  It's even less dense than that.  Each character has only 3 possible
>  values,
>  and thus it's pretty easy to compress down to 2 bits each, for a 16
> byte
>  blob.
>  It's just hard to do that without a bunch of SQLite code I'd have to
>  learn
>  how to write.  The current effort amounts to a feasibility study, and
> I
>  want
>  to keep it as simple as possible.
> 
> >>>
> >>> A feasibility study using equipment that is hamstrung by weights it
> >>> won't have in the real situation is not an accurate study.
> >>>
> >>> It's like studying fuel consumption on a different kind of road
> surface,
> >>> but for the test purposes, the cars had to tow caravans containing
> their
> >>> testing equipment - the study will not look feasible at all.
> >>>
> >>> It might of course be that the feasibility you are studying is
> completely
> >>> unrelated to the data handling - in which case the point is moot.
> >>>
> >>> Let us know 

Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread Igor Korot
Hi, Kevin,

On Fri, Aug 5, 2016 at 3:18 PM, Kevin O'Gorman  wrote:
> Okay, I followed some of the advice y'all gave and got some results.
>
> 1. The original problem was compromised by malformed input.  However, it
> appears that did not cause the wedging of the process.  See (3) below.

Where will the data come from?
From the user? The Internet?

What I'm getting at is - you need to look for some malformed data in
the future as well.

>
> 2. I separated the steps, and started small.  Time increased slightly
> sub-linearly with dataset size, so I jumped to doing the whole thing.  With
> proper input, the data was loaded in 68 minutes.
>
> 3. The CREATE INDEX steps failed quickly (2 minutes), reporting "database
> or disk is full" which seemed odd since most of my partitions have much
> more free space than the entire database.  It turns out that whatever does
> the creation was using space on my root partition (this is Linux, so that
> means "/").  That's the only partition in my setup without a huge amount of
> free space.  One would expect temporary stuff to go to /tmp (which has 3TB
> free), but it doesn't go there.  It would go there if the system's native
> "sort" program were used.  Fortunately, it turns out that the TMPDIR
> environment variable is honored, but while I could see space was being
> used, there were no files visible.  I take that to mean that the tmpfile()
> function (or equivalent) was used.  This could be a bad idea for large
> indexes because anonymous files have to be kept open, and there's a limit
> on the number of files that can be open at a time, around 1,000.  Sure
> enough, the index creation appears to be wedged like the original run, and
> after a few hours I killed it manually.  This is a deal-killer.

The failure you saw - is it on the table with the complete data set?
Or you got it during the experimenting?

>
> So the questions are: Where do bug reports go?  I seem to be running 3.8.2;
> is this fixed in any later version?

You can try the "3.14" pre-released one right now. ;-)

Thank you.

>
>
> On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman 
> wrote:
>
>> The metric for feasibility is coding ease, not runtime.  I'm the
>> bottleneck, not the machine, at least at this point.
>>
>> As for adding rows, it will be about like this time: a billion or so at a
>> time.  But there's no need to save the old data.  Each round can be
>> separate except for a persistent "solutions" table of much more modest
>> size.  I've been doing this for a while now, and the solutions file has
>> only 10 million or so lines, each representing a game position for which
>> optimum moves are known.  Getting this file to include the starting
>> position is the point of the exercise.
>>
>> If I ever get to anything like "production" in this project, I expect it
>> to run for maybe three years...  That's after I tweak it for speed.
>>
>> Background: in production, this will be running on a dual-Xeon with 16
>> cores (32 hyperthreads) and 1/4 TiB RAM.  It has sequential file update
>> through Linux flock() calls at the moment.  The code is bash gluing
>> together a collection of UNIX utilities and some custom C code.  The C is
>> kept as simple as possible, to minimize errors.
>>
>> As you may surmise, this "hobby" is important to me.
>>
>>
>> On Thu, Aug 4, 2016 at 9:09 AM, R Smith  wrote:
>>
>>>
>>>
>>> On 2016/08/04 5:56 PM, Kevin O'Gorman wrote:
>>>
 On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne 
 wrote:


 It's even less dense than that.  Each character has only 3 possible
 values,
 and thus it's pretty easy to compress down to 2 bits each, for a 16 byte
 blob.
 It's just hard to do that without a bunch of SQLite code I'd have to
 learn
 how to write.  The current effort amounts to a feasibility study, and I
 want
 to keep it as simple as possible.

>>>
> >>> A feasibility study using equipment that is hamstrung by weights it
> >>> won't have in the real situation is not an accurate study.
>>>
>>> It's like studying fuel consumption on a different kind of road surface,
>>> but for the test purposes, the cars had to tow caravans containing their
>>> testing equipment - the study will not look feasible at all.
>>>
>>> It might of course be that the feasibility you are studying is completely
>>> unrelated to the data handling - in which case the point is moot.
>>>
>>> Let us know how it goes :)
>>> Ryan
>>>
>>> ___
>>> sqlite-users mailing list
>>> sqlite-users@mailinglists.sqlite.org
>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>>
>>
>>
>>
>> --
>> #define QUESTION ((bb) || (!bb)) /* Shakespeare */
>>
>
>
>
> --
> #define QUESTION ((bb) || (!bb)) /* Shakespeare */
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> 

Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-05 Thread Kevin O'Gorman
Okay, I followed some of the advice y'all gave and got some results.

1. The original problem was compromised by malformed input.  However, it
appears that did not cause the wedging of the process.  See (3) below.

2. I separated the steps, and started small.  Time increased slightly
sub-linearly with dataset size, so I jumped to doing the whole thing.  With
proper input, the data was loaded in 68 minutes.

3. The CREATE INDEX steps failed quickly (2 minutes), reporting "database
or disk is full" which seemed odd since most of my partitions have much
more free space than the entire database.  It turns out that whatever does
the creation was using space on my root partition (this is Linux, so that
means "/").  That's the only partition in my setup without a huge amount of
free space.  One would expect temporary stuff to go to /tmp (which has 3TB
free), but it doesn't go there.  It would go there if the system's native
"sort" program were used.  Fortunately, it turns out that the TMPDIR
environment variable is honored, but while I could see space was being
used, there were no files visible.  I take that to mean that the tmpfile()
function (or equivalent) was used.  This could be a bad idea for large
indexes because anonymous files have to be kept open, and there's a limit
on the number of files that can be open at a time, around 1,000.  Sure
enough, the index creation appears to be wedged like the original run, and
after a few hours I killed it manually.  This is a deal-killer.
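
One cheap way to test the open-file-limit theory, assuming the ~1,000 figure
is the usual per-process RLIMIT_NOFILE soft limit of 1024 on Linux, would be
to raise the limit before opening the database.  A rough sketch:

    /* Print the current descriptor limits and raise the soft limit to the
       hard limit before the index build.  Purely illustrative. */
    #include <sys/resource.h>
    #include <stdio.h>

    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

        rl.rlim_cur = rl.rlim_max;                 /* raise soft to hard */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("setrlimit"); return 1; }
        return 0;
    }

(As the retraction elsewhere in the thread shows, the limit turned out not to
be the problem, but the check costs almost nothing.)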

So the questions are: Where do bug reports go?  I seem to be running 3.8.2;
is this fixed in any later version?


On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman 
wrote:

> The metric for feasibility is coding ease, not runtime.  I'm the
> bottleneck, not the machine, at least at this point.
>
> As for adding rows, it will be about like this time: a billion or so at a
> time.  But there's no need to save the old data.  Each round can be
> separate except for a persistent "solutions" table of much more modest
> size.  I've been doing this for a while now, and the solutions file has
> only 10 million or so lines, each representing a game position for which
> optimum moves are known.  Getting this file to include the starting
> position is the point of the exercise.
>
> If I ever get to anything like "production" in this project, I expect it
> to run for maybe three years...  That's after I tweak it for speed.
>
> Background: in production, this will be running on a dual-Xeon with 16
> cores (32 hyperthreads) and 1/4 TiB RAM.  It has sequential file update
> through Linux flock() calls at the moment.  The code is bash gluing
> together a collection of UNIX utilities and some custom C code.  The C is
> kept as simple as possible, to minimize errors.
>
> As you may surmise, this "hobby" is important to me.
>
>
> On Thu, Aug 4, 2016 at 9:09 AM, R Smith  wrote:
>
>>
>>
>> On 2016/08/04 5:56 PM, Kevin O'Gorman wrote:
>>
>>> On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne 
>>> wrote:
>>>
>>>
>>> It's even less dense than that.  Each character has only 3 possible
>>> values,
>>> and thus it's pretty easy to compress down to 2 bits each, for a 16 byte
>>> blob.
>>> It's just hard to do that without a bunch of SQLite code I'd have to
>>> learn
>>> how to write.  The current effort amounts to a feasibility study, and I
>>> want
>>> to keep it as simple as possible.
>>>
>>
>> A feasibility study using equipment that is hamstrung by weights it
>> won't have in the real situation is not an accurate study.
>>
>> It's like studying fuel consumption on a different kind of road surface,
>> but for the test purposes, the cars had to tow caravans containing their
>> testing equipment - the study will not look feasible at all.
>>
>> It might of course be that the feasibility you are studying is completely
>> unrelated to the data handling - in which case the point is moot.
>>
>> Let us know how it goes :)
>> Ryan
>>
>> ___
>> sqlite-users mailing list
>> sqlite-users@mailinglists.sqlite.org
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
>
>
>
> --
> #define QUESTION ((bb) || (!bb)) /* Shakespeare */
>



-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Kevin O'Gorman
The metric for feasibility is coding ease, not runtime.  I'm the
bottleneck, not the machine, at least at this point.

As for adding rows, it will be about like this time: a billion or so at a
time.  But there's no need to save the old data.  Each round can be
separate except for a persistent "solutions" table of much more modest
size.  I've been doing this for a while now, and the solutions file has
only 10 million or so lines, each representing a game position for which
optimum moves are known.  Getting this file to include the starting
position is the point of the exercise.

If I ever get to anything like "production" in this project, I expect it to
run for maybe three years...  That's after I tweak it for speed.

Background: in production, this will be running on a dual-Xeon with 16
cores (32 hyperthreads) and 1/4 TiB RAM.  It has sequential file update
through Linux flock() calls at the moment.  The code is bash gluing
together a collection of UNIX utilities and some custom C code.  The C is
kept as simple as possible, to minimize errors.

As you may surmise, this "hobby" is important to me.


On Thu, Aug 4, 2016 at 9:09 AM, R Smith  wrote:

>
>
> On 2016/08/04 5:56 PM, Kevin O'Gorman wrote:
>
>> On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne 
>> wrote:
>>
>>
>> It's even less dense than that.  Each character has only 3 possible
>> values,
>> and thus it's pretty easy to compress down to 2 bits each, for a 16 byte
>> blob.
>> It's just hard to do that without a bunch of SQLite code I'd have to learn
>> how to write.  The current effort amounts to a feasibility study, and I
>> want
>> to keep it as simple as possible.
>>
>
> A feasibility study using equipment that is hamstrung by weights it
> won't have in the real situation is not an accurate study.
>
> It's like studying fuel consumption on a different kind of road surface,
> but for the test purposes, the cars had to tow caravans containing their
> testing equipment - the study will not look feasible at all.
>
> It might of course be that the feasibility you are studying is completely
> unrelated to the data handling - in which case the point is moot.
>
> Let us know how it goes :)
> Ryan
>
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread R Smith



On 2016/08/04 5:56 PM, Kevin O'Gorman wrote:

On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne 
wrote:


It's even less dense than that.  Each character has only 3 possible values,
and thus it's pretty easy to compress down to 2 bits each, for a 16 byte
blob.
It's just hard to do that without a bunch of SQLite code I'd have to learn
how to write.  The current effort amounts to a feasibility study, and I want
to keep it as simple as possible.


A feasibility study using equipment that is hamstrung by weights it
won't have in the real situation is not an accurate study.


It's like studying fuel consumption on a different kind of road surface, 
but for the test purposes, the cars had to tow caravans containing their 
testing equipment - the study will not look feasible at all.


It might of course be that the feasibility you are studying is 
completely unrelated to the data handling - in which case the point is moot.


Let us know how it goes :)
Ryan
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Kevin O'Gorman
On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne 
wrote:

> On Thu, Aug 4, 2016 at 5:05 PM, Kevin O'Gorman 
> wrote:
>
> > 3. Positions are 64 bytes always, so your size guesses are right.  They
> are
> > in no particular order.  I like the suggestion of a separate position
> > table, because they're going to appear in multiple qmove records, with an
> > average of about 3 or 4 appearances I think.  Maybe more.
> >
>
> 3x or 4x duplication doesn't sound like a lot. What you'd gain in the moves
> table,
> you'd lose having to join to the positions table I suspect. Higher level
> duplicates, maybe.
>
> 64-bytes always? Is that a human readable string, or some 'code' in
> hexadecimal?
> If the latter, use a blob, which requires only 32-bytes to store the same
> info. You can
> use the blob literal notation x'abcdef01' (that's a 4-bytes blob).
>

It's even less dense than that.  Each character has only 3 possible values,
and thus it's pretty easy to compress down to 2 bits each, for a 16 byte
blob.
It's just hard to do that without a bunch of SQLite code I'd have to learn
how to write.  The current effort amounts to a feasibility study, and I want
to keep it as simple as possible.
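
For the record, the packing itself is only a few lines of C.  A sketch,
assuming the three cell values are written as '.', 'x' and 'o' (the actual
symbols may differ):

    /* Pack a 64-character position (3 possible values per cell) into a
       16-byte blob, 2 bits per cell.  The symbol set is an assumption. */
    #include <stdint.h>
    #include <string.h>

    static int cell_code(char c)
    {
        switch (c) {
            case '.': return 0;
            case 'x': return 1;
            case 'o': return 2;
            default:  return 3;   /* malformed input; shouldn't happen */
        }
    }

    /* pos must be exactly 64 characters; blob receives the 16 packed bytes */
    void pack_position(const char pos[64], uint8_t blob[16])
    {
        memset(blob, 0, 16);
        for (int i = 0; i < 64; i++)
            blob[i / 4] |= (uint8_t)(cell_code(pos[i]) << ((i % 4) * 2));
    }

The resulting 16-byte value can then be bound directly as a blob, so nothing
ever needs to be converted to or from hex text.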


> Finally, note that if your program writes a huge text file with all your
> values, that you
> .import into sqlite3 as you showed, you're IMHO wasting time, since you
> can't use
> prepared statements and binds, and you also force SQLite's SQL parser to
> parse
> a huge amount of text. By embedding SQLite into your generator program, you
> remove all parsing except for a trivial "insert into qmoves values (:1, :2,
> ...)", and
> all the rest is sqlite_bind*() and co. calls. (and if blob situation for
> positions, then
> you can bind the 32-bytes blob directly, no need to convert/parse to/from
> hex).
>

I understand the concept of prepared statements in principle, sort of, but I
don't know how binds work really, so I'm not quite ready to write the code you
allude to.  And I'd no longer be able to use sqlite3 at all to do simple
experiments -- I'd have to be blobbing and de-blobbing to make sense of
anything.


>
> My $0.02. --DD
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread R Smith



On 2016/08/04 5:05 PM, Kevin O'Gorman wrote:

Lots of answers, so thanks all around.  Some more info:

1. All partitions have at least 3 GB free, and it's not changing.  /tmp is
3 TiB and empty.
2. I have a RAID partition, for size, but no RAID controller.  As a hobby
project, I don't have spare parts, and I fear the results of a failure of a
hardware RAID without a spare, so I use Linux mdadm to manage software RAID
across three 4-TB drives.
3. Positions are 64 bytes always, so your size guesses are right.  They are
in no particular order.  I like the suggestion of a separate position
table, because they're going to appear in multiple qmove records, with an
average of about 3 or 4 appearances I think.  Maybe more.

I'm going to retry, using some of the suggestions above: smaller
transactions, IGNORE, position table rowids in the moves table, smaller
subsets being loaded, developing ideas of how time depends on data size.

If it doesn't go well fairly quickly, I'll probably go back to flat files
and writing the queries the hard way.  At least I know what I'm dealing
with there.


Note that there is nothing about SQLite that isn't working extremely
well for your purpose - going back to flat files is akin to going back
to the dark ages. Databases of this size (and even much larger) work
perfectly well all around - some people here can quote you even more
staggering figures - but it does require some special processing
which might be thwarted by your current system layout or design choices.
However, if done right, it should not be significantly slower than
writing flat files - so persist until you find what "done right" entails.


Remember that once this has worked you can query the data with SQL...
that is an amazingly powerful feature over flat files, and once an index
exists (however long it took to build), queries will be lightning fast,
something a flat file can never give you.




QUESTIONS:
If it's best in general to build indexes as a separate step, does this also
apply to primary indexes?  Can a table without a primary index have one
added later?  Isn't ROWID the real primary, presuming it has ROWIDs?  And
if so, then is a primary index on a ROWID table just for compliance with
standard SQL, and really no better than any other index?  Obviously, I'm a
bit confused about this.


A primary index is nothing other than a normal Unique index which has a 
guaranteed 1-to-1 key-to-row ratio and has look-up priority status. All 
SQL tables MUST have such a key to be able to guarantee access to any 
individual row, and if you omit the primary key bit, then some internal 
mechanism is used for it - in SQLite's case, this mechanism is called 
the row_id. A nice recent feature of SQLite allows you to get rid of 
this row_id overhead by explicitly specifying a Primary Key and then 
WITHOUT ROWID after the Table declaration.


The advantage of NOT using an explicit Primary Key from the start is
that the row_id is simply incremented by the engine for each insert,
since it is auto-supplied and a plain integer. A declared primary key,
by contrast, has to determine the correct sort-order insert position
for every added row, which costs a good bit of time on really large
insert operations. The single bulk sort done while building the PK
index after the fact takes a lot less time than the repeated
look-up-and-insert operations for the key while making the table.
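
A minimal C sketch of that load-first, index-later pattern (table and column
names here are placeholders, not the OP's actual schema):

    /* Plain table with the implicit rowid, bulk insert, then one bulk sort
       to build the unique index at the end. */
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("example.db", &db) != SQLITE_OK) return 1;

        sqlite3_exec(db, "CREATE TABLE positions(pos BLOB);", NULL, NULL, NULL);
        /* ... bulk load happens here, wrapped in one big transaction ... */
        sqlite3_exec(db, "CREATE UNIQUE INDEX positions_pk ON positions(pos);",
                     NULL, NULL, NULL);

        sqlite3_close(db);
        return 0;
    }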


I hope this answers the question, but feel free to ask more if anything 
remains unclear.




While I'm at it, I may as well ask if ROWID has any physical significance,
such that a VACUUM operation might change it.  Or is it just an arbitrary
ID inserted by SQLite and added to each record when they exist at all.


It is arbitrary and supplied by the DB engine, and nothing will change it
behind your back during normal operation - that would go against all SQL and
RDBMS premises. (One caveat: a VACUUM can renumber rowids in a table that has
no explicit INTEGER PRIMARY KEY, so declare one if you depend on the values.)
You could supply it yourself too (and I often advocate this as the
better practice).


If you declare a table with a primary key specified /exactly/ like this:
CREATE TABLE myTable ("ID" INTEGER PRIMARY KEY, ... )
Then the "ID" in this case becomes an alias for the row_id and 
manipulating/reading the value in it is in fact reading / manipulating 
the actual row_id.


(The "ID" can be anything else you like, but the "INTEGER PRIMARY KEY" 
part needs to be exactly written like that).




The current dataset is intended to solve one particular issue in the
overall project.  It looks like I'd want to build each such dataset
separately, as there will likely be a few hundred, and I gather that adding
to these tables will be pretty slow once the indexes have been built.  Or
is it sensible to drop indexes, add data and rebuild?


No this is not true - it will be really fast to add another few rows... 
it's just slow when you add the initial few zillion rows due to simple 
laws of quantity. If however you will be adding rows at an amazing rate, 
I 

Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Dominique Devienne
On Thu, Aug 4, 2016 at 5:29 PM, Dominique Devienne 
wrote:

> [...] you also force SQLite's SQL parser to parse a huge amount of text.
> [...]
>

OK, maybe not the SQL parser; it depends on what you write out and the
.import mode (I guess, didn't look into the details). But for sure "some"
parser (CSV, SQL, or other) is involved. So I'm suggesting eliminating the
text "middle man" and using SQLite directly with native values (ints,
strings, blobs, whatever). Of course I assume your program is C/C++, which
given your sizes, it had better be :). --DD
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Dominique Devienne
On Thu, Aug 4, 2016 at 5:05 PM, Kevin O'Gorman 
wrote:

> 3. Positions are 64 bytes always, so your size guesses are right.  They are
> in no particular order.  I like the suggestion of a separate position
> table, because they're going to appear in multiple qmove records, with an
> average of about 3 or 4 appearances I think.  Maybe more.
>

3x or 4x duplication doesn't sound like a lot. What you'd gain in the moves
table,
you'd lose having to join to the positions table I suspect. Higher level
duplicates, maybe.

64-bytes always? Is that a human readable string, or some 'code' in
hexadecimal?
If the latter, use a blob, which requires only 32-bytes to store the same
info. You can
use the blob literal notation x'abcdef01' (that's a 4-bytes blob).

Finally, note that if your program writes a huge text file with all your
values, that you
.import into sqlite3 as you showed, you're IMHO wasting time, since you
can't use
prepared statements and binds, and you also force SQLite's SQL parser to
parse
a huge amount of text. By embedding SQLite into your generator program, you
remove all parsing except for a trivial "insert into qmoves values (:1, :2,
...)", and
all the rest is sqlite_bind*() and co. calls. (and if blob situation for
positions, then
you can bind the 32-bytes blob directly, no need to convert/parse to/from
hex).
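
A rough sketch of what that embedded approach could look like in C; the qmoves
layout (a 16-byte position blob plus an integer move) and the generator hook
are assumptions for illustration only:

    /* One prepared INSERT, values bound per row, all inside one transaction. */
    #include <sqlite3.h>
    #include <stdint.h>
    #include <string.h>

    /* Stand-in for the real generator: produces three dummy rows, then stops. */
    static int get_next_move(uint8_t blob[16], int *move)
    {
        static int n = 0;
        if (n >= 3) return 0;
        memset(blob, n, 16);
        *move = n;
        n++;
        return 1;
    }

    int main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *ins;
        uint8_t blob[16];
        int move;

        if (sqlite3_open("example.db", &db) != SQLITE_OK) return 1;
        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS qmoves(pos BLOB, move INT);",
                     NULL, NULL, NULL);

        if (sqlite3_prepare_v2(db, "INSERT INTO qmoves VALUES (?1, ?2)",
                               -1, &ins, NULL) != SQLITE_OK) return 1;

        sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
        while (get_next_move(blob, &move)) {
            sqlite3_bind_blob(ins, 1, blob, 16, SQLITE_TRANSIENT);  /* copies */
            sqlite3_bind_int(ins, 2, move);
            sqlite3_step(ins);          /* returns SQLITE_DONE on success */
            sqlite3_reset(ins);
        }
        sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);

        sqlite3_finalize(ins);
        sqlite3_close(db);
        return 0;
    }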

My $0.02. --DD
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Kevin O'Gorman
Lots of answers, so thanks all around.  Some more info:

1. All partitions have at least 3 GB free, and it's not changing.  /tmp is
3 TiB and empty.
2. I have a RAID partition, for size, but no RAID controller.  As a hobby
project, I don't have spare parts, and I fear the results of a failure of a
hardware RAID without a spare, so I use Linux mdadm to manage software RAID
across three 4-TB drives.
3. Positions are 64 bytes always, so your size guesses are right.  They are
in no particular order.  I like the suggestion of a separate position
table, because they're going to appear in multiple qmove records, with an
average of about 3 or 4 appearances I think.  Maybe more.

I'm going to retry, using some of the suggestions above: smaller
transactions, IGNORE, position table rowids in the moves table, smaller
subsets being loaded, developing ideas of how time depends on data size.

If it doesn't go well fairly quickly, I'll probably go back to flat files
and writing the queries the hard way.  At least I know what I'm dealing
with there.

QUESTIONS:
If it's best in general to build indexes as a separate step, does this also
apply to primary indexes?  Can a table without a primary index have one
added later?  Isn't ROWID the real primary, presuming it has ROWIDs?  And
if so, then is a primary index on a ROWID table just for compliance with
standard SQL, and really no better than any other index?  Obviously, I'm a
bit confused about this.

While I'm at it, I may as well ask if ROWID has any physical significance,
such that a VACUUM operation might change it.  Or is it just an arbitrary
ID inserted by SQLite and added to each record when they exist at all.

The current dataset is intended to solve one particular issue in the
overall project.  It looks like I'd want to build each such dataset
separately, as there will likely be a few hundred, and I gather that adding
to these tables will be pretty slow once the indexes have been built.  Or
is it sensible to drop indexes, add data and rebuild?

On Thu, Aug 4, 2016 at 7:27 AM, Jim Callahan  wrote:

> Temp Files
> Have you checked how much storage is available to the temporary file
> locations?
> The temporary file locations are different depending on the OS, build, VFS
> and PRAGMA settings.
> See the last section "5.0 Temporary File Storage Locations" of:
> https://www.sqlite.org/tempfiles.html
>
>
> The database was growing for about 1-1/2 days.  Then its journal
> > disappeared, the file size dropped to zero, but sqlite3 is still running
> > 100% CPU time, now for a total of 3800+ minutes (63+ hours).  The
> database
> > is still locked, but I have no idea what sqlite3 is doing, or if it will
> > ever stop.  All partitions still have lots of space left (most of this is
> > running in a RAID partition of 11 TiB).  Here's what I gave to sqlite3 on
> > my Linux system:
>
>
> You might have a huge storage allocation for the main file and log, but
> some other temp file might be being dumped
> to a more constrained storage location.
>
> RAM
> Since you are using RAID disk controller; I assume you have 64 bit CPU and
> more than 8 GB of RAM?
> If you have 8 GB or more of RAM would it help to use an in memory database?
>
> Transactions
> Are you using explicit or implicit transactions?
> https://www.sqlite.org/lang_transaction.html
>
>
> Steps
> Agree with Darren Duncan and Dr. Hipp you may want to have at least 3
> separate steps
> (each step should be a separate transaction):
>
> 1. Simple load
> 2. Create additional column
> 3. Create index
>
> Have you pre-defined the table you are loading data into? (step 0 CREATE
> TABLE)
>
> If "Step 1 Simple Load" does not complete; then may want to load a fixed
> number of rows into separate tables (per Darren Duncan)  and then combine
> using an APPEND
> or a UNION query (doing so before steps 2 and 3).
>
> HTH
>
> Jim Callahan
> Data Scientist
> Orlando, FL
>
>
>
>
> On Wed, Aug 3, 2016 at 11:00 PM, Kevin O'Gorman 
> wrote:
>
> > I'm working on a hobby project, but the data has gotten a bit out of
> hand.
> > I thought I'd put it in a real database rather than flat ASCII files.
> >
> > I've got a problem set of about 1 billion game positions and 187GB to
> work
> > on (no, I won't have to solve them all) that took about 4 hours for a
> > generator program just to write.  I wrote code to turn them into
> something
> > SQLite could import.  Actually, it's import, build a non-primary index,
> and
> > alter table to add a column, all in sqlite3.
> >
> > The database was growing for about 1-1/2 days.  Then its journal
> > disappeared, the file size dropped to zero, but sqlite3 is still running
> > 100% CPU time, now for a total of 3800+ minutes (63+ hours).  The
> database
> > is still locked, but I have no idea what sqlite3 is doing, or if it will
> > ever stop.  All partitions still have lots of space left (most of this is
> > running in a RAID partition of 11 TiB).  

Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Jim Callahan
Temp Files
Have you checked how much storage is available to the temporary file
locations?
The temporary file locations are different depending on the OS, build, VFS
and PRAGMA settings.
See the last section "5.0 Temporary File Storage Locations" of:
https://www.sqlite.org/tempfiles.html
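
For example, a small sketch (assuming Linux; the directory path is purely
illustrative) of pointing SQLite's temporary files at a roomy filesystem before
the database is opened:

#include <sqlite3.h>
#include <stdlib.h>

int main(void) {
    /* Must be set before the connection creates any temporary files. */
    setenv("SQLITE_TMPDIR", "/raid/sqlite-tmp", 1);   /* hypothetical path */

    sqlite3 *db;
    if (sqlite3_open("qubic.db", &db) != SQLITE_OK) return 1;

    /* 0 = default, 1 = file, 2 = memory; large builds generally want file-backed temps. */
    sqlite3_exec(db, "PRAGMA temp_store = 1;", 0, 0, 0);

    sqlite3_close(db);
    return 0;
}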


The database was growing for about 1-1/2 days.  Then its journal
> disappeared, the file size dropped to zero, but sqlite3 is still running
> 100% CPU time, now for a total of 3800+ minutes (63+ hours).  The database
> is still locked, but I have no idea what sqlite3 is doing, or if it will
> ever stop.  All partitions still have lots of space left (most of this is
> running in a RAID partition of 11 TiB).  Here's what I gave to sqlite3 on
> my Linux system:


You might have a huge storage allocation for the main file and log, but
some other temp file might be getting dumped to a more constrained storage
location.

RAM
Since you are using a RAID disk controller, I assume you have a 64-bit CPU and
more than 8 GB of RAM?
If you have 8 GB or more of RAM, would it help to use an in-memory database?

Transactions
Are you using explicit or implicit transactions?
https://www.sqlite.org/lang_transaction.html


Steps
Agree with Darren Duncan and Dr. Hipp that you may want to have at least 3
separate steps (each step should be a separate transaction):

1. Simple load
2. Create additional column
3. Create index

Have you pre-defined the table you are loading data into? (step 0 CREATE
TABLE)

If "Step 1 Simple Load" does not complete; then may want to load a fixed
number of rows into separate tables (per Darren Duncan)  and then combine
using an APPEND
or a UNION query (doing so before steps 2 and 3).
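
As a rough C sketch of those separate steps, each in its own transaction,
against the schema posted in this thread (the bulk load itself and error
handling are elided):

#include <sqlite3.h>

static int run(sqlite3 *db, const char *sql) {
    return sqlite3_exec(db, sql, 0, 0, 0);   /* no callback; errors ignored in this sketch */
}

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("qubic.db", &db) != SQLITE_OK) return 1;

    /* Step 0/1: create the table and bulk-load it, in one transaction. */
    run(db, "BEGIN");
    run(db, "CREATE TABLE IF NOT EXISTS qmoves ("
            "  qfrom CHAR(64), qmove INT, qto CHAR(64), qweight INT,"
            "  PRIMARY KEY (qfrom, qmove) ON CONFLICT ROLLBACK)");
    /* ... bulk INSERTs (or the shell's .import) happen here ... */
    run(db, "COMMIT");

    /* Step 2: add the extra column in its own transaction. */
    run(db, "BEGIN");
    run(db, "ALTER TABLE qmoves ADD COLUMN qstatus INT DEFAULT NULL");
    run(db, "COMMIT");

    /* Step 3: build the secondary index only after the data is loaded. */
    run(db, "BEGIN");
    run(db, "CREATE INDEX IF NOT EXISTS qmoves_by_dest ON qmoves (qto, qweight)");
    run(db, "COMMIT");

    sqlite3_close(db);
    return 0;
}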

HTH

Jim Callahan
Data Scientist
Orlando, FL




On Wed, Aug 3, 2016 at 11:00 PM, Kevin O'Gorman 
wrote:

> I'm working on a hobby project, but the data has gotten a bit out of hand.
> I thought I'd put it in a real database rather than flat ASCII files.
>
> I've got a problem set of about 1 billion game positions and 187GB to work
> on (no, I won't have to solve them all) that took about 4 hours for a
> generator program just to write.  I wrote code to turn them into something
> SQLite could import.  Actually, it's import, build a non-primary index, and
> alter table to add a column, all in sqlite3.
>
> The database was growing for about 1-1/2 days.  Then its journal
> disappeared, the file size dropped to zero, but sqlite3 is still running
> 100% CPU time, now for a total of 3800+ minutes (63+ hours).  The database
> is still locked, but I have no idea what sqlite3 is doing, or if it will
> ever stop.  All partitions still have lots of space left (most of this is
> running in a RAID partition of 11 TiB).  Here's what I gave to sqlite3 on
> my Linux system:
>
> time sqlite3 qubic.db <<EOF
> BEGIN EXCLUSIVE TRANSACTION;
> DROP TABLE IF EXISTS qmoves;
> CREATE TABLE qmoves (
>   qfrom CHAR(64),
>   qmove INT,
>   qto   CHAR(64),
>   qweight INT,
>   PRIMARY KEY (qfrom, qmove) ON CONFLICT ROLLBACK
> );
> CREATE INDEX IF NOT EXISTS qmoves_by_dest ON qmoves (
>   qto,
>   qweight
> );
> CREATE TABLE IF NOT EXISTS qposn (
>   qposn CHAR(64) PRIMARY KEY ON CONFLICT ROLLBACK,
>   qmaxval INT,
>   qmove INT,
>   qminval INT,
>   qstatus INT
> );
> .separator " "
> .import am.all qmoves
> ALTER TABLE qmoves ADD COLUMN qstatus INT DEFAULT NULL;
> .schema
> COMMIT TRANSACTION;
>
> EOF
>
> Any clues, hints, or advice?
>
>
> --
> #define QUESTION ((bb) || (!bb)) /* Shakespeare */
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Richard Hipp
On 8/4/16, Wade, William  wrote:
>
> I believe that with SQLite, if you don't specify WITHOUT ROWID your "real"
> record order is based on rowid,

Correct

>
> In principle, indices can be created by writing the needed information
> (index key, record position) in the original order, and then sorting that
> into key-order. That can be done with many less random seeks (merge sorts
> involve mostly sequential reads and writes). I don't know if, or when,
> SQLite does that.
>

SQLite runs CREATE INDEX commands using an external merge sort, which
is what I think you are alluding to above.

But if the index already exists, and you are merely inserting new rows
into the table, then each index entry is inserted separately.  Each
such insert is an O(logN) operation.  Such isolated inserts typically
involve a lot of disk seeks and write amplification.  That is why we
recommend that you populate large tables first and then run CREATE
INDEX, rather than the other way around, whenever practical.

-- 
D. Richard Hipp
d...@sqlite.org
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Wade, William
A lot of speculation here. I am certainly no SQLite expert.

Your input has 1G positions, taking 187 GB, so averaging 187 bytes/position. From
your CREATE TABLE, it looks like to get that size most of your qfrom and qto 
are fairly long strings. I'm assuming there are a great many duplications in 
those positions. If so, put them into a separate position table { positionId 
INT, positionName TEXT }, with positionId as the primary key and positionName 
also being unique. This will be even more useful if you have a fixed set of 
possible positions, and you make it so that positionId is increasing whenever 
positionName is increasing. In your qmoves table, store positionId values, 
rather than positionName values. Saves a lot of space because no name is in the
database more than once, and most of your space is related to names.
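
A rough C sketch of that layout (names follow the suggestion above; error
checks omitted), inserting each position once and storing only its integer id
in qmoves:

#include <sqlite3.h>
#include <stdio.h>

int main(void) {
    sqlite3 *db;
    sqlite3_stmt *addpos, *getpos;
    if (sqlite3_open("qubic.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS positions ("
        "  positionId   INTEGER PRIMARY KEY,"        /* alias for the rowid */
        "  positionName TEXT UNIQUE NOT NULL);"
        "CREATE TABLE IF NOT EXISTS qmoves ("
        "  qfrom INT, qmove INT, qto INT, qweight INT,"
        "  PRIMARY KEY (qfrom, qmove));",
        0, 0, 0);

    /* Insert the position if it is new, then look up its id for use in qmoves. */
    sqlite3_prepare_v2(db,
        "INSERT OR IGNORE INTO positions(positionName) VALUES (?1)", -1, &addpos, 0);
    sqlite3_prepare_v2(db,
        "SELECT positionId FROM positions WHERE positionName = ?1", -1, &getpos, 0);

    const char *name = "x...o...";                   /* hypothetical position string */
    sqlite3_bind_text(addpos, 1, name, -1, SQLITE_STATIC);
    sqlite3_step(addpos);
    sqlite3_reset(addpos);

    sqlite3_bind_text(getpos, 1, name, -1, SQLITE_STATIC);
    if (sqlite3_step(getpos) == SQLITE_ROW)
        printf("positionId = %lld\n", (long long)sqlite3_column_int64(getpos, 0));
    sqlite3_reset(getpos);

    sqlite3_finalize(addpos);
    sqlite3_finalize(getpos);
    sqlite3_close(db);
    return 0;
}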

Space is important, because at a few hundred GB, your database is probably
bigger than all of your available fast caches, and you are probably storing
your data on rotating storage. Writes to random positions might easily average
10 ms, and the import of most of your records may involve one or more such
writes by the time indices are updated. Reducing sizes means fewer such writes,
because things are more likely to fit in the various caches.

I believe that with SQLite, if you don't specify WITHOUT ROWID your "real" 
record order is based on rowid, so complete records are, perhaps, stored in 
import order. The PRIMARY KEY you specified is large (a name and an integer), 
so the index entries are likely about half as large as a complete record, and 
they don't fit in cache either. It is also likely that your input data was not 
in sorted order. That means that adding an entry to that index likely involves 
a write to a random position. 10 ms * 1e9 writes = 1e7 s, or roughly three to four months.

Your qmoves_by_dest index is about as large as your primary (it also has a 
position and an integer), so similar timings might apply.

I suggest timing your script on smaller inputs (1 GB, 10 GB, 20 GB) and
seeing if there is a size where things suddenly get worse (available caches are 
not big enough). See if my suggestions help those timings. See if WITHOUT ROWID 
helps those timings.

In principle, indices can be created by writing the needed information (index 
key, record position) in the original order, and then sorting that into 
key order. That can be done with many fewer random seeks (merge sorts involve
mostly sequential reads and writes). I don't know if, or when, SQLite does that.

Regards,
Bill

-Original Message-
From: Kevin O'Gorman [mailto:kevinogorm...@gmail.com]
Sent: Wednesday, August 03, 2016 10:00 PM
To: sqlite-users
Subject: [sqlite] newbie has waited days for a DB build to complete. what's up 
with this.

I'm working on a hobby project, but the data has gotten a bit out of hand.
I thought I'd put it in a real database rather than flat ASCII files.

I've got a problem set of about 1 billion game positions and 187GB to work on 
(no, I won't have to solve them all) that took about 4 hours for a generator 
program just to write.  I wrote code to turn them into something SQLite could 
import.  Actually, it's import, build a non-primary index, and alter table to 
add a column, all in sqlite3.

The database was growing for about 1-1/2 days.  Then its journal disappeared, 
the file size dropped to zero, but sqlite3 is still running 100% CPU time, now 
for a total of 3800+ minutes (63+ hours).  The database is still locked, but I 
have no idea what sqlite3 is doing, or if it will ever stop.  All partitions 
still have lots of space left (most of this is running in a RAID partition of 
11 TiB).  Here's what I gave to sqlite3 on my Linux system:

time sqlite3 qubic.db <<EOF
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Teg
Hello Kevin,

I'd write a utility to do it instead of using the command
line tool, then add logging to the program in order to note
progress.

I like the idea of chopping it into smaller parts too.

"ON CONFLICT ROLLBACK"

You're doing one large transaction, and if it rolls back
it'll have to undo everything, right? I wonder if that's what
you're seeing here. You might want to try "IGNORE" to see if
you can even run through all the data.
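
A tiny sketch of that change against the schema posted earlier in the thread
(illustrative only):

#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    if (sqlite3_open("qubic.db", &db) != SQLITE_OK) return 1;

    /* Duplicate (qfrom, qmove) rows are silently skipped instead of
       rolling back the whole bulk-load transaction. */
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS qmoves ("
        "  qfrom CHAR(64), qmove INT, qto CHAR(64), qweight INT,"
        "  PRIMARY KEY (qfrom, qmove) ON CONFLICT IGNORE);",
        0, 0, 0);

    /* Or keep the original schema and use INSERT OR IGNORE per row instead:
       INSERT OR IGNORE INTO qmoves VALUES (?1, ?2, ?3, ?4);  */

    sqlite3_close(db);
    return 0;
}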

Writing a utility for this would let you manage the
rollbacks too.

C


Wednesday, August 3, 2016, 11:00:12 PM, you wrote:

KOG> I'm working on a hobby project, but the data has gotten a bit out of hand.
KOG> I thought I'd put it in a real database rather than flat ASCII files.

KOG> I've got a problem set of about 1 billion game
KOG> positions and 187GB to work
KOG> on (no, I won't have to solve them all) that took about 4 hours for a
KOG> generator program just to write.  I wrote code to turn them into something
KOG> SQLite could import.  Actually, it's import, build a non-primary index, and
KOG> alter table to add a column, all in sqlite3.

KOG> The database was growing for about 1-1/2 days.  Then its journal
KOG> disappeared, the file size dropped to zero, but sqlite3 is still running
KOG> 100% CPU time, now for a total of 3800+ minutes (63+ hours).  The database
KOG> is still locked, but I have no idea what sqlite3 is doing, or if it will
KOG> ever stop.  All partitions still have lots of space left (most of this is
KOG> running in a RAID partition of 11 TiB).  Here's what I gave to sqlite3 on
KOG> my Linux system:

KOG> time sqlite3 qubic.db <<EOF
KOG> BEGIN EXCLUSIVE TRANSACTION;
KOG> DROP TABLE IF EXISTS qmoves;
KOG> CREATE TABLE qmoves (
KOG>   qfrom CHAR(64),
KOG>   qmove INT,
KOG>   qto   CHAR(64),
KOG>   qweight INT,
KOG>   PRIMARY KEY (qfrom, qmove) ON CONFLICT ROLLBACK
KOG> );
KOG> CREATE INDEX IF NOT EXISTS qmoves_by_dest ON qmoves (
KOG>   qto,
KOG>   qweight
KOG> );
KOG> CREATE TABLE IF NOT EXISTS qposn (
KOG>   qposn CHAR(64) PRIMARY KEY ON CONFLICT ROLLBACK,
KOG>   qmaxval INT,
KOG>   qmove INT,
KOG>   qminval INT,
KOG>   qstatus INT
KOG> );
KOG> .separator " "
KOG> .import am.all qmoves
KOG> ALTER TABLE qmoves ADD COLUMN qstatus INT DEFAULT NULL;
KOG> .schema
KOG> COMMIT TRANSACTION;

KOG> EOF

KOG> Any clues, hints, or advice?





-- 
 Teg  mailto:t...@djii.com

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-04 Thread Simon Slavin

On 4 Aug 2016, at 4:00am, Kevin O'Gorman  wrote:

> I've got a problem set of about 1 billion game positions and 187GB to work
> on (no, I won't have to solve them all) that took about 4 hours for a
> generator program just to write.  I wrote code to turn them into something
> SQLite could import.  Actually, it's import, build a non-primary index, and
> alter table to add a column, all in sqlite3.

Hmm.  The closest I have to that is a 43 Gigabyte sqlite database which works 
fine.  In that case almost all the space is taken up by one thin table which 
has a ridiculous number of rows.

> The database was growing for about 1-1/2 days.  Then its journal
> disappeared, the file size dropped to zero, but sqlite3 is still running

I assume that those are what 'ls' is showing you.  Have you tried using 'lsof' 
or 'strace'?  Or monitoring the amount of free space on the disk to see if 
it's shrinking?

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-03 Thread Darren Duncan
One way to get a clue is to try doing this in stages.  First start over and 
import a much smaller amount of data, say just a 1 GB fraction, see if that 
completes, and if it does, how long it takes and other factors like disk and 
memory etc.  If 1 GB doesn't work, start smaller yet, until you have a size that 
does work.  When something works, next try something double the size and see if 
the resource usage is about linear or not.  And double it again, etc., and see if 
you can get time/space progression figures.  That should help you predict how 
long the full 187 GB would take were it successful.  Or otherwise at some point 
you should see the smallest point where the hang occurs. -- Darren Duncan


On 2016-08-03 8:00 PM, Kevin O'Gorman wrote:

I'm working on a hobby project, but the data has gotten a bit out of hand.
I thought I'd put it in a real database rather than flat ASCII files.

I've got a problem set of about 1 billion game positions and 187GB to work
on (no, I won't have to solve them all) that took about 4 hours for a
generator program just to write.  I wrote code to turn them into something
SQLite could import.  Actually, it's import, build a non-primary index, and
alter table to add a column, all in sqlite3.

The database was growing for about 1-1/2 days.  Then its journal
disappeared, the file size dropped to zero, but sqlite3 is still running
100% CPU time, now for a total of 3800+ minutes (63+ hours).  The database
is still locked, but I have no idea what sqlite3 is doing, or if it will
ever stop.  All partitions still have lots of space left (most of this is
running in a RAID partition of 11 TiB).  Here's what I gave to sqlite3 on
my Linux system:


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] newbie has waited days for a DB build to complete. what's up with this.

2016-08-03 Thread Kevin O'Gorman
I'm working on a hobby project, but the data has gotten a bit out of hand.
I thought I'd put it in a real database rather than flat ASCII files.

I've got a problem set of about 1 billion game positions and 187GB to work
on (no, I won't have to solve them all) that took about 4 hours for a
generator program just to write.  I wrote code to turn them into something
SQLite could import.  Actually, it's import, build a non-primary index, and
alter table to add a column, all in sqlite3.

The database was growing for about 1-1/2 days.  Then its journal
disappeared, the file size dropped to zero, but sqlite3 is still running
100% CPU time, now for a total of 3800+ minutes (63+ hours).  The database
is still locked, but I have no idea what sqlite3 is doing, or if it will
ever stop.  All partitions still have lots of space left (most of this is
running in a RAID partition of 11 TiB).  Here's what I gave to sqlite3 on
my Linux system:

time sqlite3 qubic.db