Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On Sat, Aug 6, 2016 at 2:49 PM, Kevin O'Gorman wrote:

> On Sat, Aug 6, 2016 at 2:09 AM, Dan Kennedy wrote:
>
>> What does [top] tell you once the process becomes wedged? What percentage
>> is the CPU running at? Or is it completely bogged down waiting for IO?
>
> I'm waiting for a good time to get the answer to this. It takes a good
> long while to get to wedged, so I'll probably do it overnight tonight.

RETRACTION: it doesn't get wedged after all; it just takes about 4 times
longer than I expected. On small inputs, it builds the indexes faster than
the data table. When I test on the whole data file, it takes 4 times longer
to build each index than it took to make the table. I guess that's what it
looks like when building the table is O(n) complexity (linear) while the
index is O(n * log(n)), and log(n) is getting bigger.

Sorry for the kerfuffle. This is my first time working with datasets this
big. Since it takes almost 6 hours to build each index (there are 2 at this
point), and I didn't know to expect that, I drew the wrong conclusion. My
bad.

I'm just glad Mr. Hipp got into this thread and advised building the
indexes after the data is loaded. It would have been _lots_ slower to
maintain each index during the inserts; in fact my first attempt -- the one
that gave me the initial idea that this had wedged -- took that approach,
and I decided it was wedged after a few _days_. I suspect it wasn't wedged
either, but it might as well have been.

So far, this has been a feasibility exercise. Building a 500 GB database
overnight is definitely feasible. Now I can go on to see if it helps me
solve my problem.

-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
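[Editor's note: a minimal sketch of the load-then-index pattern discussed above, using Python's stdlib sqlite3 bindings; the table and column names are made up for illustration and are not from the thread.]

```python
import os
import sqlite3
import tempfile

# Bulk-load first in one big transaction, then CREATE INDEX afterwards.
# Building the index after the load is one O(n log n) external sort,
# instead of n individual B-tree insertions interleaved with the inserts.
db = os.path.join(tempfile.mkdtemp(), "positions.db")
con = sqlite3.connect(db)
con.execute("CREATE TABLE pos (board TEXT, score INTEGER)")

with con:  # one transaction for the whole load
    con.executemany(
        "INSERT INTO pos VALUES (?, ?)",
        ((f"board-{i}", i % 3) for i in range(100_000)),
    )

# Index built once, after the data is in place.
con.execute("CREATE INDEX pos_board ON pos(board)")

print(con.execute("SELECT count(*) FROM pos").fetchone()[0])  # -> 100000
```

At thread scale (1.5 billion rows) the same shape applies, just with the sqlite3 shell and `.import` instead of executemany.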
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On Sat, Aug 6, 2016 at 2:09 AM, Dan Kennedy wrote:

> On 08/06/2016 09:52 AM, Kevin O'Gorman wrote:
>
>> I'm wrong. It's using anonymous files, all right, but only one or two at
>> a time. I assume they're big. I'm in the process of bracketing where size
>> begins to matter. So far, 1/10 of the data loads and indexes just fine,
>> albeit somewhat more slowly than the smaller samples predicted. The
>> database load took 6.5 minutes, the troublesome index 10 minutes. At
>> smaller sizes, indexing is faster than the database load.
>>
>> I'm trying 1/3 now (500 million lines)
>
> What does [top] tell you once the process becomes wedged? What percentage
> is the CPU running at? Or is it completely bogged down waiting for IO?
>
> Dan

I'm waiting for a good time to get the answer to this. It takes a good
long while to get to wedged, so I'll probably do it overnight tonight.
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On 08/06/2016 09:52 AM, Kevin O'Gorman wrote:

> Excellent idea. I did not know about that possibility. And sure enough,
> I'm wrong. It's using anonymous files, all right, but only one or two at
> a time. I assume they're big. I'm in the process of bracketing where size
> begins to matter. So far, 1/10 of the data loads and indexes just fine,
> albeit somewhat more slowly than the smaller samples predicted. The
> database load took 6.5 minutes, the troublesome index 10 minutes. At
> smaller sizes, indexing is faster than the database load.
>
> I'm trying 1/3 now (500 million lines)

What does [top] tell you once the process becomes wedged? What percentage
is the CPU running at? Or is it completely bogged down waiting for IO?

Dan.
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On Fri, Aug 5, 2016 at 2:03 PM, Dan Kennedy wrote:

> On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:
>
>> My best bet is the contrary: it starts with small files and makes
>> increasingly larger ones, like the sort utility does. The problem is that
>> there are too many of them at the beginning for it to work with anonymous
>> files (which sort does not use). This at least offers a possible
>> explanation of its getting wedged on large indexes: an unexpected and
>> untested error, handled poorly.
>
> You could verify this by checking the number of open handles in
> "/proc//fd" after your process is wedged.

Excellent idea. I did not know about that possibility. And sure enough,
I'm wrong. It's using anonymous files, all right, but only one or two at
a time. I assume they're big. I'm in the process of bracketing where size
begins to matter. So far, 1/10 of the data loads and indexes just fine,
albeit somewhat more slowly than the smaller samples predicted. The
database load took 6.5 minutes, the troublesome index 10 minutes. At
smaller sizes, indexing is faster than the database load.

I'm trying 1/3 now (500 million lines)
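[Editor's note: Dan's /proc check can be scripted; a minimal Linux-only sketch, inspecting the current process for illustration -- for a wedged index build you would pass the sqlite3 process's pid instead of "self".]

```python
import os

def open_fd_count(pid="self"):
    """Return the number of open file descriptors of a process by listing
    /proc/<pid>/fd (Linux only); None if the directory cannot be read."""
    fd_dir = f"/proc/{pid}/fd"
    try:
        return len(os.listdir(fd_dir))
    except OSError:  # not Linux, no such pid, or no permission
        return None

# Each entry in /proc/<pid>/fd is one open descriptor, including
# anonymous (already-unlinked) temp files that never show up on disk.
print(open_fd_count())
```

The same check from the shell is simply `ls /proc/<pid>/fd | wc -l`.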
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On Fri, Aug 5, 2016 at 3:03 PM, Darren Duncan wrote:

> To be clear, my proposal of only loading a subset of rows was just
> intended for debugging the performance issues the OP was having, to try
> to figure out what is causing the problem without having to wait too long
> for larger sets to complete. I wasn't proposing splitting the load into
> separate databases and unioning later; rather, the attempt database would
> be thrown away after each trial. -- Darren Duncan

Thanks, I understood that, and it was a good set of suggestions. Using
separate transactions led to focusing on the CREATE INDEX part of the
process. That's where it gets wedged.

Working up to the full database by first processing smaller extracts
convinced me that size matters. Smaller (even millions of rows) extracts
do not wedge.
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On 2016-08-04 7:27 AM, Jim Callahan wrote:

> Steps
>
> Agree with Darren Duncan and Dr. Hipp: you may want to have at least 3
> separate steps (each step should be a separate transaction):
>
> 1. Simple load
> 2. Create additional column
> 3. Create index
>
> Have you pre-defined the table you are loading data into? (step 0: CREATE
> TABLE)
>
> If "Step 1 Simple Load" does not complete, then you may want to load a
> fixed number of rows into separate tables (per Darren Duncan) and then
> combine them using an APPEND or a UNION query (doing so before steps 2
> and 3).

To be clear, my proposal of only loading a subset of rows was just
intended for debugging the performance issues the OP was having, to try
to figure out what is causing the problem without having to wait too long
for larger sets to complete. I wasn't proposing splitting the load into
separate databases and unioning later; rather, the attempt database would
be thrown away after each trial. -- Darren Duncan
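[Editor's note: Jim's steps 0-3 as one runnable sketch, each step its own transaction so a failure or a wedge can be pinned to a single step; the table and column names are hypothetical, and the row count is tiny compared to the thread's dataset.]

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")

con.execute("CREATE TABLE game (pos TEXT)")            # step 0: CREATE TABLE

with con:                                              # step 1: simple load
    con.executemany("INSERT INTO game VALUES (?)",
                    ((f"p{i}",) for i in range(10_000)))

with con:                                              # step 2: extra column
    con.execute("ALTER TABLE game ADD COLUMN score INTEGER")
    con.execute("UPDATE game SET score = length(pos)")

t0 = time.time()
con.execute("CREATE INDEX game_pos ON game(pos)")      # step 3: index
print(f"index built in {time.time() - t0:.2f}s")
```

Running the same script with the row count parameterized gives Darren's throwaway-database trials: time each step at 1/10, 1/3, and full size and see which step stops scaling.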
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:

> On Fri, Aug 5, 2016 at 1:08 PM, David Raymond <david.raym...@tomtom.com>
> wrote:
>
>> There's a deprecated pragma, PRAGMA temp_store_directory =
>> 'directory-name'; that appears to still work OK for now at least.
>> http://www.sqlite.org/pragma.html#pragma_temp_store_directory
>>
>> I've used it for the same reasons you've both mentioned about space,
>> though for me it was vacuuming a huge db, and the full-size db copy that
>> makes, which ate up my entire C drive. But with that pragma you can
>> specify exactly which folder you want to use for your temp files. I'd
>> suggest trying that, then monitoring the folder you give it to see what
>> shows up.
>
> Since it's deprecated, I'll stick with TMPDIR, because it's pretty
> standard on Linux (it also works with the sort utility, for instance).
>
>> Apart from the default location of the files, it reads like your next
>> main concern is how many temp files get opened up. My bet is that it'll
>> be a very small number, just potentially huge in file size while it's
>> doing its thing. But again, try that pragma and take a look.
>
> My best bet is the contrary: it starts with small files and makes
> increasingly larger ones, like the sort utility does. The problem is that
> there are too many of them at the beginning for it to work with anonymous
> files (which sort does not use). This at least offers a possible
> explanation of its getting wedged on large indexes: an unexpected and
> untested error, handled poorly.

You could verify this by checking the number of open handles in
"/proc//fd" after your process is wedged.
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
There's a deprecated pragma, PRAGMA temp_store_directory =
'directory-name'; that appears to still work OK for now at least.
http://www.sqlite.org/pragma.html#pragma_temp_store_directory

I've used it for the same reasons you've both mentioned about space, though
for me it was vacuuming a huge db, and the full-size db copy that makes,
which ate up my entire C drive. But with that pragma you can specify
exactly which folder you want to use for your temp files. I'd suggest
trying that, then monitoring the folder you give it to see what shows up.

Apart from the default location of the files, it reads like your next main
concern is how many temp files get opened up. My bet is that it'll be a
very small number, just potentially huge in file size while it's doing its
thing. But again, try that pragma and take a look.
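[Editor's note: a sketch of steering SQLite's scratch space to a roomy partition, as discussed above. On Unix SQLite consults SQLITE_TMPDIR and then TMPDIR when it needs temp files, so the variable must be set before the library first creates one; the directory name here is made up. From a shell workflow like Kevin's, `export TMPDIR=/big/partition` before invoking sqlite3 is the equivalent.]

```python
import os
import sqlite3
import tempfile

# Point SQLite's temp files at a directory with plenty of free space.
big_tmp = tempfile.mkdtemp(prefix="sqlite_scratch_")
os.environ["SQLITE_TMPDIR"] = big_tmp   # checked before TMPDIR on Unix

con = sqlite3.connect(":memory:")
try:
    # The deprecated pragma David mentions; still honored where compiled in.
    con.execute(f"PRAGMA temp_store_directory = '{big_tmp}'")
except sqlite3.OperationalError:
    pass  # some builds omit it; the environment variables above still apply

print(con.execute("SELECT 1").fetchone()[0])  # -> 1
```

Note Kevin's later observation still holds either way: the files are created there but immediately unlinked, so the space is consumed without any visible directory entries.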
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On Fri, Aug 5, 2016 at 12:30 PM, Igor Korot wrote:

> Hi, Kevin,
>
>> 1. The original problem was compromised by malformed input. However, it
>> appears that did not cause the wedging of the process.
>
> Where are the data will come?
> From the user? Internet?
>
> What I'm getting at is - you need to look for some malformed data in
> the future as well.

I generate it. I goofed, and I'll try not to goof in the future.

> The failure you saw - is it on the table with the complete data set?
> Or you got it during the experimenting?

Only on the complete data set.

>> So the questions are: Where do bug reports go? I seem to be running
>> 3.8.2; is this fixed in any later version?
>
> You can try the "3.14" pre-released one right now. ;-)

Meh. I submitted a bug report to this list. I'll see what happens.
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
Hi, Kevin, On Fri, Aug 5, 2016 at 3:18 PM, Kevin O'Gormanwrote: > Okay, I followed some of the advice y'all gave and got some results. > > 1. The original problem was compromised by malformed input. However, it > appears that did not cause the wedging of the process. See (3) below. Where are the data will come? From the user? Internet? What I'm getting at is - you need to look for some malformed data in the future as well. > > 2. I separated the steps, and started small. Time increased slightly > sub-linearly with dataset size, so I jumped to doing the whole thing. With > proper input, the data was loaded in 68 minutes. > > 3. The CREATE INDEX steps failed quickly (2 minutes), reporting "database > or disk is full" which seemed odd since most of my partitions have much > more free space than the entire database. It turns out that whatever does > the creation was using space on my root partition (this is Linux, so that > means "/"). That's the only partition in my setup without a huge amount of > free space. On would expect temporary stuff to go to /tmp (which has 3TB > free), but it doesn't go there. I would go there if the system's native > "sort" program were used. Fortunately, it turns out that the TMPDIR > environment variable is honored, but while I could see space was being > used, there were no files visible. I take that to mean that the tmpfile() > function (or equivalent) was used. This could be a bad idea for large > indexes because anonymous files have to be kept open, and there's a limit > on the number of files that can be open at a time, around 1,000. Sure > enough, the index creation appears to be wedged like the original run, and > after a few hours I killed it manually. This is a deal-killer. The failure you saw - is it on the table with the complete data set? Or you got it during the experimenting? > > So the questions are: Where do bug reports go? I seem to be running 3.8.2; > is this fixed in any later version? 
You can try the "3.14" pre-released one right now. ;-) Thank you. > > > On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman > wrote: > >> The metric for feasability is coding ease, not runtime. I'm the >> bottleneck, not the machine, at least at this point. >> >> As for adding rows, it will be about like this time: a billion or so at a >> time. But there's no need to save the old data. Each round can be >> separate except for a persistent "solutions" table of much more modest >> size. I've been doing this for a while now, and the solutions file has >> only 10 million or so lines, each representing a game position for which >> optimum moves are known. Getting this file to include the starting >> position is the point of the exercise. >> >> If I ever get to anything like "production" in this project, I expect it >> to run for maybe three years... That's after I tweak it for speed. >> >> Background: in production, this will be running on a dual-Xeon with 16 >> cores (32 hyperthreads) and 1/4 TiB RAM. It has sequential file update >> through Linux flock() calls at the moment. The code is bash gluing >> together a collection of UNIX utilities and some custom C code. The C is >> kept as simple as possible, to minimize errors. >> >> As you may surmise, this "hobby" is important to me. >> >> >> On Thu, Aug 4, 2016 at 9:09 AM, R Smith wrote: >> >>> >>> >>> On 2016/08/04 5:56 PM, Kevin O'Gorman wrote: >>> On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne wrote: It's even less dense than that. Each character has only 3 possible values, and thus it's pretty easy to compress down to 2 bits each, for a 16 byte blob. It's just hard to do that without a bunch of SQLite code I'd have to learn how to write. The current effort amounts to a feasibility study, and I want to keep it as simple as possible. >>> >>> A feasibility study using equipment that are hamstrung by weights they >>> won't have in the real situation is not an accurate study. 
>>> It's like studying fuel consumption on a different kind of road surface, but for the test purposes, the cars had to tow caravans containing their testing equipment - the study will not look feasible at all.
>>>
>>> It might of course be that the feasibility you are studying is completely unrelated to the data handling - in which case the point is moot.
>>>
>>> Let us know how it goes :)
>>> Ryan
>>>
>>> ___
>>> sqlite-users mailing list
>>> sqlite-users@mailinglists.sqlite.org
>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
>> --
>> #define QUESTION ((bb) || (!bb)) /* Shakespeare */
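For readers following the temp-space discussion in this thread: the scratch location SQLite uses for big sorts can be redirected before building a large index. A minimal sketch using Python's sqlite3 module (the tiny table is purely illustrative, and whether SQLITE_TMPDIR takes effect can depend on platform and build; it must be set before the library first needs temp space):

```python
import os
import sqlite3
import tempfile

# On Unix, SQLite consults SQLITE_TMPDIR (then TMPDIR) when it needs
# scratch space, e.g. for the external sort behind CREATE INDEX.
os.environ["SQLITE_TMPDIR"] = tempfile.gettempdir()  # point at a partition with room

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INT)")
con.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(1000)))
con.execute("CREATE INDEX t_a ON t(a)")  # a big sort would spill to SQLITE_TMPDIR
print(con.execute("SELECT count(*) FROM t WHERE a < 10").fetchone()[0])
```

A `PRAGMA temp_store_directory` (deprecated but present in 3.8-era builds) is the other knob the thread's participants could have reached for.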
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
Okay, I followed some of the advice y'all gave and got some results.

1. The original problem was compromised by malformed input. However, it appears that did not cause the wedging of the process. See (3) below.

2. I separated the steps, and started small. Time increased slightly sub-linearly with dataset size, so I jumped to doing the whole thing. With proper input, the data was loaded in 68 minutes.

3. The CREATE INDEX steps failed quickly (2 minutes), reporting "database or disk is full" which seemed odd since most of my partitions have much more free space than the entire database. It turns out that whatever does the creation was using space on my root partition (this is Linux, so that means "/"). That's the only partition in my setup without a huge amount of free space. One would expect temporary stuff to go to /tmp (which has 3TB free), but it doesn't go there. It would go there if the system's native "sort" program were used. Fortunately, it turns out that the TMPDIR environment variable is honored, but while I could see space was being used, there were no files visible. I take that to mean that the tmpfile() function (or equivalent) was used. This could be a bad idea for large indexes because anonymous files have to be kept open, and there's a limit on the number of files that can be open at a time, around 1,000. Sure enough, the index creation appears to be wedged like the original run, and after a few hours I killed it manually. This is a deal-killer.

So the questions are: Where do bug reports go? I seem to be running 3.8.2; is this fixed in any later version?

On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman wrote: > The metric for feasibility is coding ease, not runtime. I'm the > bottleneck, not the machine, at least at this point. > > As for adding rows, it will be about like this time: a billion or so at a > time. But there's no need to save the old data. Each round can be > separate except for a persistent "solutions" table of much more modest > size.
I've been doing this for a while now, and the solutions file has > only 10 million or so lines, each representing a game position for which > optimum moves are known. Getting this file to include the starting > position is the point of the exercise. > > If I ever get to anything like "production" in this project, I expect it > to run for maybe three years... That's after I tweak it for speed. > > Background: in production, this will be running on a dual-Xeon with 16 > cores (32 hyperthreads) and 1/4 TiB RAM. It has sequential file update > through Linux flock() calls at the moment. The code is bash gluing > together a collection of UNIX utilities and some custom C code. The C is > kept as simple as possible, to minimize errors. > > As you may surmise, this "hobby" is important to me. > > > On Thu, Aug 4, 2016 at 9:09 AM, R Smith wrote: > >> >> >> On 2016/08/04 5:56 PM, Kevin O'Gorman wrote: >> >>> On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne >>> wrote: >>> >>> >>> It's even less dense than that. Each character has only 3 possible >>> values, >>> and thus it's pretty easy to compress down to 2 bits each, for a 16 byte >>> blob. >>> It's just hard to do that without a bunch of SQLite code I'd have to >>> learn >>> how to write. The current effort amounts to a feasibility study, and I >>> want >>> to keep it as simple as possible. >>> >> >> A feasibility study using equipment that are hamstrung by weights they >> won't have in the real situation is not an accurate study. >> >> It's like studying fuel consumption on a different kind of road surface, >> but for the test purposes, the cars had to tow caravans containing their >> testing equipment - the study will not look feasible at all. >> >> It might of course be that the feasibility you are studying is completely >> unrelated to the data handling - in which case the point is moot. 
>> Let us know how it goes :)
>> Ryan

-- #define QUESTION ((bb) || (!bb)) /* Shakespeare */
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
The metric for feasibility is coding ease, not runtime. I'm the bottleneck, not the machine, at least at this point.

As for adding rows, it will be about like this time: a billion or so at a time. But there's no need to save the old data. Each round can be separate except for a persistent "solutions" table of much more modest size. I've been doing this for a while now, and the solutions file has only 10 million or so lines, each representing a game position for which optimum moves are known. Getting this file to include the starting position is the point of the exercise.

If I ever get to anything like "production" in this project, I expect it to run for maybe three years... That's after I tweak it for speed.

Background: in production, this will be running on a dual-Xeon with 16 cores (32 hyperthreads) and 1/4 TiB RAM. It has sequential file update through Linux flock() calls at the moment. The code is bash gluing together a collection of UNIX utilities and some custom C code. The C is kept as simple as possible, to minimize errors.

As you may surmise, this "hobby" is important to me.

On Thu, Aug 4, 2016 at 9:09 AM, R Smith wrote:

> On 2016/08/04 5:56 PM, Kevin O'Gorman wrote:
>> On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne wrote:
>>
>> It's even less dense than that. Each character has only 3 possible values, and thus it's pretty easy to compress down to 2 bits each, for a 16 byte blob. It's just hard to do that without a bunch of SQLite code I'd have to learn how to write. The current effort amounts to a feasibility study, and I want to keep it as simple as possible.
>
> A feasibility study using equipment that is hamstrung by weights it won't have in the real situation is not an accurate study.
>
> It's like studying fuel consumption on a different kind of road surface, but for the test purposes, the cars had to tow caravans containing their testing equipment - the study will not look feasible at all.
> It might of course be that the feasibility you are studying is completely unrelated to the data handling - in which case the point is moot.
>
> Let us know how it goes :)
> Ryan

-- #define QUESTION ((bb) || (!bb)) /* Shakespeare */
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On 2016/08/04 5:56 PM, Kevin O'Gorman wrote: On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne wrote:

It's even less dense than that. Each character has only 3 possible values, and thus it's pretty easy to compress down to 2 bits each, for a 16 byte blob. It's just hard to do that without a bunch of SQLite code I'd have to learn how to write. The current effort amounts to a feasibility study, and I want to keep it as simple as possible.

A feasibility study using equipment that is hamstrung by weights it won't have in the real situation is not an accurate study.

It's like studying fuel consumption on a different kind of road surface, but for the test purposes, the cars had to tow caravans containing their testing equipment - the study will not look feasible at all.

It might of course be that the feasibility you are studying is completely unrelated to the data handling - in which case the point is moot.

Let us know how it goes :)
Ryan
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On Thu, Aug 4, 2016 at 8:29 AM, Dominique Deviennewrote: > On Thu, Aug 4, 2016 at 5:05 PM, Kevin O'Gorman > wrote: > > > 3. Positions are 64 bytes always, so your size guesses are right. They > are > > in no particular order. I like the suggestion of a separate position > > table, because they're going to appear in multiple qmove records, with an > > average of about 3 or 4 appearances I think. Maybe more. > > > > 3x or 4x duplication doesn't sound like a lot. What you'd gain in the moves > table, > you'd lose having to join to the positions table I suspect. Higher level > duplicates, maybe. > > 64-bytes always? Is that a human readable string, or some 'code' in > hexadecimal? > If the latter, use a blob, which requires only 32-bytes to store the same > info. You can > use the blob literal notation x'abcdef01' (that's a 4-bytes blob). > It's even less dense than that. Each character has only 3 possible values, and thus it's pretty easy to compress down to 2 bits each, for a 16 byte blob. It's just hard to do that without a bunch of SQLite code I'd have to learn how to write. The current effort amounts to a feasibility study, and I want to keep it as simple as possible. > Finally, note that if your program writes a huge text file with all your > values, that you > .import into sqlite3 as you showed, you're IMHO wasting time, since you > can't use > prepared statements and binds, and you also force SQLite's SQL parser to > parse > a huge amount of text. By embedding SQLite into your generator program, you > remove all parsing except for a trivial "insert into qmoves values (:1, :2, > ...)", and > all the rest is sqlite_bind*() and co. calls. (and if blob situation for > positions, then > you can bind the 32-bytes blob directly, no need to convert/parse to/from > hex). > I understand the concept of prepared statements in principle, sort of, don't how binds work really, so I'm not quite ready to write the code you allude to. 
And I'd no longer be able to use sqlite3 at all to do simple experiments -- I'd have to be blobbing and de-blobbing to make sense of anything.

> My $0.02. --DD

-- #define QUESTION ((bb) || (!bb)) /* Shakespeare */
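For what it's worth, the 2-bits-per-character packing and the bound-blob inserts discussed above fit in a few lines of Python (the `.xo` symbol set is a guess at the three cell values; the real encoding may differ):

```python
import sqlite3

SYMBOLS = ".xo"  # assumption: the three values a board cell can take

def pack(position: str) -> bytes:
    """Pack 64 ternary cells at 2 bits each into a 16-byte blob."""
    assert len(position) == 64
    out = bytearray(16)
    for i, ch in enumerate(position):
        out[i // 4] |= SYMBOLS.index(ch) << (2 * (i % 4))
    return bytes(out)

def unpack(blob: bytes) -> str:
    """Inverse of pack(): recover the 64-character string."""
    return "".join(SYMBOLS[(blob[i // 4] >> (2 * (i % 4))) & 3] for i in range(64))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE qposn (qposn BLOB PRIMARY KEY)")
p = "xo" * 32
# Bound parameter: the blob goes in as-is, nothing for the SQL parser to chew on.
con.execute("INSERT INTO qposn VALUES (?)", (pack(p),))
stored, = con.execute("SELECT qposn FROM qposn").fetchone()
print(unpack(stored) == p)  # → True
```

Simple experiments at the sqlite3 prompt remain possible via `quote()` / blob literals, so "blobbing and de-blobbing" is an inconvenience rather than a blocker.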
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On 2016/08/04 5:05 PM, Kevin O'Gorman wrote:

Lots of answers, so thanks all around. Some more info:

1. All partitions have at least 3 GB free, and it's not changing. /tmp is 3 TiB and empty.

2. I have a RAID partition, for size, but no RAID controller. As a hobby project, I don't have spare parts, and I fear the results of a failure of a hardware RAID without a spare, so I use Linux mdadm to manage software RAID across three 4-TB drives.

3. Positions are 64 bytes always, so your size guesses are right. They are in no particular order. I like the suggestion of a separate position table, because they're going to appear in multiple qmove records, with an average of about 3 or 4 appearances I think. Maybe more.

I'm going to retry, using some of the suggestions above: smaller transactions, IGNORE, position table rowids in the moves table, smaller subsets being loaded, developing ideas of how time depends on data size. If it doesn't go well fairly quickly, I'll probably go back to flat files and writing the queries the hard way. At least I know what I'm dealing with there.

Note that there is nothing about SQLite that isn't working extremely well for your purpose - going back to flat files is akin to going back to the dark ages. Databases of this size (and even much larger) work perfectly well all around - some people here can quote you even more staggering figures - but it does require some special processing which might be thwarted by your current system layout or design choices. However, if done right, it should not be significantly slower than writing flat files - so do persist in finding out what "done right" entails. Remember that once this has worked, you can query the data with SQL... that is an amazingly powerful feature over flat files, and once an index exists (however slow it was to make), queries will be lightning fast, something a flat file can never give you.
QUESTIONS: If it's best in general to build indexes as a separate step, does this also apply to primary indexes? Can a table without a primary index have one added later? Isn't ROWID the real primary, presuming it has ROWIDs? And if so, then is a primary index on a ROWID table just for compliance with standard SQL, and really no better than any other index? Obviously, I'm a bit confused about this. A primary index is nothing other than a normal Unique index which has a guaranteed 1-to-1 key-to-row ratio and has look-up priority status. All SQL tables MUST have such a key to be able to guarantee access to any individual row, and if you omit the primary key bit, then some internal mechanism is used for it - in SQLite's case, this mechanism is called the row_id. A nice recent feature of SQLite allows you to get rid of this row_id overhead by explicitly specifying a Primary Key and then WITHOUT ROWID after the Table declaration. The advantage of NOT using an explicit Primary key from the start is that the row_id can simply be added by increments during insert statements due to its auto-supplied-by-the-db-engine and INT type nature. A primary key has to be sorted (or at least, has to determine the correct sort-order insert position) for every added row. This costs a good bit of time on really large insert operations. The bulky single sort operation while building the PK index after-the-fact takes a lot less time than the repeated look-up-insert operations for the key while making the table. I hope this answers the question, but feel free to ask more if anything remains unclear. While I'm at it, I may as well ask if ROWID has any physical significance, such that a VACUUM operation might change it. Or is it just an arbitrary ID inserted by SQLite and added to each record when they exist at all. It is arbitrary, it is supplied by the DB engine in general, but it will never be changed by anything! That would go against all SQL and RDBMS premises. 
You could supply it yourself too (and I often advocate this to be the better practice). If you declare a table with a primary key specified /exactly/ like this:

CREATE TABLE myTable ("ID" INTEGER PRIMARY KEY, ...)

Then the "ID" in this case becomes an alias for the row_id, and manipulating/reading the value in it is in fact reading/manipulating the actual row_id. (The "ID" can be anything else you like, but the "INTEGER PRIMARY KEY" part needs to be written exactly like that.)

The current dataset is intended to solve one particular issue in the overall project. It looks like I'd want to build each such dataset separately, as there will likely be a few hundred, and I gather that adding to these tables will be pretty slow once the indexes have been built. Or is it sensible to drop indexes, add data and rebuild?

No, this is not true - it will be really fast to add another few rows... it's just slow when you add the initial few zillion rows due to simple laws of quantity. If however you will be adding rows at an amazing rate, I
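The rowid aliasing described above is easy to verify; a small sketch using Python's sqlite3 module (table name made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# "INTEGER PRIMARY KEY" (exactly that type) makes the column an alias for rowid.
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO t (name) VALUES ('a')")  # id auto-assigned by the engine
con.execute("INSERT INTO t (name) VALUES ('b')")
# id and rowid read from the same storage, so the pairs match.
print(con.execute("SELECT id, rowid FROM t ORDER BY id").fetchall())  # → [(1, 1), (2, 2)]
```

With `INT PRIMARY KEY` (not `INTEGER`) the aliasing does not happen and a separate index is created instead.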
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On Thu, Aug 4, 2016 at 5:29 PM, Dominique Devienne wrote: > [...] you also force SQLite's SQL parser to parse a huge amount of text. > [...]

OK, maybe not the SQL parser, depends what you write out and the .import mode (I guess, didn't look into the details). But for sure "some" parser (CSV, SQL, or else...). So I'm suggesting to eliminate the text "middle man", and use SQLite directly with native values (ints, strings, blobs, whatever). Of course I assume your program is C/C++, which given your sizes, it had better be :). --DD
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On Thu, Aug 4, 2016 at 5:05 PM, Kevin O'Gorman wrote: > 3. Positions are 64 bytes always, so your size guesses are right. They are > in no particular order. I like the suggestion of a separate position > table, because they're going to appear in multiple qmove records, with an > average of about 3 or 4 appearances I think. Maybe more.

3x or 4x duplication doesn't sound like a lot. What you'd gain in the moves table, you'd lose having to join to the positions table, I suspect. Higher level duplicates, maybe.

64 bytes always? Is that a human readable string, or some 'code' in hexadecimal? If the latter, use a blob, which requires only 32 bytes to store the same info. You can use the blob literal notation x'abcdef01' (that's a 4-byte blob).

Finally, note that if your program writes a huge text file with all your values, that you .import into sqlite3 as you showed, you're IMHO wasting time, since you can't use prepared statements and binds, and you also force SQLite's SQL parser to parse a huge amount of text. By embedding SQLite into your generator program, you remove all parsing except for a trivial "insert into qmoves values (:1, :2, ...)", and all the rest is sqlite3_bind*() and co. calls. (And if positions are blobs, you can bind the 32-byte blob directly, no need to convert/parse to/from hex.)

My $0.02. --DD
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
Lots of answers, so thanks all around. Some more info: 1. All partitions have at least 3 GB free, and it's not changing. /tmp is 3 TiB and empty. 2. I have a RAID partition, for size, but no RAID controller. As a hobby project, I don't have spare parts, and I fear the results of a failure of a hardware RAID without a spare, so I use Linux mdadm to manage software RAID across three 4-TB drives. 3. Positions are 64 bytes always, so your size guesses are right. They are in no particular order. I like the suggestion of a separate position table, because they're going to appear in multiple qmove records, with an average of about 3 or 4 appearances I think. Maybe more. I'm going to retry, using some of the suggestions above: smaller transactions, IGNORE, position table rowids in the moves table, smaller subsets being loaded, developing ideas of how time depends on data size. If it doesn't go well fairly quickly, I'll probably go back to flat files and writing the queries the hard way. At least I know what I'm dealing with there. QUESTIONS: If it's best in general to build indexes as a separate step, does this also apply to primary indexes? Can a table without a primary index have one added later? Isn't ROWID the real primary, presuming it has ROWIDs? And if so, then is a primary index on a ROWID table just for compliance with standard SQL, and really no better than any other index? Obviously, I'm a bit confused about this. While I'm at it, I may as well ask if ROWID has any physical significance, such that a VACUUM operation might change it. Or is it just an arbitrary ID inserted by SQLite and added to each record when they exist at all. The current dataset is intended to solve one particular issue in the overall project. It looks like I'd want to build each such dataset separately, as there will likely be a few hundred, and I gather that adding to these tables will be pretty slow once the indexes have been built. Or is it sensible to drop indexes, add data and rebuild? 
On Thu, Aug 4, 2016 at 7:27 AM, Jim Callahanwrote: > Temp Files > Have you checked how much storage is available to the temporary file > locations? > The temporary file locations are different depending on the OS, build, VFS > and PRAGMA settings. > See the last section "5.0 Temporary File Storage Locations" of: > https://www.sqlite.org/tempfiles.html > > > The database was growing for about 1-1/2 days. Then its journal > > disappeared, the file size dropped to zero, but sqlite3 is still running > > 100% CPU time, now for a total of 3800+ minutes (63+ hours). The > database > > is still locked, but I have no idea what sqlite3 is doing, or if it will > > ever stop. All partitions still have lots of space left (most of this is > > running in a RAID partition of 11 TiB). Here's what I gave to sqlite3 on > > my Linux system: > > > You might have a huge storage allocation for the main file and log, but > some other temp file might be being dumped > to a more constrained storage location. > > RAM > Since you are using RAID disk controller; I assume you have 64 bit CPU and > more than 8 GB of RAM? > If you have 8 GB or more of RAM would it help to use an in memory database? > > Transactions > Are you using explicit or implicit transactions? > https://www.sqlite.org/lang_transaction.html > > > Steps > Agree with Darren Duncan and Dr. Hipp you may want to have at least 3 > separate steps > (each step should be a separate transaction): > > 1. Simple load > 2. Create additional column > 3. Create index > > Have you pre-defined the table you are loading data into? (step 0 CREATE > TABLE) > > If "Step 1 Simple Load" does not complete; then may want to load a fixed > number of rows into separate tables (per Darren Duncan) and then combine > using an APPEND > or a UNION query (doing so before steps 2 and 3). 
> > HTH > > Jim Callahan > Data Scientist > Orlando, FL > > > > > On Wed, Aug 3, 2016 at 11:00 PM, Kevin O'Gorman > wrote: > > > I'm working on a hobby project, but the data has gotten a bit out of > hand. > > I thought I'd put it in a real database rather than flat ASCII files. > > > > I've got a problem set of about 1 billion game positions and 187GB to > work > > on (no, I won't have to solve them all) that took about 4 hours for a > > generator program just to write. I wrote code to turn them into > something > > SQLite could import. Actually, it's import, build a non-primary index, > and > > alter table to add a column, all in sqlite3. > > > > The database was growing for about 1-1/2 days. Then its journal > > disappeared, the file size dropped to zero, but sqlite3 is still running > > 100% CPU time, now for a total of 3800+ minutes (63+ hours). The > database > > is still locked, but I have no idea what sqlite3 is doing, or if it will > > ever stop. All partitions still have lots of space left (most of this is > > running in a RAID partition of 11 TiB).
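The load-first, index-later recipe in the steps above can be sketched end to end; Python's sqlite3 module stands in for the shell pipeline here, and the table layout echoes the thread's qmoves schema with made-up row values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Step 0: pre-defined table, no constraints that fire during the bulk load.
con.execute("CREATE TABLE qmoves (qfrom TEXT, qmove INT, qto TEXT, qweight INT)")

# Step 1: simple load in one explicit transaction, bound parameters
# instead of megabytes of SQL text to parse.
rows = ((f"p{i}", i % 7, f"p{i + 1}", i % 3) for i in range(10_000))
with con:
    con.executemany("INSERT INTO qmoves VALUES (?,?,?,?)", rows)

# Steps 2 and 3: schema change and index build only after the data is in.
with con:
    con.execute("ALTER TABLE qmoves ADD COLUMN qstatus INT DEFAULT NULL")
    con.execute("CREATE INDEX qmoves_by_dest ON qmoves (qto, qweight)")

print(con.execute("SELECT count(*) FROM qmoves").fetchone()[0])  # → 10000
```

Each `with con:` block is its own transaction, matching the advice that load, column addition, and index creation be separate steps.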
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
Temp Files Have you checked how much storage is available to the temporary file locations? The temporary file locations are different depending on the OS, build, VFS and PRAGMA settings. See the last section "5.0 Temporary File Storage Locations" of: https://www.sqlite.org/tempfiles.html The database was growing for about 1-1/2 days. Then its journal > disappeared, the file size dropped to zero, but sqlite3 is still running > 100% CPU time, now for a total of 3800+ minutes (63+ hours). The database > is still locked, but I have no idea what sqlite3 is doing, or if it will > ever stop. All partitions still have lots of space left (most of this is > running in a RAID partition of 11 TiB). Here's what I gave to sqlite3 on > my Linux system: You might have a huge storage allocation for the main file and log, but some other temp file might be being dumped to a more constrained storage location. RAM Since you are using RAID disk controller; I assume you have 64 bit CPU and more than 8 GB of RAM? If you have 8 GB or more of RAM would it help to use an in memory database? Transactions Are you using explicit or implicit transactions? https://www.sqlite.org/lang_transaction.html Steps Agree with Darren Duncan and Dr. Hipp you may want to have at least 3 separate steps (each step should be a separate transaction): 1. Simple load 2. Create additional column 3. Create index Have you pre-defined the table you are loading data into? (step 0 CREATE TABLE) If "Step 1 Simple Load" does not complete; then may want to load a fixed number of rows into separate tables (per Darren Duncan) and then combine using an APPEND or a UNION query (doing so before steps 2 and 3). HTH Jim Callahan Data Scientist Orlando, FL On Wed, Aug 3, 2016 at 11:00 PM, Kevin O'Gormanwrote: > I'm working on a hobby project, but the data has gotten a bit out of hand. > I thought I'd put it in a real database rather than flat ASCII files. 
> I've got a problem set of about 1 billion game positions and 187GB to work on (no, I won't have to solve them all) that took about 4 hours for a generator program just to write. I wrote code to turn them into something SQLite could import. Actually, it's import, build a non-primary index, and alter table to add a column, all in sqlite3.
>
> The database was growing for about 1-1/2 days. Then its journal disappeared, the file size dropped to zero, but sqlite3 is still running 100% CPU time, now for a total of 3800+ minutes (63+ hours). The database is still locked, but I have no idea what sqlite3 is doing, or if it will ever stop. All partitions still have lots of space left (most of this is running in a RAID partition of 11 TiB). Here's what I gave to sqlite3 on my Linux system:
>
> time sqlite3 qubic.db <<EOF
> BEGIN EXCLUSIVE TRANSACTION;
> DROP TABLE IF EXISTS qmoves;
> CREATE TABLE qmoves (
>   qfrom CHAR(64),
>   qmove INT,
>   qto CHAR(64),
>   qweight INT,
>   PRIMARY KEY (qfrom, qmove) ON CONFLICT ROLLBACK
> );
> CREATE INDEX IF NOT EXISTS qmoves_by_dest ON qmoves (
>   qto,
>   qweight
> );
> CREATE TABLE IF NOT EXISTS qposn (
>   qposn CHAR(64) PRIMARY KEY ON CONFLICT ROLLBACK,
>   qmaxval INT,
>   qmove INT,
>   qminval INT,
>   qstatus INT
> );
> .separator " "
> .import am.all qmoves
> ALTER TABLE qmoves ADD COLUMN qstatus INT DEFAULT NULL;
> .schema
> COMMIT TRANSACTION;
> EOF
>
> Any clues, hints, or advice?
>
> --
> #define QUESTION ((bb) || (!bb)) /* Shakespeare */
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On 8/4/16, Wade, William wrote: > > I believe that with SQLite, if you don't specify WITHOUT ROWID your "real" > record order is based on rowid,

Correct

> > In principle, indices can be created by writing the needed information > (index key, record position) in the original order, and then sorting that > into key-order. That can be done with many less random seeks (merge sorts > involve mostly sequential reads and writes). I don't know if, or when, > SQLite does that.

SQLite runs CREATE INDEX commands using an external merge sort, which is what I think you are alluding to above. But if the index already exists, and you are merely inserting new rows into the table, then each index entry is inserted separately. Each such insert is an O(log N) operation. Such isolated inserts typically involve a lot of disk seeks and write amplification. That is why we recommend that you populate large tables first and then run CREATE INDEX, rather than the other way around, whenever practical.

-- D. Richard Hipp d...@sqlite.org
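The populate-then-index recommendation above can be measured on a small scale. A rough timing sketch (results vary by machine, and an in-memory database understates the disk-seek costs being described, so the gap on real rotating storage would be far larger):

```python
import random
import sqlite3
import time

def load(index_first: bool, n: int = 50_000) -> float:
    """Return seconds to insert n random rows plus build the index."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (k INT)")
    data = [(random.randrange(1 << 30),) for _ in range(n)]
    t0 = time.perf_counter()
    if index_first:
        con.execute("CREATE INDEX t_k ON t(k)")  # every insert also updates the B-tree
    with con:
        con.executemany("INSERT INTO t VALUES (?)", data)
    if not index_first:
        con.execute("CREATE INDEX t_k ON t(k)")  # one external merge sort at the end
    return time.perf_counter() - t0

print(f"index first: {load(True):.3f}s  index after: {load(False):.3f}s")
```

The second number is typically the smaller one, for exactly the reason given: one bulk sort instead of n isolated O(log N) insertions.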
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
A lot of speculation here. I am certainly no SQLite expert.

Your input has 1g positions, taking 187gb, so averaging 187b/position. From your CREATE TABLE, it looks like to get that size most of your qfrom and qto are fairly long strings. I'm assuming there are a great many duplications in those positions. If so, put them into a separate position table { positionId INT, positionName TEXT }, with positionId as the primary key and positionName also being unique. This will be even more useful if you have a fixed set of possible positions, and you make it so that positionId is increasing whenever positionName is increasing. In your qmoves table, store positionId values, rather than positionName values. Saves a lot of space because no name is in the database more than once, and most of your space is related to names.

Space is important, because at a few hundred gb, your database is probably bigger than all of your available fast caches, and you are probably storing your data on rotating storage. Writes to random positions might easily average 10ms, and the import of most of your records may involve one or more such writes by the time indices are updated. Reducing sizes means fewer such writes, because things are more likely to fit in the various caches.

I believe that with SQLite, if you don't specify WITHOUT ROWID your "real" record order is based on rowid, so complete records are, perhaps, stored in import order. The PRIMARY KEY you specified is large (a name and an integer), so the index entries are likely about half as large as a complete record, and they don't fit in cache either. It is also likely that your input data was not in sorted order. That means that adding an entry to that index likely involves a write to a random position. 10ms * 1g = 10e6s, or about three months. Your qmoves_by_dest index is about as large as your primary (it also has a position and an integer), so similar timings might apply.
I suggest timing your script on smaller inputs (1GB, 10GB, 20GB) and seeing if there is a size at which things suddenly get worse (the available caches are no longer big enough). See if my suggestions help those timings. See if WITHOUT ROWID helps those timings.

In principle, indices can be created by writing the needed information (index key, record position) in the original order, and then sorting that into key order. That can be done with far fewer random seeks (merge sorts involve mostly sequential reads and writes). I don't know if, or when, SQLite does that.

Regards,
Bill

-Original Message-
From: Kevin O'Gorman [mailto:kevinogorm...@gmail.com]
Sent: Wednesday, August 03, 2016 10:00 PM
To: sqlite-users
Subject: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

I'm working on a hobby project, but the data has gotten a bit out of hand. I thought I'd put it in a real database rather than flat ASCII files.

I've got a problem set of about 1 billion game positions and 187GB to work on (no, I won't have to solve them all) that took about 4 hours for a generator program just to write. I wrote code to turn them into something SQLite could import. Actually, it's import, build a non-primary index, and alter table to add a column, all in sqlite3.

The database was growing for about 1-1/2 days. Then its journal disappeared, the file size dropped to zero, but sqlite3 is still running 100% CPU time, now for a total of 3800+ minutes (63+ hours). The database is still locked, but I have no idea what sqlite3 is doing, or if it will ever stop. All partitions still have lots of space left (most of this is running in a RAID partition of 11 TiB).

Here's what I gave to sqlite3 on my Linux system:

time sqlite3 qubic.db <
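The sort-then-merge index build Bill describes can be sketched in a few lines of Python. The entries and run size are invented for illustration; a real external sort would spill sorted runs to temp files rather than keep them in lists:

```python
import heapq

# Sketch of building an index by sorting (key, rowid) pairs instead of
# inserting each key into a B-tree at a random position: emit pairs in
# arrival order, sort fixed-size runs, then merge the runs with
# mostly-sequential reads.  In-memory lists stand in for temp files.
entries = [("qB", 0), ("qA", 1), ("qD", 2), ("qC", 3), ("qA", 4)]
run_size = 2
runs = [sorted(entries[i:i + run_size])
        for i in range(0, len(entries), run_size)]
index = list(heapq.merge(*runs))  # k-way merge of the sorted runs
print(index)
```

Each run is read and written sequentially, which is why this approach avoids the 10 ms random-seek penalty estimated above.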
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
Hello Kevin,

I'd write a utility to do it instead of using the command line tool, then add logging to the program in order to note progress. I like the idea of chopping it into smaller parts too.

"ON CONFLICT ROLLBACK" -- You're doing one large transaction, and if it rolls back it'll have to undo everything, right? I wonder if that's what you're seeing here. You might want to try "IGNORE" to see if you can even run through all the data. Writing a utility for this would let you manage the rollbacks too.

C

Wednesday, August 3, 2016, 11:00:12 PM, you wrote:

KOG> I'm working on a hobby project, but the data has gotten a bit out of hand.
KOG> I thought I'd put it in a real database rather than flat ASCII files.
KOG> I've got a problem set of about 1 billion game positions and 187GB to work
KOG> on (no, I won't have to solve them all) that took about 4 hours for a
KOG> generator program just to write. I wrote code to turn them into something
KOG> SQLite could import. Actually, it's import, build a non-primary index, and
KOG> alter table to add a column, all in sqlite3.
KOG> The database was growing for about 1-1/2 days. Then its journal
KOG> disappeared, the file size dropped to zero, but sqlite3 is still running
KOG> 100% CPU time, now for a total of 3800+ minutes (63+ hours). The database
KOG> is still locked, but I have no idea what sqlite3 is doing, or if it will
KOG> ever stop. All partitions still have lots of space left (most of this is
KOG> running in a RAID partition of 11 TiB).
Here's what I gave to sqlite3 on my Linux system:

KOG> time sqlite3 qubic.db <<EOF
KOG> BEGIN EXCLUSIVE TRANSACTION;
KOG> DROP TABLE IF EXISTS qmoves;
KOG> CREATE TABLE qmoves (
KOG>   qfrom CHAR(64),
KOG>   qmove INT,
KOG>   qto CHAR(64),
KOG>   qweight INT,
KOG>   PRIMARY KEY (qfrom, qmove) ON CONFLICT ROLLBACK
KOG> );
KOG> CREATE INDEX IF NOT EXISTS qmoves_by_dest ON qmoves (
KOG>   qto,
KOG>   qweight
KOG> );
KOG> CREATE TABLE IF NOT EXISTS qposn (
KOG>   qposn CHAR(64) PRIMARY KEY ON CONFLICT ROLLBACK,
KOG>   qmaxval INT,
KOG>   qmove INT,
KOG>   qminval INT,
KOG>   qstatus INT
KOG> );
KOG> .separator " "
KOG> .import am.all qmoves
KOG> ALTER TABLE qmoves ADD COLUMN qstatus INT DEFAULT NULL;
KOG> .schema
KOG> COMMIT TRANSACTION;
KOG> EOF

KOG> Any clues, hints, or advice?

--
Teg <t...@djii.com>

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
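The kind of utility Teg describes (chunked transactions, progress logging, and IGNORE instead of ROLLBACK so one duplicate key cannot undo days of work) might look like this sketch. The schema follows the posted script, but the batch size, file handling, and return value are my own assumptions:

```python
import sqlite3
import sys
import time

def import_chunked(db_path, data_path, batch=100_000):
    """Import whitespace-separated (qfrom, qmove, qto, qweight) records
    in modest transactions with progress logging, instead of one giant
    EXCLUSIVE transaction.  INSERT OR IGNORE skips duplicate keys rather
    than rolling back the whole import.  Returns the final row count."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qmoves ("
        " qfrom CHAR(64), qmove INT, qto CHAR(64), qweight INT,"
        " qstatus INT DEFAULT NULL,"
        " PRIMARY KEY (qfrom, qmove))")
    start = time.time()
    done = 0
    rows = []

    def flush():
        nonlocal done, rows
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT OR IGNORE INTO qmoves (qfrom, qmove, qto, qweight)"
                " VALUES (?, ?, ?, ?)", rows)
        done += len(rows)
        rows = []
        # progress log so a multi-day run is never a mystery
        print(f"{done} rows in {time.time() - start:.0f}s", file=sys.stderr)

    with open(data_path) as f:
        for line in f:
            rows.append(line.split())
            if len(rows) >= batch:
                flush()
    if rows:
        flush()
    return conn.execute("SELECT COUNT(*) FROM qmoves").fetchone()[0]
```

Smaller transactions also bound the journal size, so a rollback after a conflict would lose at most one batch rather than the whole run.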
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
On 4 Aug 2016, at 4:00am, Kevin O'Gorman wrote:

> I've got a problem set of about 1 billion game positions and 187GB to work
> on (no, I won't have to solve them all) that took about 4 hours for a
> generator program just to write. I wrote code to turn them into something
> SQLite could import. Actually, it's import, build a non-primary index, and
> alter table to add a column, all in sqlite3.

Hmm. The closest I have to that is a 43 Gigabyte SQLite database which works fine. In that case almost all the space is taken up by one thin table which has a ridiculous number of rows.

> The database was growing for about 1-1/2 days. Then its journal
> disappeared, the file size dropped to zero, but sqlite3 is still running

I assume that those are what 'ls' is showing you. Have you tried using 'lsof' or 'strace'? Or monitoring the amount of free space on the disk to see if it's shrinking?

Simon.
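The /proc inspection mentioned earlier in the thread can be scripted. This sketch counts a process's open descriptors on Linux; in practice pid would be the wedged sqlite3 process (found with pgrep), but here it inspects its own process so it is runnable:

```python
import os

def open_fd_count(pid):
    """Count open file descriptors of a process via Linux /proc,
    including anonymous temp files that 'ls' on the directory won't show.
    Shell equivalents: ls /proc/PID/fd | wc -l, or lsof -p PID for
    details, or strace -p PID for a live system-call trace."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Inspect ourselves for illustration; use the sqlite3 PID in practice.
print(open_fd_count(os.getpid()))
```

Checking this repeatedly (alongside df on the RAID partition) distinguishes "still writing temp files" from "spinning on the CPU".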
Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.
One way to get a clue is to try doing this in stages. First start over and import a much smaller amount of data, say a 1GB fraction; see if that completes, and if it does, how long it takes and what it does to disk, memory, and so on. If 1GB doesn't work, start smaller yet, until you have a size that does work.

When something works, next try double the size and see whether the resource usage is about linear or not. Then double it again, and so on, to build up a time/space progression. That should help you predict how long the full 187GB would take were it successful, or at some point you will find the smallest size at which the hang occurs.

-- Darren Duncan

On 2016-08-03 8:00 PM, Kevin O'Gorman wrote:

> I'm working on a hobby project, but the data has gotten a bit out of hand.
> I thought I'd put it in a real database rather than flat ASCII files.
> I've got a problem set of about 1 billion game positions and 187GB to work
> on (no, I won't have to solve them all) that took about 4 hours for a
> generator program just to write. I wrote code to turn them into something
> SQLite could import. Actually, it's import, build a non-primary index, and
> alter table to add a column, all in sqlite3.
> The database was growing for about 1-1/2 days. Then its journal
> disappeared, the file size dropped to zero, but sqlite3 is still running
> 100% CPU time, now for a total of 3800+ minutes (63+ hours). The database
> is still locked, but I have no idea what sqlite3 is doing, or if it will
> ever stop. All partitions still have lots of space left (most of this is
> running in a RAID partition of 11 TiB).
> Here's what I gave to sqlite3 on my Linux system:
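Darren's doubling experiment can be sketched in a few lines. The synthetic 64-character positions and the in-memory database are my own stand-ins; a real run should use an on-disk file and samples of the actual 187GB input:

```python
import random
import sqlite3
import time

def time_import(n_rows):
    """Time importing n_rows synthetic (qfrom, qmove, qto, qweight)
    records plus building the secondary index, using the schema from
    the original post.  Synthetic data and :memory: are assumptions
    for the sketch; real runs need real data on disk."""
    rng = random.Random(42)
    def pos():
        return "".join(rng.choice("xo.") for _ in range(64))
    rows = [(pos(), i, pos(), rng.randint(0, 9)) for i in range(n_rows)]
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE qmoves ("
        " qfrom CHAR(64), qmove INT, qto CHAR(64), qweight INT,"
        " PRIMARY KEY (qfrom, qmove))")
    t0 = time.time()
    with conn:
        conn.executemany("INSERT INTO qmoves VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX qmoves_by_dest ON qmoves (qto, qweight)")
    return time.time() - t0

# Double the size each run and watch whether time grows roughly linearly
# or bends sharply upward once caches stop fitting the working set.
for n in (1000, 2000, 4000):
    print(n, f"{time_import(n):.3f}s")
```

A sharp knee in that curve marks the size where the working set falls out of cache, which is where the full import would start paying 10 ms per random write.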
[sqlite] newbie has waited days for a DB build to complete. what's up with this.
I'm working on a hobby project, but the data has gotten a bit out of hand. I thought I'd put it in a real database rather than flat ASCII files.

I've got a problem set of about 1 billion game positions and 187GB to work on (no, I won't have to solve them all) that took about 4 hours for a generator program just to write. I wrote code to turn them into something SQLite could import. Actually, it's import, build a non-primary index, and alter table to add a column, all in sqlite3.

The database was growing for about 1-1/2 days. Then its journal disappeared, the file size dropped to zero, but sqlite3 is still running 100% CPU time, now for a total of 3800+ minutes (63+ hours). The database is still locked, but I have no idea what sqlite3 is doing, or if it will ever stop. All partitions still have lots of space left (most of this is running in a RAID partition of 11 TiB).

Here's what I gave to sqlite3 on my Linux system:

time sqlite3 qubic.db