Re: [sqlite] row size limit
On Mon, 2004-04-19 at 20:36, D. Richard Hipp wrote: > Mrs. Brisby wrote: > >> > >>The linked-list structure of overflow storage is part of the problem. > >>But the fact that SQLite uses synchronous I/O is also a factor. In > >>order to make BLOBs fast in SQLite, I would have to change to a different > >>indexing technique for overflow storage *and* come up with some kind > >>of cross-platform, asynchronous disk read mechanism. > > > > > > D.R.Morrison (1968)'s PATRICIA would certainly be faster for indexing > > large objects. > > > > A key feature of SQLite 3.0 (needed to support internationalization) > is the ability of users to specify their own comparison functions then > have SQLite use that comparison function to order indices. PATRICIA > does not support user-defined comparison functions. Keys in PATRICIA > must occur in memcmp() order, as far as I am aware. Why not fold the strings at insert time to keep your indexing simple? You can still get internationalization, but require the user supply a function with performs this folding: รก -> a for example. > > Asynchronous read isn't necessary, but vectored reads are. Consider > > readv() POSIX 1003.1-2001 -- in fact, you could probably make > > result-fields return a struct iovec * that would "point" to the value > > within the database. > > > > readv() doesn't help, actually. BLOBs are stored in 1k blocks scattered > all over the file. readv() reads a continguous range of bytes - it > puts those bytes into scattered buffers but the bytes must originate > from a contiguous region of the file. I'd still have to do 1024 > sequential readv()s in order to extract a 1MB blob. My brain fizzled out there for a moment. I don't know where I was. Sadly you're right. While, POSIX 1003.1-2003 does define aio_read(), it still is a portability nightmare. On systems where context switches are cheap, one could use fork() or posix threads to populate a number of pipes, but doubt this would buy much (if anything)... ever. Wouldn't it be nice if poll() actually did something interesting with regular files? :) Looks to me like you can either make two I/O policies (or more), sort your reads/seeks, OR move the blobs into another file :) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
Mrs. Brisby wrote: The linked-list structure of overflow storage is part of the problem. But the fact that SQLite uses synchronous I/O is also a factor. In order to make BLOBs fast in SQLite, I would have to change to a different indexing technique for overflow storage *and* come up with some kind of cross-platform, asynchronous disk read mechanism. D.R.Morrison (1968)'s PATRICIA would certainly be faster for indexing large objects. A key feature of SQLite 3.0 (needed to support internationalization) is the ability of users to specify their own comparison functions then have SQLite use that comparison function to order indices. PATRICIA does not support user-defined comparison functions. Keys in PATRICIA must occur in memcmp() order, as far as I am aware. Asynchronous read isn't necessary, but vectored reads are. Consider readv() POSIX 1003.1-2001 -- in fact, you could probably make result-fields return a struct iovec * that would "point" to the value within the database. readv() doesn't help, actually. BLOBs are stored in 1k blocks scattered all over the file. readv() reads a continguous range of bytes - it puts those bytes into scattered buffers but the bytes must originate from a contiguous region of the file. I'd still have to do 1024 sequential readv()s in order to extract a 1MB blob. -- D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
On Mon, 2004-04-19 at 07:04, D. Richard Hipp wrote: > Darren Duncan wrote: > > > > I think the simple answer is that SQLite uses a linked list which can't > > know where a page is until reading the next one, but other databases use > > something other than a linked list; they would trade a bit of complexity > > for speed. -- Darren Duncan > > > > The linked-list structure of overflow storage is part of the problem. > But the fact that SQLite uses synchronous I/O is also a factor. In > order to make BLOBs fast in SQLite, I would have to change to a different > indexing technique for overflow storage *and* come up with some kind > of cross-platform, asynchronous disk read mechanism. D.R.Morrison (1968)'s PATRICIA would certainly be faster for indexing large objects. For those of you without google, PATRICIA is called "Crit-bit trees" by DJB, and "supports the following operations at high speed: * See whether a string x is in the tree. * Add x to the tree. * Remove x from the tree. * Find the lexicographically smallest string in the tree larger than x, if there is one. It essentially works by storing a compressed pointer to the first unequal bit in the key. This means comparisons aren't necessary to traverse the nodes (but they can be needed to add nodes!) http://cr.yp.to/critbit.html http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA/ Asynchronous read isn't necessary, but vectored reads are. Consider readv() POSIX 1003.1-2001 -- in fact, you could probably make result-fields return a struct iovec * that would "point" to the value within the database. Finally, it may be worthwhile to finally allow some portions of the database to be stored outside the main file and only store indexes in the main file. This can give huge performance increases for large blobs and wouldn't (necessarily) require a file format change if older programs were prepared for the fact that they might not be able to do anything useful to the value returned :) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
Darren Duncan wrote: I think the simple answer is that SQLite uses a linked list which can't know where a page is until reading the next one, but other databases use something other than a linked list; they would trade a bit of complexity for speed. -- Darren Duncan The linked-list structure of overflow storage is part of the problem. But the fact that SQLite uses synchronous I/O is also a factor. In order to make BLOBs fast in SQLite, I would have to change to a different indexing technique for overflow storage *and* come up with some kind of cross-platform, asynchronous disk read mechanism. -- D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
That info is actually incorrect. WinFS is still scheduled for Longhorn. The tech media misinterpreted something. On Apr 18, 2004, at 8:31 PM, Greg Miller wrote: Greg Obleshchuk wrote: I know the MS is looking at replacing the file system with the SQL engine in Longhorn so they must have solved the issue. They're not replacing NTFS with a database. They're implementing a database layer (WinFS) on top of NTFS. It's not entirely clear what they're doing, but apparently they index a bunch of metadata in order to help manage files. On a related note, this has apparently been pushed back to a post-Longhorn release so that they can keep Longhorn on track for 2006. That, at least, is what the tech media is reporting. -- http://www.classic-games.com/ http://www.indie-games.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
Greg Obleshchuk wrote: I know the MS is looking at replacing the file system with the SQL engine in Longhorn so they must have solved the issue. They're not replacing NTFS with a database. They're implementing a database layer (WinFS) on top of NTFS. It's not entirely clear what they're doing, but apparently they index a bunch of metadata in order to help manage files. On a related note, this has apparently been pushed back to a post-Longhorn release so that they can keep Longhorn on track for 2006. That, at least, is what the tech media is reporting. -- http://www.classic-games.com/ http://www.indie-games.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
These disk access issues are why no database I know of actually stores large objects inline. It would be crazy to do so. mysql, postgres, and oracle all have support for blobs, and none of them store them inline. (btw, if you care about disk io performance for blobs, you can tune the fs parameters for such large files.) blobs are a pain no matter how you cut it. i've built different applications with them in the database, and without (storing just filenames, with files maintained by the app, as drh suggested). I've regretted both approaches :(. The problems I've seen with not using database blobs: 1. kiss replication goodbye. If you are careful with file naming, and never modify blobs, and use rsync, you can just barely get by. 2. You can't do text matching as part of a sql request. You have to do your own external text indexing and join at the app layer. 3. You have to implement your own application-level transaction logic, so that during updates and deletes of both database and file system are inconsistent. Except that is hard to do right, so over time the two *will* get inconsistent. 4. You have to have a shared file system for your database clients. Which means NFS or samba, or something else which is painful to set up and administer and is difficult to secure over a WAN. The problems I've seen with using database blobs: 1. The blob-specific APIs are also usually database-specific. They are typically poorly designed. Now with JDBC3/ODBC3, they are at least not database-specific 2. The semantics of blob modification within a larger transaction is usually poorly documented, or buggy, or both. 3. You don't get to play with the blobs in the external file system when you want to (even if for read only purposes). 4. Performance can still stink, unless you are careful. At the wire level, the fetch blocks usually only contain "locator" objects, and these are converted to data streams only when asked, and that usually requires a new independent roundtrip and set up for every blob. So you end up doing things like using sql functions which fetch the first 1000 bytes of the blobs (if that is what your app wants), so that they are streamed with the rest of the fetch. -mda - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
At 10:59 AM +1000 4/19/04, Greg Obleshchuk wrote: I guess it would depend on the system. I assume (and may ask) that MS SQL and Oracle use multi-threaded processes to access the information and that is the way they get around it. I know the MS is looking at replacing the file system with the SQL engine in Longhorn so they must have solved the issue. The speed that they can generate from queries would indicate that they have solved the issue. But perhaps the smaller systems PostgreSQL and/or MySQL don't. I think the simple answer is that SQLite uses a linked list which can't know where a page is until reading the next one, but other databases use something other than a linked list; they would trade a bit of complexity for speed. -- Darren Duncan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
At 7:50 PM -0400 4/18/04, D. Richard Hipp wrote: Suppose you have a 1MB row in SQLite and you want to read the whole thing. SQLite must first ask for the 1st 1K page and wait for it to be retrieved. Then it asks for the 2nd 1K page and waits for it. And so forth for all 1000+ pages. If each page retrieval requires 1 rotation of the disk, that's about 8.5 seconds on a 7200RPM disk drive. I assume that the reason for this is that you don't know in advance which are all the pages that a row is stored in, such that you could just ask the database to read 1000 blocks; you would have to look at the overflow indicator within a block to know where the next one is. While that may be the case now, I would like to suggest an optimization feature for 3.0 that could work around the problem and make working with large rows much faster. This suggestion has at least two variations to choose from; hopefully each would be simple to implement. ... I'm suddenly on the clock, so I'll continue now briefly, and possibly expand later if there are questions. 1. Add an extra detail or two in references to overflow pages (usually just to the first overflow page) which say that the first/next N overflow pages are in contiguous blocks. Then, when reading, you can ask the operating system to return that many pages all at once. Presumably this would work if the entire row or large value is written at once such as on an insert; you can specify that the row is stored in contiguous pages to begin with. Something along those lines. -- Darren Duncan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
>>Indeed. But I wonder if most all databases do it the same way? Or do >>all file-based dbs do it the same way? etc. I guess it would depend on the system. I assume (and may ask) that MS SQL and Oracle use multi-threaded processes to access the information and that is the way they get around it. I know the MS is looking at replacing the file system with the SQL engine in Longhorn so they must have solved the issue. The speed that they can generate from queries would indicate that they have solved the issue. But perhaps the smaller systems PostgreSQL and/or MySQL don't. Greg - Original Message - From: Puneet Kishor To: SQLite Sent: Monday, April 19, 2004 10:41 AM Subject: Re: [sqlite] row size limit On Apr 18, 2004, at 7:31 PM, Greg Obleshchuk wrote: > Hi Richard, > You know that is the first clear and concise explanation of why not to > store large blobs in a database that I have heard anywhere. Indeed. But I wonder if most all databases do it the same way? Or do all file-based dbs do it the same way? etc. Nice explanation though. I am currently working on a web-based image gallery application powered by Oracle and am storing file names in the db while the images themselves are stored on the disk. Some said why don't I just stick the images in the db and I said no... the OS would be better and more flexible at managing the physical files. Now I have a more scientific sounding answer to back my assertion. ;-) > > Greg > - Original Message - > From: D. Richard Hipp > Cc: [EMAIL PROTECTED] > Sent: Monday, April 19, 2004 9:50 AM > Subject: Re: [sqlite] row size limit > > > [EMAIL PROTECTED] wrote: >> According to the FAQ on sqlite.org, the row size is arbitrarily >> limited to 1MB, which can be increased to 16MB by changing a >> #define in the source code. >> >> My question is, why even limit the row size? Is there a way the >> code can modified so that there is no limit for the row size (other >> than the available disk/memory space)? >> > > The current file format allocates 24 bits for storing the number of > bytes in a particular row. (See > http://www.sqlite.org/fileformat.html > for details.) So the currect version of SQLite will never allow more > than 16MB in one row. The proposed version 3.0.0 of SQLite uses a > variable-length integer to store the row size and can thus accomodate > up to 2^64 bytes in a single row. In theory. > > But a limit of 1MB will probably still be enforced. Why is this? > > SQLite stores large rows by breaking the data up into 1K chunks and > storing each chunk in a separate page of the database. Filesystems > do much the same thing in an operating system. But filesystems have > an advantage over databases in that they can access the many chunks > that make up a file in parallel, whereas SQLite has to access them > one at a time. > > Suppose you have a 1MB row in SQLite and you want to read the whole > thing. SQLite must first ask for the 1st 1K page and wait for it to > be retrieved. Then it asks for the 2nd 1K page and waits for it. > And so forth for all 1000+ pages. If each page retrieval requires > 1 rotation of the disk, that's about 8.5 seconds on a 7200RPM disk > drive. > > The filesystem code inside the operating system can do the same task > in parallel. If you ask the operating system to read in all of a > 1MB file for you, it can request many separate blocks from the disk > controller at once. The blocks might arrive out of order, but the > OS can reassemble them into the correct order before returning the > result up to the user-space process. Using this approach, only a few > rotations of the disk platter would be required to retrieve a 1MB > file, instead of thousands. The retrival will be 100s of times > faster. > > The moral of the story: If you have large amounts of data you want > to store, it is best to store that data in a separate file and then > write the name of that file into the database. > -- > D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565 > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
On Apr 18, 2004, at 7:31 PM, Greg Obleshchuk wrote: Hi Richard, You know that is the first clear and concise explanation of why not to store large blobs in a database that I have heard anywhere. Indeed. But I wonder if most all databases do it the same way? Or do all file-based dbs do it the same way? etc. Nice explanation though. I am currently working on a web-based image gallery application powered by Oracle and am storing file names in the db while the images themselves are stored on the disk. Some said why don't I just stick the images in the db and I said no... the OS would be better and more flexible at managing the physical files. Now I have a more scientific sounding answer to back my assertion. ;-) Greg - Original Message - From: D. Richard Hipp Cc: [EMAIL PROTECTED] Sent: Monday, April 19, 2004 9:50 AM Subject: Re: [sqlite] row size limit [EMAIL PROTECTED] wrote: According to the FAQ on sqlite.org, the row size is arbitrarily limited to 1MB, which can be increased to 16MB by changing a #define in the source code. My question is, why even limit the row size? Is there a way the code can modified so that there is no limit for the row size (other than the available disk/memory space)? The current file format allocates 24 bits for storing the number of bytes in a particular row. (See http://www.sqlite.org/fileformat.html for details.) So the currect version of SQLite will never allow more than 16MB in one row. The proposed version 3.0.0 of SQLite uses a variable-length integer to store the row size and can thus accomodate up to 2^64 bytes in a single row. In theory. But a limit of 1MB will probably still be enforced. Why is this? SQLite stores large rows by breaking the data up into 1K chunks and storing each chunk in a separate page of the database. Filesystems do much the same thing in an operating system. But filesystems have an advantage over databases in that they can access the many chunks that make up a file in parallel, whereas SQLite has to access them one at a time. Suppose you have a 1MB row in SQLite and you want to read the whole thing. SQLite must first ask for the 1st 1K page and wait for it to be retrieved. Then it asks for the 2nd 1K page and waits for it. And so forth for all 1000+ pages. If each page retrieval requires 1 rotation of the disk, that's about 8.5 seconds on a 7200RPM disk drive. The filesystem code inside the operating system can do the same task in parallel. If you ask the operating system to read in all of a 1MB file for you, it can request many separate blocks from the disk controller at once. The blocks might arrive out of order, but the OS can reassemble them into the correct order before returning the result up to the user-space process. Using this approach, only a few rotations of the disk platter would be required to retrieve a 1MB file, instead of thousands. The retrival will be 100s of times faster. The moral of the story: If you have large amounts of data you want to store, it is best to store that data in a separate file and then write the name of that file into the database. -- D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
Hi Richard, You know that is the first clear and concise explanation of why not to store large blobs in a database that I have heard anywhere. Greg - Original Message - From: D. Richard Hipp Cc: [EMAIL PROTECTED] Sent: Monday, April 19, 2004 9:50 AM Subject: Re: [sqlite] row size limit [EMAIL PROTECTED] wrote: > According to the FAQ on sqlite.org, the row size is arbitrarily > limited to 1MB, which can be increased to 16MB by changing a > #define in the source code. > > My question is, why even limit the row size? Is there a way the > code can modified so that there is no limit for the row size (other > than the available disk/memory space)? > The current file format allocates 24 bits for storing the number of bytes in a particular row. (See http://www.sqlite.org/fileformat.html for details.) So the currect version of SQLite will never allow more than 16MB in one row. The proposed version 3.0.0 of SQLite uses a variable-length integer to store the row size and can thus accomodate up to 2^64 bytes in a single row. In theory. But a limit of 1MB will probably still be enforced. Why is this? SQLite stores large rows by breaking the data up into 1K chunks and storing each chunk in a separate page of the database. Filesystems do much the same thing in an operating system. But filesystems have an advantage over databases in that they can access the many chunks that make up a file in parallel, whereas SQLite has to access them one at a time. Suppose you have a 1MB row in SQLite and you want to read the whole thing. SQLite must first ask for the 1st 1K page and wait for it to be retrieved. Then it asks for the 2nd 1K page and waits for it. And so forth for all 1000+ pages. If each page retrieval requires 1 rotation of the disk, that's about 8.5 seconds on a 7200RPM disk drive. The filesystem code inside the operating system can do the same task in parallel. If you ask the operating system to read in all of a 1MB file for you, it can request many separate blocks from the disk controller at once. The blocks might arrive out of order, but the OS can reassemble them into the correct order before returning the result up to the user-space process. Using this approach, only a few rotations of the disk platter would be required to retrieve a 1MB file, instead of thousands. The retrival will be 100s of times faster. The moral of the story: If you have large amounts of data you want to store, it is best to store that data in a separate file and then write the name of that file into the database. -- D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [sqlite] row size limit
[EMAIL PROTECTED] wrote: According to the FAQ on sqlite.org, the row size is arbitrarily limited to 1MB, which can be increased to 16MB by changing a #define in the source code. My question is, why even limit the row size? Is there a way the code can modified so that there is no limit for the row size (other than the available disk/memory space)? The current file format allocates 24 bits for storing the number of bytes in a particular row. (See http://www.sqlite.org/fileformat.html for details.) So the currect version of SQLite will never allow more than 16MB in one row. The proposed version 3.0.0 of SQLite uses a variable-length integer to store the row size and can thus accomodate up to 2^64 bytes in a single row. In theory. But a limit of 1MB will probably still be enforced. Why is this? SQLite stores large rows by breaking the data up into 1K chunks and storing each chunk in a separate page of the database. Filesystems do much the same thing in an operating system. But filesystems have an advantage over databases in that they can access the many chunks that make up a file in parallel, whereas SQLite has to access them one at a time. Suppose you have a 1MB row in SQLite and you want to read the whole thing. SQLite must first ask for the 1st 1K page and wait for it to be retrieved. Then it asks for the 2nd 1K page and waits for it. And so forth for all 1000+ pages. If each page retrieval requires 1 rotation of the disk, that's about 8.5 seconds on a 7200RPM disk drive. The filesystem code inside the operating system can do the same task in parallel. If you ask the operating system to read in all of a 1MB file for you, it can request many separate blocks from the disk controller at once. The blocks might arrive out of order, but the OS can reassemble them into the correct order before returning the result up to the user-space process. Using this approach, only a few rotations of the disk platter would be required to retrieve a 1MB file, instead of thousands. The retrival will be 100s of times faster. The moral of the story: If you have large amounts of data you want to store, it is best to store that data in a separate file and then write the name of that file into the database. -- D. Richard Hipp -- [EMAIL PROTECTED] -- 704.948.4565 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]