Re: [sqlite] Unicode Confusion and Database Size
Hi Simon, On 4/9/2011 12:01 AM, Simon Slavin wrote: Have you tried speed tests on your platform ? It's hard to tell which will be faster because it depends on what language and OS you're using that interacts with SQLite. So if you have your schema designed and any part of your application written you could try doing lots of information shuffling and see if UTF-8 or UTF-16 is usefully faster than the other. Saving space for your database files may be interesting, but it's useful only if you're short of space, or need faster backups, or something else related. These days most work computers have tons of free space on their hard disks. No, we have not yet done speed tests actually. We are using SQLite3 from C/C++ on a Windows CE device... so, space is a premium (as is CPU performance actually). Some of the things we are doing are hitting 1.6 ~ 1.8GB in the databases. We're already using CEROD to compress the databases to bring down the disk requirement and that brings it down by around 35%. We're right now refactoring the database across all its uses and trying to bring down the space required while improving the performance... we've not decided what to do - just exploring ideas that we've collected in the past few weeks. Cheers, Mohit. 4/9/2011 | 11:20 PM. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unicode Confusion and Database Size
On 3 Sep 2011, at 4:27pm, Mohit Sindhwani wrote: > for our data, we can get savings in the region of 25% - 33% in the case of > strings being stored in a language that does require 3bytes/ character. So, > given that, we should explore UTF-16 in more detail. However, we also have a > lot of text that is only in English - so, it seems that we should go down the > path of separating the data in the two languages and use an ATTACH to bring > in the other language. That may be best for our needs. Have you tried speed tests on your platform ? It's hard to tell which will be faster because it depends on what language and OS you're using that interacts with SQLite. So if you have your schema designed and any part of your application written you could try doing lots of information shuffling and see if UTF-8 or UTF-16 is usefully faster than the other. Saving space for your database files may be interesting, but it's useful only if you're short of space, or need faster backups, or something else related. These days most work computers have tons of free space on their hard disks. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unicode Confusion and Database Size
Hi Igor, Thanks for your advice and guidance. On 1/9/2011 11:57 PM, Igor Tandetnik wrote: On 9/1/2011 10:24 AM, Mohit Sindhwani wrote: On the other hand, the other language that we are storing seems to require 3 bytes in UTF-8. Given that, it would appear that using UTF-8 would be a better idea since it will store more "efficiently". If you have lots of Chinese (or Japanese or Korean) text to store, then UTF-16 might be more compact. For these languages, one character takes three bytes in UTF-8 but only two in UTF-16. On the other hand, plain ASCII characters take one byte in UTF-8 but still two bytes in UTF-16. So if you have a mix of the two, the issue gets murky. I already have a database that has a couple of tables that are in UTF-8 - is there an easy way for me to build a database from this that is UTF-16? Using sqlite3 command line utility, run .dump command on the old database. Create a new database. Use "PRAGMA encoding" to set it to UTF-16. Run .import command on it using the dump file from the old one. I tried what you suggested and for our data, we can get savings in the region of 25% - 33% in the case of strings being stored in a language that does require 3bytes/ character. So, given that, we should explore UTF-16 in more detail. However, we also have a lot of text that is only in English - so, it seems that we should go down the path of separating the data in the two languages and use an ATTACH to bring in the other language. That may be best for our needs. Thanks for the tips so far! Best Regards, Mohit. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unicode Confusion and Database Size
Hi Igor, On 1/9/2011 11:57 PM, Igor Tandetnik wrote: On 9/1/2011 10:24 AM, Mohit Sindhwani wrote: I understand that the database could be either UTF-8 or UTF-16 - but that would apply to the full DB not to a single column, right? Right. *many useful answers snipped* Thank you very much!! Your answers are perfect to get me started. I have a couple of tables that are going to have the wider characters - it may be that we'll end up splitting the text and FTS tables for that language into a separate database and just attach it if we need it. But at least, I can try this out now. Thanks, Mohit. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unicode Confusion and Database Size
On 9/1/2011 10:24 AM, Mohit Sindhwani wrote: I understand that the database could be either UTF-8 or UTF-16 - but that would apply to the full DB not to a single column, right? Right. If that is the case, would it not make the database larger if we had a lot of content that was originally ASCII? UTF-8 is a superset of ASCII. Strings that consist entirely of 7-bit ASCII characters are represented exactly the same way in ASCII and in UTF-8. That's probably what you want to stick with. On the other hand, the other language that we are storing seems to require 3 bytes in UTF-8. Given that, it would appear that using UTF-8 would be a better idea since it will store more "efficiently". If you have lots of Chinese (or Japanese or Korean) text to store, then UTF-16 might be more compact. For these languages, one character takes three bytes in UTF-8 but only two in UTF-16. On the other hand, plain ASCII characters take one byte in UTF-8 but still two bytes in UTF-16. So if you have a mix of the two, the issue gets murky. In addition, there are a few other questions: - FTS would work fine on both UTF-8 and UTF-16 databases, right? I believe so, but I'm not very familiar with FTS. - Can we attach two databases that have different encodings? Yes. SQLite automatically converts between them as needed, in a transparent fashion. - When using Wide Strings in Windows CE, is one encoding more preferable over the other to minimize conversions? Native API in Windows uses UTF-16. You can request UTF-16 strings even from UTF-8 database - like I said, SQLite converts between them transparently. The cost of conversion is likely negligible compared to the other costs of maintaining a database. In fact, UTF-8 might win simply because it means less data to read from hard drive, even if it requires conversion. The only way to be sure is to test and measure. I already have a database that has a couple of tables that are in UTF-8 - is there an easy way for me to build a database from this that is UTF-16? Using sqlite3 command line utility, run .dump command on the old database. Create a new database. Use "PRAGMA encoding" to set it to UTF-16. Run .import command on it using the dump file from the old one. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users