Re: [sqlite] Unicode Confusion and Database Size

2011-09-04 Thread Mohit Sindhwani

Hi Simon,

On 4/9/2011 12:01 AM, Simon Slavin wrote:
Have you tried speed tests on your platform ? It's hard to tell which 
will be faster because it depends on what language and OS you're using 
that interacts with SQLite. So if you have your schema designed and 
any part of your application written you could try doing lots of 
information shuffling and see if UTF-8 or UTF-16 is usefully faster 
than the other. Saving space for your database files may be 
interesting, but it's useful only if you're short of space, or need 
faster backups, or something else related. These days most work 
computers have tons of free space on their hard disks.


No, we have not yet done speed tests actually.  We are using SQLite3 
from C/C++ on a Windows CE device... so, space is a premium (as is CPU 
performance actually).  Some of the things we are doing are hitting 1.6 
~ 1.8GB in the databases.  We're already using CEROD to compress the 
databases to bring down the disk requirement and that brings it down by 
around 35%.


We're right now refactoring the database across all its uses and trying 
to bring down the space required while improving the performance... 
we've not decided what to do - just exploring ideas that we've collected 
in the past few weeks.


Cheers,
Mohit.
4/9/2011 | 11:20 PM.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unicode Confusion and Database Size

2011-09-03 Thread Simon Slavin

On 3 Sep 2011, at 4:27pm, Mohit Sindhwani wrote:

> for our data, we can get savings in the region of 25% - 33% in the case of 
> strings being stored in a language that does require 3bytes/ character.  So, 
> given that, we should explore UTF-16 in more detail.  However, we also have a 
> lot of text that is only in English - so, it seems that we should go down the 
> path of separating the data in the two languages and use an ATTACH to bring 
> in the other language.  That may be best for our needs.

Have you tried speed tests on your platform ?  It's hard to tell which will be 
faster because it depends on what language and OS you're using that interacts 
with SQLite.  So if you have your schema designed and any part of your 
application written you could try doing lots of information shuffling and see 
if UTF-8 or UTF-16 is usefully faster than the other.

Saving space for your database files may be interesting, but it's useful only 
if you're short of space, or need faster backups, or something else related.  
These days most work computers have tons of free space on their hard disks.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unicode Confusion and Database Size

2011-09-03 Thread Mohit Sindhwani

Hi Igor,

Thanks for your advice and guidance.

On 1/9/2011 11:57 PM, Igor Tandetnik wrote:

On 9/1/2011 10:24 AM, Mohit Sindhwani wrote:


On the other hand, the other language that we are storing seems to
require 3 bytes in UTF-8. Given that, it would appear that using UTF-8
would be a better idea since it will store more "efficiently".


If you have lots of Chinese (or Japanese or Korean) text to store, 
then UTF-16 might be more compact. For these languages, one character 
takes three bytes in UTF-8 but only two in UTF-16. On the other hand, 
plain ASCII characters take one byte in UTF-8 but still two bytes in 
UTF-16. So if you have a mix of the two, the issue gets murky.






I already have a database that has a couple of tables that are in UTF-8
- is there an easy way for me to build a database from this that is 
UTF-16?


Using sqlite3 command line utility, run .dump command on the old 
database. Create a new database. Use "PRAGMA encoding" to set it to 
UTF-16. Run .import command on it using the dump file from the old one.


I tried what you suggested and for our data, we can get savings in the 
region of 25% - 33% in the case of strings being stored in a language 
that does require 3bytes/ character.  So, given that, we should explore 
UTF-16 in more detail.  However, we also have a lot of text that is only 
in English - so, it seems that we should go down the path of separating 
the data in the two languages and use an ATTACH to bring in the other 
language.  That may be best for our needs.


Thanks for the tips so far!

Best Regards,
Mohit.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unicode Confusion and Database Size

2011-09-01 Thread Mohit Sindhwani

Hi Igor,

On 1/9/2011 11:57 PM, Igor Tandetnik wrote:

On 9/1/2011 10:24 AM, Mohit Sindhwani wrote:

I understand that the database could be either UTF-8 or UTF-16 - but
that would apply to the full DB not to a single column, right?


Right.



*many useful answers snipped*

Thank you very much!! Your answers are perfect to get me started.  I 
have a couple of tables that are going to have the wider characters - it 
may be that we'll end up splitting the text and FTS tables for that 
language into a separate database and just attach it if we need it.


But at least, I can try this out now.

Thanks,
Mohit.


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Unicode Confusion and Database Size

2011-09-01 Thread Igor Tandetnik

On 9/1/2011 10:24 AM, Mohit Sindhwani wrote:

I understand that the database could be either UTF-8 or UTF-16 - but
that would apply to the full DB not to a single column, right?


Right.


If that
is the case, would it not make the database larger if we had a lot of
content that was originally ASCII?


UTF-8 is a superset of ASCII. Strings that consist entirely of 7-bit 
ASCII characters are represented exactly the same way in ASCII and in 
UTF-8. That's probably what you want to stick with.



On the other hand, the other language that we are storing seems to
require 3 bytes in UTF-8. Given that, it would appear that using UTF-8
would be a better idea since it will store more "efficiently".


If you have lots of Chinese (or Japanese or Korean) text to store, then 
UTF-16 might be more compact. For these languages, one character takes 
three bytes in UTF-8 but only two in UTF-16. On the other hand, plain 
ASCII characters take one byte in UTF-8 but still two bytes in UTF-16. 
So if you have a mix of the two, the issue gets murky.



In addition, there are a few other questions:
- FTS would work fine on both UTF-8 and UTF-16 databases, right?


I believe so, but I'm not very familiar with FTS.


- Can we attach two databases that have different encodings?


Yes. SQLite automatically converts between them as needed, in a 
transparent fashion.



- When using Wide Strings in Windows CE, is one encoding more preferable
over the other to minimize conversions?


Native API in Windows uses UTF-16. You can request UTF-16 strings even 
from UTF-8 database - like I said, SQLite converts between them 
transparently. The cost of conversion is likely negligible compared to 
the other costs of maintaining a database. In fact, UTF-8 might win 
simply because it means less data to read from hard drive, even if it 
requires conversion. The only way to be sure is to test and measure.



I already have a database that has a couple of tables that are in UTF-8
- is there an easy way for me to build a database from this that is UTF-16?


Using sqlite3 command line utility, run .dump command on the old 
database. Create a new database. Use "PRAGMA encoding" to set it to 
UTF-16. Run .import command on it using the dump file from the old one.

--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users