On 11/28/07, Spiros Ioannou <[EMAIL PROTECTED]> wrote:

> > egrep gets to work with a flat text file, which it can easily read
> > sequentially and get optimal performance from the OS's file
> > buffering/cache management.  It only needs to read a piece of the file
> > and scan for patterns, repeating until done.  The only structure it
> > needs to be aware of is line breaks, but that is so simple it can be
> > folded into the pattern scan itself.
>
> You are mostly describing fgrep. Egrep does a regular expression search,
> thus honouring column layout in the example I gave earlier. (FYI, fgrep
> is 50% faster than egrep in this query.)

I was describing grep and all related tools. The only relevant
difference between egrep and fgrep is what kind of pattern it looks
for; the searching method is the same. Egrep is not aware of column
structure, only lines.

> The points are two:
> 1) Could the sqlite structure allow for faster full column scan queries?
>     More than 1000% slower performance seems like far too big a difference.

The short answer is no.

The longer answer is maybe in theory (see research into
column-oriented or "vertical" datastores, such as the much-publicized
Vertica), but SQLite's usage scenarios don't really make it
appropriate. It would probably mean abandoning the single-file
database, for instance, and would make many other common queries much
less efficient. That's ignoring all sorts of other practical issues,
like drh's careful avoidance of patent-encumbered things.

It's possible there is room for optimizing this case a bit more the
way it is, but it will always be much slower than grep.

You may not realize what I mean by structured data. A plaintext
file looks like this:

    LineLineLineLineLineLineLine...

An SQLite database, on the other hand, has lots of structure. It's
broken into fixed-size pages, each of which has a small header at the
front. Each table needs a way to find the rows that belong to it (an
index on the rowid field). For each row, it needs to be able to
locate each field of data belonging to that row. Each field has,
besides the data itself, a type code and a data length. All of this
is encoded in various space-efficient forms in the same file.

So, for instance, consider that a field of data is stored like:

   TypeLengthData...

and you can see where this is going.

When SQLite searches it, it needs to follow the structure to know
where it is and what it's looking at. It can't simply read the next
chunk of data and do pattern matching on it immediately, like grep
can.
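
You can even see that per-field type code from SQL. A quick sketch,
using a made-up one-column table:

    -- SQLite stores a type code and length with every value, so each
    -- row of the same column can hold a different type.
    CREATE TABLE t(x);
    INSERT INTO t VALUES (1);
    INSERT INTO t VALUES (1.5);
    INSERT INTO t VALUES ('one');
    INSERT INTO t VALUES (x'0102');
    INSERT INTO t VALUES (NULL);
    SELECT x, typeof(x) FROM t;
    -- typeof() reports integer, real, text, blob, and null in turn,
    -- read straight from the type code stored in each record.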


> 2) an index could (and should) be used when using LIKE 'asdf'
>
> Well, I actually solved this problem by using
> PRAGMA case_sensitive_like = 1;
> Now the index is being used (for all non-English characters).

Which is what you wanted, right?
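
For anyone else hitting this: a minimal sketch of that fix, with
made-up table and column names. EXPLAIN QUERY PLAN shows whether the
prefix match was turned into an index search:

    -- Hypothetical schema; the index gets the default BINARY collation.
    CREATE TABLE words(w TEXT);
    CREATE INDEX words_w ON words(w);

    PRAGMA case_sensitive_like = 1;

    -- With case_sensitive_like on, a prefix pattern can be rewritten
    -- into a range search on the BINARY index instead of a full scan:
    EXPLAIN QUERY PLAN
    SELECT * FROM words WHERE w LIKE 'asdf%';

If the plan output mentions the index rather than a table scan, the
optimization kicked in.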

> But the online manual states:
>
> "
> if case_sensitive_like mode is enabled then the column must use the
> default BINARY collating sequence,
>   if case_sensitive_like mode is disabled then the column must use the b
> uilt-in NOCASE collating sequence.
>
>   ....
>   NOCASE - The same as binary, except the 26 upper case characters used
>   by the English language are folded to their lower case equivalents
> before the comparison is performed.
> "
>
> so as I understand it there's no reason that the index was not used in
> the NOCASE scenario. It should do case-insensitive matches for the 26
> Latin characters and case-sensitive (binary) matches for all the others.

It might be a little hard to process that part the first few times.
You can declare a collation when creating the index itself, either
BINARY or NOCASE. The default is BINARY, which is case sensitive (it
compares raw bytes and does not assume US ASCII or do anything
special with case).

By default, LIKE is case insensitive, which means it has to pay
attention to case so that it can treat 'A' as 'a'. Therefore it can
only use a NOCASE index to optimize prefix matches, because an index
only helps if it treats characters the same way the comparison does.
NOCASE is not the index default, though, and it isn't what you wanted
anyway, since your data is non-English.

By turning case_sensitive_like on, you made LIKE look for an index
with BINARY collation instead. That matches the index you created, so
it can optimize that case now.
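
For contrast, the other pairing (same made-up schema as above): if
you did want case-insensitive prefix matching on plain ASCII text,
you would collate the index NOCASE and leave LIKE at its default:

    -- A NOCASE index pairs with the default, case-insensitive LIKE.
    CREATE INDEX words_w_nocase ON words(w COLLATE NOCASE);

    PRAGMA case_sensitive_like = 0;   -- the default

    -- 'asdf%', 'ASDF%', etc. can now all use the NOCASE index, but
    -- only the 26 ASCII letters are folded, as the docs quoted above say.
    EXPLAIN QUERY PLAN
    SELECT * FROM words WHERE w LIKE 'asdf%';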
