On 11/28/07, Spiros Ioannou <[EMAIL PROTECTED]> wrote:
> > egrep gets to work with a flat text file, which it can easily read
> > sequentially and get optimal performance from the OS's file
> > buffering/cache management. It only needs to read a piece of the file
> > and scan for patterns, repeating until done. The only structure it
> > needs to be aware of is line breaks, but that is so simple it can be
> > folded into the pattern scan itself.
>
> You are mostly describing fgrep. Egrep does a regular expression
> search, thus honouring column layout in the example I gave earlier.
> (FYI, fgrep is 50% faster than egrep in this query.)
I was describing grep and all related tools. The only relevant
difference between egrep and fgrep is the kind of pattern each looks
for; the searching method is the same. Egrep is not aware of column
structure, only lines.

> The points are two:
> 1) Could sqlite structure allow for faster full column scan queries?
> More than 1000% slower performance seems way too much difference.

The short answer is no. The longer answer is maybe in theory (see the
research into column-oriented or "vertical" datastores, such as the
much-publicized Vertica), but SQLite's usage scenarios don't really
make it appropriate. It would probably mean abandoning the single-file
database, for instance, and would make many other common queries much
less efficient. That's ignoring all sorts of other practical issues,
like drh's careful avoidance of patent-encumbered things. It's
possible there is room for optimizing this case a bit more the way it
is, but it will always be much slower than grep.

You may not realize what I mean about structured data. A plaintext
file looks like this:

LineLineLineLineLineLineLine...

An sqlite database, on the other hand, has lots of structure. It's
broken into fixed-size pages, each of which has a small header at the
front. Each table needs a way to find the rows that belong to it (an
index on the rowid field). For each row, it needs to be able to locate
each field of data belonging to that row. Each field has, besides the
data itself, a type code and a data length. All of this is encoded in
various space-efficient forms in the same file. So, for instance,
consider that a field of data is stored like:

TypeLengthData...

and you can see where this is going. When SQLite searches, it needs to
follow the structure to know where it is and what it's looking at. It
can't simply read the next chunk of data and do pattern matching on it
immediately, like grep can. (There's a rough sketch of what that
decoding looks like in a P.S. at the end of this message.)

> 2) an index could (and should) be used when using LIKE 'asdf'
>
> Well I actually solved this problem by using
> PRAGMA case_sensitive_like = 1;
> Now the index is being used (for all non-English characters).

Which is what you wanted, right?

> But the online manual states:
>
> "
> if case_sensitive_like mode is enabled then the column must use the
> default BINARY collating sequence,
> if case_sensitive_like mode is disabled then the column must use the
> built-in NOCASE collating sequence.
>
> ....
> NOCASE - The same as binary, except the 26 upper case characters used
> by the English language are folded to their lower case equivalents
> before the comparison is performed.
> "
>
> so as I understand it there's no reason that the index was not used in
> the NOCASE scenario. It should do case-insensitive matches for the 26
> latin characters and case-sensitive (binary) for all the others.

That part can be a little hard to process the first few times. You can
declare a collation when creating the index itself, either BINARY or
NOCASE. The default is BINARY, which is case sensitive (it does not
assume US ASCII or do anything with case). By default, LIKE is case
insensitive, which means it has to fold case so that it can treat 'A'
and 'a' as the same character. Therefore it can only use a NOCASE
index for optimizing prefix matches, because the only way for an index
to work is if it treats the characters the same way the comparison
does. NOCASE is not the index default, though, and it isn't what you
wanted anyway, because you have non-English data. By turning
case_sensitive_like on, you made LIKE look for an index with BINARY
collation instead.
That matches the index you created, so it can optimize that case now.
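If you want to watch the planner change its mind, here is a minimal
sketch using the C API. The table and index names are made up for the
example, and it assumes your build is recent enough to have EXPLAIN
QUERY PLAN; the exact columns it returns vary between versions, but
the last one always describes the strategy. Compile with -lsqlite3:

#include <stdio.h>
#include <sqlite3.h>

/* Print the last column of each row, which holds the plan description. */
static int show(void *label, int n, char **val, char **col) {
    printf("%s: %s\n", (const char *)label, val[n - 1]);
    return 0;
}

int main(void) {
    sqlite3 *db;
    sqlite3_open(":memory:", &db);
    sqlite3_exec(db, "CREATE TABLE t(name TEXT);"
                     "CREATE INDEX t_name ON t(name);", /* BINARY by default */
                 0, 0, 0);

    /* LIKE is case insensitive here, so the BINARY index is useless:
       expect a full table scan. */
    sqlite3_exec(db, "EXPLAIN QUERY PLAN SELECT * FROM t"
                     " WHERE name LIKE 'asdf%'", show, "before", 0);

    /* Now LIKE wants BINARY collation, which matches the index:
       expect an index search. */
    sqlite3_exec(db, "PRAGMA case_sensitive_like=1", 0, 0, 0);
    sqlite3_exec(db, "EXPLAIN QUERY PLAN SELECT * FROM t"
                     " WHERE name LIKE 'asdf%'", show, "after", 0);

    sqlite3_close(db);
    return 0;
}

Before the pragma you should see a full scan of t; after it, the plan
should mention the index.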
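P.S. Back to point 1 for a moment, here is a rough sketch of what
"following the structure" means in code. This is not SQLite's actual
source, just a simplified illustration of the record format (a varint
header, then one serial type per field, then the data); btree page
traversal, overflow pages and error handling are all ignored.

#include <stdint.h>

/* Decode one variable-length integer: 1-9 bytes, big-endian, 7 bits
   per byte, a set high bit meaning "more bytes follow"; a 9th byte,
   if present, contributes all 8 of its bits. */
static const uint8_t *get_varint(const uint8_t *p, uint64_t *out) {
    uint64_t v = 0;
    int i;
    for (i = 0; i < 8; i++) {
        v = (v << 7) | (p[i] & 0x7f);
        if ((p[i] & 0x80) == 0) { *out = v; return p + i + 1; }
    }
    *out = (v << 8) | p[8];
    return p + 9;
}

/* Data size implied by a serial type: types 0-11 are fixed-size
   (NULL, integers of various widths, float, the constants 0 and 1);
   12 and up encode the length of a BLOB (even) or TEXT (odd). */
static uint64_t payload_size(uint64_t type) {
    static const uint8_t fixed[12] = {0, 1, 2, 3, 4, 6, 8, 8, 0, 0, 0, 0};
    return type < 12 ? fixed[type] : (type - 12) / 2;
}

/* Return a pointer to field number n (0-based) of one record, setting
   *size to its length.  Assumes n is a valid field number.  Note how
   much decoding happens before we can even look at the bytes we
   actually want to match. */
static const uint8_t *find_field(const uint8_t *rec, int n,
                                 uint64_t *size) {
    uint64_t hdr_size, type;
    const uint8_t *hdr = get_varint(rec, &hdr_size);
    const uint8_t *data = rec + hdr_size;  /* data area follows header */
    int i;
    for (i = 0; ; i++) {
        hdr = get_varint(hdr, &type);      /* one serial type per field */
        *size = payload_size(type);
        if (i == n) return data;
        data += *size;                     /* skip earlier fields' data */
    }
}

And that is for a single row that is already in memory; grep, by
contrast, just slides its matcher over raw bytes as fast as it can
read them.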