On 11/27/07, Spiros Ioannou <[EMAIL PROTECTED]> wrote:

> I had a 135 MB, 1256132-line, '@'-separated text file containing
> various words and text fields (like a dictionary).
> Example record:
> [EMAIL PROTECTED]@[EMAIL PROTECTED],[EMAIL PROTECTED] altana : μικρός 
> ανθόκηπος - εξώστης,
> ταράτσα@@@@@@@@@@@

> I imported the data into sqlite 3.3.6, but when querying with the 'like'
> operator the performance is way too slow (about 1.5-2 seconds/query):
>
>  >time sqlite3 dicts.db "select * from table1 where word like 'asdf%';"
> 1.156u 0.491s 0:01.64 100.0%    0+0k 0+0io 0pf+0w

> FYI, using egrep takes only 0.14s to get results in the worst-case scenario:
>  >time egrep -i "[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@asdf" 
> meta.txt
> 0.077u 0.069s 0:00.14 92.8%     0+0k 0+0io 0pf+0w
>
> 1) I know egrep is not a DB, but does sqlite use such an inefficient
> search algorithm for content that cannot be indexed? Why not revert
> to simple 'grep-like' methods? Or am I missing something trivial here?

As a database, the file contains a LOT of structure.  SQLite must
follow the structure to locate the table, each record in the table,
and expand the text field from its stored format.  (The text itself is
not a big deal, but the row/column that stores it must be found and
extracted.)  The data is not necessarily stored end-to-end
sequentially in the file, as it's impossible to do that and still
maintain all the necessary properties of a structured database.
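
One way to see this from SQLite's side is EXPLAIN QUERY PLAN (a rough
sketch; the exact output wording varies between versions).  For a
pattern the optimizer cannot satisfy from an index, it reports a scan
of the whole table:

  sqlite3 dicts.db "EXPLAIN QUERY PLAN SELECT * FROM table1 WHERE word LIKE 'asdf%';"

Every candidate row has to be located in the table's b-tree and
decoded before the pattern match can even run against the text.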

egrep, on the other hand, gets to work with a flat text file, which
it can read sequentially, getting optimal performance out of the OS's
file buffering and cache management.  It only needs to read a chunk
of the file and scan it for the pattern, repeating until done.  The
only structure it has to be aware of is line breaks, and that is so
simple it can be folded into the pattern scan itself.

While someone would need to do profiling to see exactly where the
time goes, it would not surprise me to find that SQLite's LIKE
pattern matcher is more efficient than egrep's, but that the overhead
of dealing with structured data is responsible for the time
difference.  I don't find the timings themselves surprising at all.

> 2) Why doesn't an index raise performance at all in this case? Is it
> because non-latin chars are used?

Careful use of an index should help for the specific query you posted
(see http://sqlite.org/optoverview.html#like_opt), since the pattern
has a fixed prefix.  An index cannot speed up arbitrary patterns,
though; something like '%asdf%' will always force a full scan.
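
As a minimal sketch (assuming the word column is ordinary TEXT; the
index name is just an example): because LIKE is case-insensitive by
default, the index must use the built-in NOCASE collation before the
optimizer will consider it for a prefix pattern:

  sqlite3 dicts.db "CREATE INDEX word_nocase_idx ON table1(word COLLATE NOCASE);"
  sqlite3 dicts.db "SELECT * FROM table1 WHERE word LIKE 'asdf%';"

Alternatively, PRAGMA case_sensitive_like=ON lets an ordinary BINARY
index satisfy the same prefix query.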

If you need to perform arbitrary pattern searches on a flat text file,
SQLite (and most other structured storage for that matter) is simply
the wrong tool for the job.  grep and friends are highly optimized for
just that purpose.