On 11/27/07, Spiros Ioannou <[EMAIL PROTECTED]> wrote:
> I had a 135MB, 1256132 lines, '@' separated text file containing
> various words and text fields (like a dictionary).
> Example record:
> [EMAIL PROTECTED]@[EMAIL PROTECTED],[EMAIL PROTECTED] altana : μικρός
> ανθόκηπος - εξώστης, ταράτσα@@@@@@@@@@@
> I imported the data in sqlite3.3.6 but when querying with the 'like'
> operator, the performance way too slow (about 1.5-2 seconds/query):
>
> >time sqlite3 dicts.db "select * from table1 where word like 'asdf%';"
> 1.156u 0.491s 0:01.64 100.0% 0+0k 0+0io 0pf+0w
>
> FYI using egrep takes only 0.14s to get results in the worse case scenario:
>
> >time egrep -i "[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@asdf" meta.txt
> 0.077u 0.069s 0:00.14 92.8% 0+0k 0+0io 0pf+0w
>
> 1) I know egrep is not a DB but does sqlite use such an inefficient
> search algorithm for content that cannot be indexed? Why not reverting
> to simple 'grep-like' methods? Or am I missing something trivial here?

As a database, the file contains a LOT of structure. SQLite must follow that structure to locate the table, find each record in the table, and expand the text field from its stored format. (The text itself is not a big deal, but the row/column that stores it must be found and extracted.) The data is not necessarily stored end-to-end sequentially in the file, as it's impossible to do that and still maintain all the necessary properties of a structured database.

egrep gets to work with a flat text file, which it can read sequentially, getting optimal performance from the OS's file buffering/cache management. It only needs to read a piece of the file and scan it for patterns, repeating until done. The only structure it needs to be aware of is line breaks, and that is so simple it can be folded into the pattern scan itself.

While someone would need to do profiling to see exactly where the time goes, it would not surprise me to find that SQLite's LIKE pattern matcher is more efficient than egrep's, but that the overhead of dealing with structured data is responsible for the time difference. I don't find the time itself surprising at all.

> 2) Why doesn't an index raise performance at all in this case? Is it
> because non-latin chars are used?
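You can see the full-table-scan behavior directly with EXPLAIN QUERY PLAN. A minimal sketch using Python's bundled sqlite3 module, with a made-up two-column stand-in for the poster's table1 (the real dicts.db schema is assumed, not known):

```python
import sqlite3

# Tiny in-memory stand-in for the poster's table1; column names are
# assumptions based on the query in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (word TEXT, definition TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("altana", "small roof garden"), ("asdf", "placeholder")],
)

# With no usable index, SQLite must visit every row to evaluate LIKE.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM table1 WHERE word LIKE 'asdf%'"
).fetchall()
for row in plan:
    print(row[-1])  # the detail column reports a SCAN of table1
```

The "SCAN" in the plan detail is the structured-storage equivalent of what egrep does to the whole file, but with all the row/column bookkeeping on top.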
Careful use of an index should help the specific query you posted (see http://sqlite.org/optoverview.html#like_opt), but it's not possible for an index to speed up arbitrary patterns. If you need to perform arbitrary pattern searches, SQLite (and most other structured storage, for that matter) is simply the wrong tool for the job; grep and friends are highly optimized for exactly that purpose.
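To make the LIKE optimization on that page kick in for a prefix pattern, the index has to match LIKE's collation: since LIKE is case-insensitive by default, the index must use NOCASE. A sketch under the same assumed schema as above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (word TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [("asdf1",), ("asdfq",), ("zzz",)],
)

# Default LIKE is case-insensitive, so the optimizer will only rewrite
# word LIKE 'asdf%' into an index range scan if the index collates NOCASE.
# (Alternatively, PRAGMA case_sensitive_like=ON lets a plain index work.)
conn.execute("CREATE INDEX idx_word ON table1(word COLLATE NOCASE)")

detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM table1 WHERE word LIKE 'asdf%'"
).fetchone()[-1]
print(detail)  # a SEARCH using idx_word rather than a full SCAN
```

Note this only helps prefix patterns like 'asdf%'; a pattern with a leading wildcard still forces a full scan no matter what indexes exist.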