subject:"Searching for lots of similar strings \(filenames\) in sqlite3 database"

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-02 Thread Adam Funk

On 2014-07-02, Chris Angelico wrote:

> On Wed, Jul 2, 2014 at 7:32 PM, Adam Funk  wrote:
>> Well, I've changed it to the following anyway.
>>
>> subdir_glob = subdir + '/*'
>> cursor.execute('SELECT filename FROM files WHERE filename GLOB ?',
>>(subdir_glob,))
>> rows = cursor.fetchall()
>> known_files = {row[0] for row in rows}
>>
>> I see what you mean about paths containing '%', but I don't see why
>> you were concerned about underscores, though.
>
> With GLOB, presumably ? matches a single character and * matches any
> number of characters. With LIKE, _ matches a single character and %
> matches any number. So, for instance, WHERE filename LIKE
> '/foo/bar/spam_spam/%' will match '/foo/bar/spam2spam/1234', which may
> be a little surprising. It's not going to be a serious problem in most
> cases, as it'll also match '/foo/bar/spam_spam/1234', but the false
> positives will make one of those "Huh" moments if you don't
> keep an eye on your magic characters.
>
> In your specific case, you happen to be safe, but as I look over the
> code, my paranoia kicks in and tells me to check :) It's just one of
> those things that flags itself to the mind - anything that might help
> catch bugs early is a good feature of the mind, in my opinion!

Oh, I'd just missed the '_' in the LIKE documentation.  Doh!


-- 
Indentation is for enemy skulls, not code!
--- Klingon Programmer's Guide
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-02 Thread Chris Angelico

On Wed, Jul 2, 2014 at 7:32 PM, Adam Funk  wrote:
> Well, I've changed it to the following anyway.
>
> subdir_glob = subdir + '/*'
> cursor.execute('SELECT filename FROM files WHERE filename GLOB ?',
>(subdir_glob,))
> rows = cursor.fetchall()
> known_files = {row[0] for row in rows}
>
> I see what you mean about paths containing '%', but I don't see why
> you were concerned about underscores, though.

With GLOB, presumably ? matches a single character and * matches any
number of characters. With LIKE, _ matches a single character and %
matches any number. So, for instance, WHERE filename LIKE
'/foo/bar/spam_spam/%' will match '/foo/bar/spam2spam/1234', which may
be a little surprising. It's not going to be a serious problem in most
cases, as it'll also match '/foo/bar/spam_spam/1234', but the false
positives will make one of those "Huh" moments if you don't
keep an eye on your magic characters.

In your specific case, you happen to be safe, but as I look over the
code, my paranoia kicks in and tells me to check :) It's just one of
those things that flags itself to the mind - anything that might help
catch bugs early is a good feature of the mind, in my opinion!

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-02 Thread Adam Funk

On 2014-07-01, Chris Angelico wrote:

> On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk  wrote:
>> On 2014-07-01, Chris Angelico wrote:

>>> There is one critical consideration, though. What happens if the
>>> directory name contains an underscore or percent sign? Or can you
>>> absolutely guarantee that they won't? You may need to escape them, and
>>> I'm not sure how SQLite handles that. (Possibly \_ will match literal
>>> _, and \\ will match literal \, or something like that.)
>>
>> I can guarantee that the directory names are all
>> '/var/spool/news/message.id/' then 3 digits.  (The filenames are
>> pretty wild, since they are MIDs.)  AIUI, using the '?' substitution
>> in the sqlite3 library is supposed to be safe.
>
> This is nothing to do with question-mark substitution. There are two
> separate levels of character significance here - it's like a quoted
> string with a regex. Suppose you want to make a regex that searches
> for an apostrophe. If you try to define that in a single-quoted
> string, you need to escape it:
>
> regex = '^\'$'
>
> However, if you ask the user to enter a regex, that wouldn't be necessary:
>
> regex = input("Enter a pattern: ") # raw_input in Python 2
> Enter a pattern: ^'$
>
> This is what the question mark substitution is like - it avoids the
> need to carefully manage string delimiters and so on. However, if you
> want to make a regex that searches for a backslash, then you need to
> escape it, because the backslash is important to the regex itself. In
> the same way, the underscore and percent sign are significant to the
> LIKE operator. If it were possible to have a directory name with a
> percent sign in it, it would match far too much - because you'd
> construct a LIKE pattern something like (ahem)
> "/var/spool/news/message%20id/142/%" - and as you can see, the percent
> sign at the end is no different from the percent sign in the middle.
>
> But you're safe because you know your data, unrelated to your
> substitution method. Possibly merits a comment... but possibly not
> worth it.

Well, I've changed it to the following anyway.

subdir_glob = subdir + '/*'
cursor.execute('SELECT filename FROM files WHERE filename GLOB ?',
   (subdir_glob,))
rows = cursor.fetchall()
known_files = {row[0] for row in rows}

I see what you mean about paths containing '%', but I don't see why
you were concerned about underscores, though.


-- 
You know, there are many people in the country today who, through no
fault of their own, are sane. Some of them were born sane. Some of
them became sane later in their lives.--― Graham Chapman
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-02 Thread Adam Funk

On 2014-07-01, Chris Angelico wrote:

> On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk  wrote:
>> On 2014-07-01, Chris Angelico wrote:
>>
>>> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk  wrote:
 cursor.execute('SELECT filename FROM files WHERE filename IS ?', 
 (filename,))
>>>
>>> Shouldn't this be an equality check rather than IS, which normally I'd
>>> expect to be "IS NULL" or "IS NOT NULL"?
>>
>> Oh, it probably should be in "heavy" SQL.  In SQLite, '==', '=', &
>> 'IS' are interchangeable.
>>
>> http://www.tutorialspoint.com/sqlite/sqlite_operators.htm
>
> Ah, okay. In that case, I'd advise going with either == for
> consistency with the rest of Python, or (preferably) = for consistency
> with other SQL engines. You wouldn't use "is" to test if two Python
> strings are equal, so there's no particular reason to use it here :)

I agree.

>> Oh, even better:
>>
>> add_files = listing - known_files
>> delete_files = known_files - listing
>>
>> and then I can remove files that have disappeared off the spool from
>> the table.  Thanks very much!
>
> Ah! Didn't know that was a valuable feature for you, but getting that
> "for free" is an extra little bonus, so that's awesome!

I didn't know it was a valuable feature until I saw that easy way to
do it!  This will keep the database from growing indefinitely, though.



-- 
'...and Tom [Snyder] turns to him and says, "so Alice [Cooper], is it
true you kill chickens on stage?"  That was the opening question, and
Alice looks at him real serious and goes, "Oh no, no no.  That's
Colonel Sanders.  Colonel Sanders kills chickens."'
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-01 Thread Chris Angelico

On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk  wrote:
> On 2014-07-01, Chris Angelico wrote:
>
>> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk  wrote:
>>> cursor.execute('SELECT filename FROM files WHERE filename IS ?', 
>>> (filename,))
>>
>> Shouldn't this be an equality check rather than IS, which normally I'd
>> expect to be "IS NULL" or "IS NOT NULL"?
>
> Oh, it probably should be in "heavy" SQL.  In SQLite, '==', '=', &
> 'IS' are interchangeable.
>
> http://www.tutorialspoint.com/sqlite/sqlite_operators.htm

Ah, okay. In that case, I'd advise going with either == for
consistency with the rest of Python, or (preferably) = for consistency
with other SQL engines. You wouldn't use "is" to test if two Python
strings are equal, so there's no particular reason to use it here :)

> Oh, even better:
>
> add_files = listing - known_files
> delete_files = known_files - listing
>
> and then I can remove files that have disappeared off the spool from
> the table.  Thanks very much!

Ah! Didn't know that was a valuable feature for you, but getting that
"for free" is an extra little bonus, so that's awesome!

>> There is one critical consideration, though. What happens if the
>> directory name contains an underscore or percent sign? Or can you
>> absolutely guarantee that they won't? You may need to escape them, and
>> I'm not sure how SQLite handles that. (Possibly \_ will match literal
>> _, and \\ will match literal \, or something like that.)
>
> I can guarantee that the directory names are all
> '/var/spool/news/message.id/' then 3 digits.  (The filenames are
> pretty wild, since they are MIDs.)  AIUI, using the '?' substitution
> in the sqlite3 library is supposed to be safe.

This is nothing to do with question-mark substitution. There are two
separate levels of character significance here - it's like a quoted
string with a regex. Suppose you want to make a regex that searches
for an apostrophe. If you try to define that in a single-quoted
string, you need to escape it:

regex = '^\'$'

However, if you ask the user to enter a regex, that wouldn't be necessary:

regex = input("Enter a pattern: ") # raw_input in Python 2
Enter a pattern: ^'$

This is what the question mark substitution is like - it avoids the
need to carefully manage string delimiters and so on. However, if you
want to make a regex that searches for a backslash, then you need to
escape it, because the backslash is important to the regex itself. In
the same way, the underscore and percent sign are significant to the
LIKE operator. If it were possible to have a directory name with a
percent sign in it, it would match far too much - because you'd
construct a LIKE pattern something like (ahem)
"/var/spool/news/message%20id/142/%" - and as you can see, the percent
sign at the end is no different from the percent sign in the middle.

But you're safe because you know your data, unrelated to your
substitution method. Possibly merits a comment... but possibly not
worth it.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-01 Thread Adam Funk

On 2014-07-01, Chris Angelico wrote:

> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk  wrote:
>> cursor.execute('SELECT filename FROM files WHERE filename IS ?', 
>> (filename,))
>
> Shouldn't this be an equality check rather than IS, which normally I'd
> expect to be "IS NULL" or "IS NOT NULL"?

Oh, it probably should be in "heavy" SQL.  In SQLite, '==', '=', &
'IS' are interchangeable.

http://www.tutorialspoint.com/sqlite/sqlite_operators.htm

Looking at that page again, I see that 'GLOB' is a case-sensitive
version of 'LIKE'.  I can't help but wonder if that makes it faster.
;-)


> As to your actual question: Your two database lookups are doing
> distinctly different things, so there's no surprise that they perform
> very differently. B asks the database "Do you have this? Do you have
> this?" for every file you have, and C asks the database "What do you
> have?", and then comparing that against the list of files. By the way
> - the A+C technique could be done quite tidily as a set difference:
>
> # assume you have listing1 and cursor set up
> # as per your above code
> listing = {os.path.join(directory, x) for x in listing1}
> cursor.execute(...) # as per above
> known_files = {row[0] for row in cursor} # cursors are iterable
> needed_files = listing - known_files
> cursor.executemany('INSERT INTO files VALUES (?, ?)', ((filename,
> 0) for filename in needed_files))

Oh, even better:

add_files = listing - known_files
delete_files = known_files - listing

and then I can remove files that have disappeared off the spool from
the table.  Thanks very much!


> Anyway. The significant thing is the performance of the database on
> two different workloads: either "give me everything that matches this
> pattern" (where the pattern ends with a percent sign), or "do you have
> this? do you have this? do you have this?". Generally, database
> indexing is fairly efficient at handling prefix searches, so the first
> query will basically amount to an index search, which is a lot faster
> than the repeated separate searching; it takes advantage of the fact
> that all the strings you're looking at will have the same prefix.
>
> There is one critical consideration, though. What happens if the
> directory name contains an underscore or percent sign? Or can you
> absolutely guarantee that they won't? You may need to escape them, and
> I'm not sure how SQLite handles that. (Possibly \_ will match literal
> _, and \\ will match literal \, or something like that.)

I can guarantee that the directory names are all
'/var/spool/news/message.id/' then 3 digits.  (The filenames are
pretty wild, since they are MIDs.)  AIUI, using the '?' substitution
in the sqlite3 library is supposed to be safe.

> This is not bypassing the database's optimization; in fact, it's
> working tidily within it. 

That's reassuring!

...
> But doing the set difference in Python is just as good a way of doing the job.

I like it.  Thanks very much.


-- 
Specifications are for the weak & timid!
  --- Klingon Programmer's Guide
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-01 Thread Chris Angelico

On Tue, Jul 1, 2014 at 10:13 PM, MRAB  wrote:
> Anyway, I'm sure there's something in SQL for "insert or update" or "on
> duplicate", but that's an SQL question, not a Python question.

Not in standard SQL, no; there might be in SQLite, as a non-standard
extension, but it's a fundamentally hard problem and it has issues.
Frankly, though, I doubt the time cost of set operations is anything
significant compared to the various queries against the database.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-01 Thread MRAB


On 2014-07-01 12:26, Adam Funk wrote:

I have some code that reads files in a leafnode2 news spool & needs to
check for new files periodically.  The full paths are all like
'/var/spool/news/message.id/345/<123...@example.com>' with a 3-digit
subdirectory & a Message-ID for the filename itself.  I'm using Python
3 & sqlite3 in the standard library.

I have a table of filenames created with the following command:

cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used 
INTEGER)')

To check for new files in one of the subdirectories, I run A then
either B or C below (I've tried both).

A.
 listing1 = os.listdir(directory)
 listing [os.path.join(directory, x) for x in listing1]

B.
 cursor = db_conn.cursor()
 for filename in listing:
 cursor.execute('SELECT filename FROM files WHERE filename IS ?', 
(filename,))
 row = cursor.fetchone()
 if not row:
 cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) )
 files_new += 1
 db_conn.commit()

C.
 cursor = db_conn.cursor()
 subdir_like = directory + '/%'
 cursor.execute('SELECT filename FROM files WHERE filename LIKE ?', 
(subdir_like,))
 rows = cursor.fetchall()
 known_files =  [row[0] for row in rows]
 for filename in listing:
 if filename not in known_files:
 cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) )
 files_new += 1
 db_conn.commit()

A+B was the first method I came up with, because it looks like the
"keep it simple & let the database do its job" approach, but it was
very time-consuming, so I tested A+C out.  A is quick (a second); B
can drag on for over an hour to check 2000 filenames (for example) in
a subdirectory; C always takes less than a minute.  So C is much
better than B, but it looks (to me) like one of those attempts to
bypass & ignore the database's built-in optimizations.

Comments?


In C, 'known_files' is a list, so it performs a linear search for each
of the filenames. If you make 'known_files' a set, it'll probably be
even faster!

Anyway, I'm sure there's something in SQL for "insert or update" or "on
duplicate", but that's an SQL question, not a Python question.

--
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-01 Thread Chris Angelico

On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk  wrote:
> cursor.execute('SELECT filename FROM files WHERE filename IS ?', 
> (filename,))

Shouldn't this be an equality check rather than IS, which normally I'd
expect to be "IS NULL" or "IS NOT NULL"?

As to your actual question: Your two database lookups are doing
distinctly different things, so there's no surprise that they perform
very differently. B asks the database "Do you have this? Do you have
this?" for every file you have, and C asks the database "What do you
have?", and then comparing that against the list of files. By the way
- the A+C technique could be done quite tidily as a set difference:

# assume you have listing1 and cursor set up
# as per your above code
listing = {os.path.join(directory, x) for x in listing1}
cursor.execute(...) # as per above
known_files = {row[0] for row in cursor} # cursors are iterable
needed_files = listing - known_files
cursor.executemany('INSERT INTO files VALUES (?, ?)', ((filename,
0) for filename in needed_files))

Anyway. The significant thing is the performance of the database on
two different workloads: either "give me everything that matches this
pattern" (where the pattern ends with a percent sign), or "do you have
this? do you have this? do you have this?". Generally, database
indexing is fairly efficient at handling prefix searches, so the first
query will basically amount to an index search, which is a lot faster
than the repeated separate searching; it takes advantage of the fact
that all the strings you're looking at will have the same prefix.

There is one critical consideration, though. What happens if the
directory name contains an underscore or percent sign? Or can you
absolutely guarantee that they won't? You may need to escape them, and
I'm not sure how SQLite handles that. (Possibly \_ will match literal
_, and \\ will match literal \, or something like that.)

This is not bypassing the database's optimization; in fact, it's
working tidily within it. If you want to express your logic in a way
that lets the database truly be in command of optimization, it would
look something like this:

INSERT INTO files SELECT filename,0 FROM
(VALUES('foo'),('bar'),('quux'),('spam')) AS needed(filename) EXCEPT
SELECT filename,0 FROM files

And you'd need to check the SQLite docs for how to define a table that
consists of literal values. (The above syntax works in PostgreSQL; I'm
pretty sure it's not all standard SQL.)

But doing the set difference in Python is just as good a way of doing the job.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Searching for lots of similar strings (filenames) in sqlite3 database

2014-07-01 Thread Adam Funk

I have some code that reads files in a leafnode2 news spool & needs to
check for new files periodically.  The full paths are all like
'/var/spool/news/message.id/345/<123...@example.com>' with a 3-digit
subdirectory & a Message-ID for the filename itself.  I'm using Python
3 & sqlite3 in the standard library.

I have a table of filenames created with the following command:

   cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used 
INTEGER)')

To check for new files in one of the subdirectories, I run A then
either B or C below (I've tried both).

A.
listing1 = os.listdir(directory)
listing [os.path.join(directory, x) for x in listing1]

B.
cursor = db_conn.cursor()
for filename in listing:
cursor.execute('SELECT filename FROM files WHERE filename IS ?', 
(filename,))
row = cursor.fetchone()
if not row:
cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) )
files_new += 1
db_conn.commit()

C.
cursor = db_conn.cursor()
subdir_like = directory + '/%'
cursor.execute('SELECT filename FROM files WHERE filename LIKE ?', 
(subdir_like,))
rows = cursor.fetchall()
known_files =  [row[0] for row in rows]
for filename in listing:
if filename not in known_files:
cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) )
files_new += 1
db_conn.commit()

A+B was the first method I came up with, because it looks like the
"keep it simple & let the database do its job" approach, but it was
very time-consuming, so I tested A+C out.  A is quick (a second); B
can drag on for over an hour to check 2000 filenames (for example) in
a subdirectory; C always takes less than a minute.  So C is much
better than B, but it looks (to me) like one of those attempts to
bypass & ignore the database's built-in optimizations.

Comments?


-- 
No sport is less organized than Calvinball!
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Re: Searching for lots of similar strings (filenames) in sqlite3 database

Searching for lots of similar strings (filenames) in sqlite3 database

10 matches

Site Navigation

Mail list logo

Footer information