Re: Searching for lots of similar strings (filenames) in sqlite3 database
On 2014-07-02, Chris Angelico wrote: > On Wed, Jul 2, 2014 at 7:32 PM, Adam Funk wrote: >> Well, I've changed it to the following anyway. >> >> subdir_glob = subdir + '/*' >> cursor.execute('SELECT filename FROM files WHERE filename GLOB ?', >>(subdir_glob,)) >> rows = cursor.fetchall() >> known_files = {row[0] for row in rows} >> >> I see what you mean about paths containing '%', but I don't see why >> you were concerned about underscores, though. > > With GLOB, presumably ? matches a single character and * matches any > number of characters. With LIKE, _ matches a single character and % > matches any number. So, for instance, WHERE filename LIKE > '/foo/bar/spam_spam/%' will match '/foo/bar/spam2spam/1234', which may > be a little surprising. It's not going to be a serious problem in most > cases, as it'll also match '/foo/bar/spam_spam/1234', but the false > positives will make one of those "Huh" moments if you don't > keep an eye on your magic characters. > > In your specific case, you happen to be safe, but as I look over the > code, my paranoia kicks in and tells me to check :) It's just one of > those things that flags itself to the mind - anything that might help > catch bugs early is a good feature of the mind, in my opinion! Oh, I'd just missed the '_' in the LIKE documentation. Doh! -- Indentation is for enemy skulls, not code! --- Klingon Programmer's Guide -- https://mail.python.org/mailman/listinfo/python-list
Re: Searching for lots of similar strings (filenames) in sqlite3 database
On Wed, Jul 2, 2014 at 7:32 PM, Adam Funk wrote: > Well, I've changed it to the following anyway. > > subdir_glob = subdir + '/*' > cursor.execute('SELECT filename FROM files WHERE filename GLOB ?', >(subdir_glob,)) > rows = cursor.fetchall() > known_files = {row[0] for row in rows} > > I see what you mean about paths containing '%', but I don't see why > you were concerned about underscores, though. With GLOB, presumably ? matches a single character and * matches any number of characters. With LIKE, _ matches a single character and % matches any number. So, for instance, WHERE filename LIKE '/foo/bar/spam_spam/%' will match '/foo/bar/spam2spam/1234', which may be a little surprising. It's not going to be a serious problem in most cases, as it'll also match '/foo/bar/spam_spam/1234', but the false positives will make one of those "Huh" moments if you don't keep an eye on your magic characters. In your specific case, you happen to be safe, but as I look over the code, my paranoia kicks in and tells me to check :) It's just one of those things that flags itself to the mind - anything that might help catch bugs early is a good feature of the mind, in my opinion! ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Searching for lots of similar strings (filenames) in sqlite3 database
On 2014-07-01, Chris Angelico wrote: > On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk wrote: >> On 2014-07-01, Chris Angelico wrote: >>> There is one critical consideration, though. What happens if the >>> directory name contains an underscore or percent sign? Or can you >>> absolutely guarantee that they won't? You may need to escape them, and >>> I'm not sure how SQLite handles that. (Possibly \_ will match literal >>> _, and \\ will match literal \, or something like that.) >> >> I can guarantee that the directory names are all >> '/var/spool/news/message.id/' then 3 digits. (The filenames are >> pretty wild, since they are MIDs.) AIUI, using the '?' substitution >> in the sqlite3 library is supposed to be safe. > > This is nothing to do with question-mark substitution. There are two > separate levels of character significance here - it's like a quoted > string with a regex. Suppose you want to make a regex that searches > for an apostrophe. If you try to define that in a single-quoted > string, you need to escape it: > > regex = '^\'$' > > However, if you ask the user to enter a regex, that wouldn't be necessary: > > regex = input("Enter a pattern: ") # raw_input in Python 2 > Enter a pattern: ^'$ > > This is what the question mark substitution is like - it avoids the > need to carefully manage string delimiters and so on. However, if you > want to make a regex that searches for a backslash, then you need to > escape it, because the backslash is important to the regex itself. In > the same way, the underscore and percent sign are significant to the > LIKE operator. If it were possible to have a directory name with a > percent sign in it, it would match far too much - because you'd > construct a LIKE pattern something like (ahem) > "/var/spool/news/message%20id/142/%" - and as you can see, the percent > sign at the end is no different from the percent sign in the middle. > > But you're safe because you know your data, unrelated to your > substitution method. Possibly merits a comment... but possibly not > worth it. Well, I've changed it to the following anyway. subdir_glob = subdir + '/*' cursor.execute('SELECT filename FROM files WHERE filename GLOB ?', (subdir_glob,)) rows = cursor.fetchall() known_files = {row[0] for row in rows} I see what you mean about paths containing '%', but I don't see why you were concerned about underscores, though. -- You know, there are many people in the country today who, through no fault of their own, are sane. Some of them were born sane. Some of them became sane later in their lives.--― Graham Chapman -- https://mail.python.org/mailman/listinfo/python-list
Re: Searching for lots of similar strings (filenames) in sqlite3 database
On 2014-07-01, Chris Angelico wrote: > On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk wrote: >> On 2014-07-01, Chris Angelico wrote: >> >>> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk wrote: cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,)) >>> >>> Shouldn't this be an equality check rather than IS, which normally I'd >>> expect to be "IS NULL" or "IS NOT NULL"? >> >> Oh, it probably should be in "heavy" SQL. In SQLite, '==', '=', & >> 'IS' are interchangeable. >> >> http://www.tutorialspoint.com/sqlite/sqlite_operators.htm > > Ah, okay. In that case, I'd advise going with either == for > consistency with the rest of Python, or (preferably) = for consistency > with other SQL engines. You wouldn't use "is" to test if two Python > strings are equal, so there's no particular reason to use it here :) I agree. >> Oh, even better: >> >> add_files = listing - known_files >> delete_files = known_files - listing >> >> and then I can remove files that have disappeared off the spool from >> the table. Thanks very much! > > Ah! Didn't know that was a valuable feature for you, but getting that > "for free" is an extra little bonus, so that's awesome! I didn't know it was a valuable feature until I saw that easy way to do it! This will keep the database from growing indefinitely, though. -- '...and Tom [Snyder] turns to him and says, "so Alice [Cooper], is it true you kill chickens on stage?" That was the opening question, and Alice looks at him real serious and goes, "Oh no, no no. That's Colonel Sanders. Colonel Sanders kills chickens."' -- https://mail.python.org/mailman/listinfo/python-list
Re: Searching for lots of similar strings (filenames) in sqlite3 database
On Wed, Jul 2, 2014 at 1:15 AM, Adam Funk wrote: > On 2014-07-01, Chris Angelico wrote: > >> On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk wrote: >>> cursor.execute('SELECT filename FROM files WHERE filename IS ?', >>> (filename,)) >> >> Shouldn't this be an equality check rather than IS, which normally I'd >> expect to be "IS NULL" or "IS NOT NULL"? > > Oh, it probably should be in "heavy" SQL. In SQLite, '==', '=', & > 'IS' are interchangeable. > > http://www.tutorialspoint.com/sqlite/sqlite_operators.htm Ah, okay. In that case, I'd advise going with either == for consistency with the rest of Python, or (preferably) = for consistency with other SQL engines. You wouldn't use "is" to test if two Python strings are equal, so there's no particular reason to use it here :) > Oh, even better: > > add_files = listing - known_files > delete_files = known_files - listing > > and then I can remove files that have disappeared off the spool from > the table. Thanks very much! Ah! Didn't know that was a valuable feature for you, but getting that "for free" is an extra little bonus, so that's awesome! >> There is one critical consideration, though. What happens if the >> directory name contains an underscore or percent sign? Or can you >> absolutely guarantee that they won't? You may need to escape them, and >> I'm not sure how SQLite handles that. (Possibly \_ will match literal >> _, and \\ will match literal \, or something like that.) > > I can guarantee that the directory names are all > '/var/spool/news/message.id/' then 3 digits. (The filenames are > pretty wild, since they are MIDs.) AIUI, using the '?' substitution > in the sqlite3 library is supposed to be safe. This is nothing to do with question-mark substitution. There are two separate levels of character significance here - it's like a quoted string with a regex. Suppose you want to make a regex that searches for an apostrophe. If you try to define that in a single-quoted string, you need to escape it: regex = '^\'$' However, if you ask the user to enter a regex, that wouldn't be necessary: regex = input("Enter a pattern: ") # raw_input in Python 2 Enter a pattern: ^'$ This is what the question mark substitution is like - it avoids the need to carefully manage string delimiters and so on. However, if you want to make a regex that searches for a backslash, then you need to escape it, because the backslash is important to the regex itself. In the same way, the underscore and percent sign are significant to the LIKE operator. If it were possible to have a directory name with a percent sign in it, it would match far too much - because you'd construct a LIKE pattern something like (ahem) "/var/spool/news/message%20id/142/%" - and as you can see, the percent sign at the end is no different from the percent sign in the middle. But you're safe because you know your data, unrelated to your substitution method. Possibly merits a comment... but possibly not worth it. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Searching for lots of similar strings (filenames) in sqlite3 database
On 2014-07-01, Chris Angelico wrote: > On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk wrote: >> cursor.execute('SELECT filename FROM files WHERE filename IS ?', >> (filename,)) > > Shouldn't this be an equality check rather than IS, which normally I'd > expect to be "IS NULL" or "IS NOT NULL"? Oh, it probably should be in "heavy" SQL. In SQLite, '==', '=', & 'IS' are interchangeable. http://www.tutorialspoint.com/sqlite/sqlite_operators.htm Looking at that page again, I see that 'GLOB' is a case-sensitive version of 'LIKE'. I can't help but wonder if that makes it faster. ;-) > As to your actual question: Your two database lookups are doing > distinctly different things, so there's no surprise that they perform > very differently. B asks the database "Do you have this? Do you have > this?" for every file you have, and C asks the database "What do you > have?", and then comparing that against the list of files. By the way > - the A+C technique could be done quite tidily as a set difference: > > # assume you have listing1 and cursor set up > # as per your above code > listing = {os.path.join(directory, x) for x in listing1} > cursor.execute(...) # as per above > known_files = {row[0] for row in cursor} # cursors are iterable > needed_files = listing - known_files > cursor.executemany('INSERT INTO files VALUES (?, ?)', ((filename, > 0) for filename in needed_files)) Oh, even better: add_files = listing - known_files delete_files = known_files - listing and then I can remove files that have disappeared off the spool from the table. Thanks very much! > Anyway. The significant thing is the performance of the database on > two different workloads: either "give me everything that matches this > pattern" (where the pattern ends with a percent sign), or "do you have > this? do you have this? do you have this?". Generally, database > indexing is fairly efficient at handling prefix searches, so the first > query will basically amount to an index search, which is a lot faster > than the repeated separate searching; it takes advantage of the fact > that all the strings you're looking at will have the same prefix. > > There is one critical consideration, though. What happens if the > directory name contains an underscore or percent sign? Or can you > absolutely guarantee that they won't? You may need to escape them, and > I'm not sure how SQLite handles that. (Possibly \_ will match literal > _, and \\ will match literal \, or something like that.) I can guarantee that the directory names are all '/var/spool/news/message.id/' then 3 digits. (The filenames are pretty wild, since they are MIDs.) AIUI, using the '?' substitution in the sqlite3 library is supposed to be safe. > This is not bypassing the database's optimization; in fact, it's > working tidily within it. That's reassuring! ... > But doing the set difference in Python is just as good a way of doing the job. I like it. Thanks very much. -- Specifications are for the weak & timid! --- Klingon Programmer's Guide -- https://mail.python.org/mailman/listinfo/python-list
Re: Searching for lots of similar strings (filenames) in sqlite3 database
On Tue, Jul 1, 2014 at 10:13 PM, MRAB wrote: > Anyway, I'm sure there's something in SQL for "insert or update" or "on > duplicate", but that's an SQL question, not a Python question. Not in standard SQL, no; there might be in SQLite, as a non-standard extension, but it's a fundamentally hard problem and it has issues. Frankly, though, I doubt the time cost of set operations is anything significant compared to the various queries against the database. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Searching for lots of similar strings (filenames) in sqlite3 database
On 2014-07-01 12:26, Adam Funk wrote: I have some code that reads files in a leafnode2 news spool & needs to check for new files periodically. The full paths are all like '/var/spool/news/message.id/345/<123...@example.com>' with a 3-digit subdirectory & a Message-ID for the filename itself. I'm using Python 3 & sqlite3 in the standard library. I have a table of filenames created with the following command: cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used INTEGER)') To check for new files in one of the subdirectories, I run A then either B or C below (I've tried both). A. listing1 = os.listdir(directory) listing [os.path.join(directory, x) for x in listing1] B. cursor = db_conn.cursor() for filename in listing: cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,)) row = cursor.fetchone() if not row: cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) ) files_new += 1 db_conn.commit() C. cursor = db_conn.cursor() subdir_like = directory + '/%' cursor.execute('SELECT filename FROM files WHERE filename LIKE ?', (subdir_like,)) rows = cursor.fetchall() known_files = [row[0] for row in rows] for filename in listing: if filename not in known_files: cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) ) files_new += 1 db_conn.commit() A+B was the first method I came up with, because it looks like the "keep it simple & let the database do its job" approach, but it was very time-consuming, so I tested A+C out. A is quick (a second); B can drag on for over an hour to check 2000 filenames (for example) in a subdirectory; C always takes less than a minute. So C is much better than B, but it looks (to me) like one of those attempts to bypass & ignore the database's built-in optimizations. Comments? In C, 'known_files' is a list, so it performs a linear search for each of the filenames. If you make 'known_files' a set, it'll probably be even faster! Anyway, I'm sure there's something in SQL for "insert or update" or "on duplicate", but that's an SQL question, not a Python question. -- https://mail.python.org/mailman/listinfo/python-list
Re: Searching for lots of similar strings (filenames) in sqlite3 database
On Tue, Jul 1, 2014 at 9:26 PM, Adam Funk wrote: > cursor.execute('SELECT filename FROM files WHERE filename IS ?', > (filename,)) Shouldn't this be an equality check rather than IS, which normally I'd expect to be "IS NULL" or "IS NOT NULL"? As to your actual question: Your two database lookups are doing distinctly different things, so there's no surprise that they perform very differently. B asks the database "Do you have this? Do you have this?" for every file you have, and C asks the database "What do you have?", and then comparing that against the list of files. By the way - the A+C technique could be done quite tidily as a set difference: # assume you have listing1 and cursor set up # as per your above code listing = {os.path.join(directory, x) for x in listing1} cursor.execute(...) # as per above known_files = {row[0] for row in cursor} # cursors are iterable needed_files = listing - known_files cursor.executemany('INSERT INTO files VALUES (?, ?)', ((filename, 0) for filename in needed_files)) Anyway. The significant thing is the performance of the database on two different workloads: either "give me everything that matches this pattern" (where the pattern ends with a percent sign), or "do you have this? do you have this? do you have this?". Generally, database indexing is fairly efficient at handling prefix searches, so the first query will basically amount to an index search, which is a lot faster than the repeated separate searching; it takes advantage of the fact that all the strings you're looking at will have the same prefix. There is one critical consideration, though. What happens if the directory name contains an underscore or percent sign? Or can you absolutely guarantee that they won't? You may need to escape them, and I'm not sure how SQLite handles that. (Possibly \_ will match literal _, and \\ will match literal \, or something like that.) This is not bypassing the database's optimization; in fact, it's working tidily within it. If you want to express your logic in a way that lets the database truly be in command of optimization, it would look something like this: INSERT INTO files SELECT filename,0 FROM (VALUES('foo'),('bar'),('quux'),('spam')) AS needed(filename) EXCEPT SELECT filename,0 FROM files And you'd need to check the SQLite docs for how to define a table that consists of literal values. (The above syntax works in PostgreSQL; I'm pretty sure it's not all standard SQL.) But doing the set difference in Python is just as good a way of doing the job. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Searching for lots of similar strings (filenames) in sqlite3 database
I have some code that reads files in a leafnode2 news spool & needs to check for new files periodically. The full paths are all like '/var/spool/news/message.id/345/<123...@example.com>' with a 3-digit subdirectory & a Message-ID for the filename itself. I'm using Python 3 & sqlite3 in the standard library. I have a table of filenames created with the following command: cursor.execute('CREATE TABLE files (filename TEXT PRIMARY KEY, used INTEGER)') To check for new files in one of the subdirectories, I run A then either B or C below (I've tried both). A. listing1 = os.listdir(directory) listing [os.path.join(directory, x) for x in listing1] B. cursor = db_conn.cursor() for filename in listing: cursor.execute('SELECT filename FROM files WHERE filename IS ?', (filename,)) row = cursor.fetchone() if not row: cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) ) files_new += 1 db_conn.commit() C. cursor = db_conn.cursor() subdir_like = directory + '/%' cursor.execute('SELECT filename FROM files WHERE filename LIKE ?', (subdir_like,)) rows = cursor.fetchall() known_files = [row[0] for row in rows] for filename in listing: if filename not in known_files: cursor.execute('INSERT INTO files VALUES (?, ?)', (filename, 0) ) files_new += 1 db_conn.commit() A+B was the first method I came up with, because it looks like the "keep it simple & let the database do its job" approach, but it was very time-consuming, so I tested A+C out. A is quick (a second); B can drag on for over an hour to check 2000 filenames (for example) in a subdirectory; C always takes less than a minute. So C is much better than B, but it looks (to me) like one of those attempts to bypass & ignore the database's built-in optimizations. Comments? -- No sport is less organized than Calvinball! -- https://mail.python.org/mailman/listinfo/python-list