Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
> Having said that, I've had a bee in my bonnet for a long time about
> removing per-row setup cost for repetitive regex matches, and
> whatever infrastructure that needs would work for this too.

What are the per-row setup costs for regex matches? I looked at `regexp.c` and saw:

```
/*
 * We cache precompiled regular expressions using a "self organizing list"
 * structure, in which recently-used items tend to be near the front.
 * Whenever we use an entry, it's moved up to the front of the list.
 * Over time, an item's average position corresponds to its frequency of use.
 */
```

What proverbial bee did you have in your bonnet about the current regex implementation? Which functions other than `strpos` and `LIKE` would benefit from a similar cache, or perhaps a query-scoped cache?

In the meantime I'll look at other TODOs that catch my interest. Feel free to point me in the direction of one that you think is both desirable and easy enough for a beginner.

Thanks!
Karan
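To check my understanding of that comment, here is a toy Python model of the self-organizing-list idea, with `re.compile` standing in for the regex compilation being cached. The class name and capacity are my own inventions for illustration; the real cache is C code in regexp.c:

```python
import re


class MoveToFrontCache:
    """Toy model of the self-organizing list described in regexp.c:
    every hit moves the entry to the front, so frequently used patterns
    drift toward the head and lookups for them stay cheap."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = []  # list of (pattern, compiled); front = most recent

    def compile(self, pattern):
        for i, (pat, compiled) in enumerate(self.entries):
            if pat == pattern:
                # Hit: move this entry to the front of the list.
                self.entries.insert(0, self.entries.pop(i))
                return compiled
        # Miss: compile, add at the front, evict from the tail if full.
        compiled = re.compile(pattern)
        self.entries.insert(0, (pattern, compiled))
        if len(self.entries) > self.capacity:
            self.entries.pop()
        return compiled
```

The per-row setup cost this avoids is the recompilation: a second call with the same pattern returns the already-compiled object instead of compiling again.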
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
Yeah, you make a fair point. I was hoping more people would chime in with strong opinions to make the decision easier, but perhaps the lack of noise indicates this is a low-priority feature. I will ask around in my coworker circles to see what they think; could you do the same?

Just for the record, I'm leaning to the side of "not worth it". My thoughts are: if you are at a scale where you care about this 20% speedup, you would want to go all the way to an indexed structure, because the gains you would realize would exceed 20%, and 20% may not be a sufficient speedup anyway.

On Tue, Aug 2, 2016 at 1:56 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> Robert Haas wrote:
> > On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka <karanssi...@gmail.com> wrote:
> > > Following the patch to implement strpos with Boyer-Moore-Horspool,
> > > it was proposed we bring BMH to LIKE as well.
> > >
> > > Original thread:
> > > https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us
> > >
> > > I'm a first time hacker and I found this task on the TODO list. It seemed
> > > interesting and isolated enough. But after looking at the code in
> > > like_match.c, it's clearly a non-trivial task.
> > >
> > > How desirable is this feature? To begin answering that question,
> > > I did some initial benchmarking with an English text corpus to find how much
> > > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results
> > > were as follows: BMH can be up to 20% faster than LIKE, but for short
> > > substrings, it's roughly comparable or slower.
> > >
> > > Here are the results visualized:
> > >
> > > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
> > > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png
> >
> > Based on these results, this looks to me like a pretty unexciting
> > thing upon which to spend time.
>
> Uh, a 20% difference is "unexciting" for you? I think it's interesting.
> Now, really, users shouldn't be running LIKE on constant strings all the
> time but rather use some sort of indexed search, but once in a while
> there is a need to run some custom query and you need to LIKE-scan a
> large portion of a table. For those cases an algorithm that performs
> 20% better is surely welcome.
>
> I wouldn't be so quick to dismiss this.
>
> Of course, it needs to work in all cases, or failing that, be able to
> fall back to the original code if it cannot support some corner case.
>
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
I agree; should we remove it from the TODO list?

On Mon, Aug 1, 2016 at 6:13 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka <karanssi...@gmail.com> wrote:
> > Following the patch to implement strpos with Boyer-Moore-Horspool,
> > it was proposed we bring BMH to LIKE as well.
> >
> > Original thread:
> > https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us
> >
> > I'm a first time hacker and I found this task on the TODO list. It seemed
> > interesting and isolated enough. But after looking at the code in
> > like_match.c, it's clearly a non-trivial task.
> >
> > How desirable is this feature? To begin answering that question,
> > I did some initial benchmarking with an English text corpus to find how much
> > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results
> > were as follows: BMH can be up to 20% faster than LIKE, but for short
> > substrings, it's roughly comparable or slower.
> >
> > Here are the results visualized:
> >
> > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
> > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png
>
> Based on these results, this looks to me like a pretty unexciting
> thing upon which to spend time.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
[HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries
Hi pgsql-hackers,

Following the patch to implement strpos with Boyer-Moore-Horspool, it was proposed we bring BMH to LIKE as well.

Original thread:
https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us

I'm a first-time hacker and I found this task on the TODO list. It seemed interesting and isolated enough. But after looking at the code in like_match.c, it's clearly a non-trivial task.

How desirable is this feature? To begin answering that question, I did some initial benchmarking with an English text corpus to find out how much faster BMH (Boyer-Moore-Horspool) would be compared to LIKE. The results were as follows: BMH can be up to 20% faster than LIKE, but for short substrings it's roughly comparable or slower.

Here are the results visualized:

http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png

Data attached, and a description of the benchmark below. I'd love to hear your thoughts:

- Is the benchmark valid? Am I missing anything?
- What conclusions can we draw from the results?
- What action should we take from here?
- Is this the right mailing list for this discussion?

Thank you!

- Karan, pg community newbie

--- Benchmark Details ---

The easiest way to approximate the potential speedup from BMH is to compare the performance of the following queries:

1. SELECT count(*) FROM test_table WHERE text LIKE '%substring%';
2. SELECT count(*) FROM test_table WHERE strpos(text, 'substring') > 0;

(Note the argument order: strpos(string, substring) searches for the constant substring within the column value.)

We expect the strpos query to outperform the LIKE query, since strpos is implemented with BMH.

I loaded up the database with chunks of English text from the Bible, a commonly used corpus for simple text analysis. The exact procedure is described in more detail at the end of the email. I ran a few queries using short substrings and long substrings, the choice of which is discussed below.
## Choice of test data

BMH is known to be particularly fast on English text, so I loaded the test table with 5k-character chunks of text from the Bible.

BMH is expected to be slower for small substrings, where the overhead of creating a skip table may not be justified. For larger substrings, BMH may outperform LIKE thanks to the skip table: in the best case, when a text character does not appear in the substring at all, the search can jump ahead by the full length of the substring, skipping what would otherwise be a lot of work.

I chose the small (< 15 character) substrings to be the most popular bigrams and trigrams in the text. I chose the long (10-250 character) substrings at random, and took varied-length prefixes and suffixes to see how that affected algorithm performance. The reason prefixes alone were not sufficient is that BMH compares from the right end of the substring, unlike the LIKE algorithm. The full strings are included in the attached Excel file.

## Database setup

Download the corpus (englishTexts) from
http://www.dmi.unict.it/~faro/smart/corpus.php

Create the table:

    CREATE TABLE test_table (text text);

Generate the rows for insertion (chunks of 5k characters):

    wget http://ctrl-c.club/~ksikka/pg/gen_rows.py
    python gen_rows.py path/to/englishTexts/bible.txt > path/to/output/file

Load the table:

    COPY test_table (text) FROM '/absolute/path/to/previous/output/file' WITH ENCODING 'UTF-8';

I ran the COPY command 21 times so the table would be large enough for timing differences to stand out clearly. Queries were timed with psql's \timing feature; the mean of five runs was reported.

## Hardware

Apple MacBook Air (Mid 2012)
CPU: 1.8 GHz Intel Core i5
RAM: 4 GB 1600 MHz DDR3

--- gen_rows.py ---

```python
import io
import re
import sys
import random

## Configuration ##
CHUNK_LEN = 5000

# *fix, i.e. prefix or suffix, lengths.
starfix_lengths = [10, 50, 90, 130, 170, 210, 250]


def copyEscape(ustring):
    """
    From the COPY docs:

    Backslash characters (\) can be used in the COPY data to quote data
    characters that might otherwise be taken as row or column delimiters.
    In particular, the following characters must be preceded by a backslash
    if they appear as part of a column value: backslash itself, newline,
    carriage return, and the current delimiter character.

    :LINK: https://www.postgresql.org/docs/9.1/static/sql-copy.html
    """
    return re.sub(r'([\\\r\n\t])', r'\\\1', ustring, flags=re.UNICODE)


if __name__ == '__main__':
    input_filename = sys.argv[1]
    input_stream = io.open(input_filename, mode='r', encoding='Latin-1')

    prefix_mode = False
    suffix_mode = False
    if sys.argv[2] == '--prefix':
        prefix_mode = True
    if sys.argv[2] == '--suffix':
        suffix_mode = True

    # Read the input in fixed-size chunks; a short read means EOF.
    chunks = []
    chunk = None
    while chunk is None or len(chunk) == CHUNK_LEN:
        chunk = input_stream.read(CHUNK_LEN)
        if chunk == '':
            break
        chunks.append(chunk)

    if prefix_mode or suffix_mode:
```
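To make the skip-table behavior above concrete, here is a minimal illustrative Boyer-Moore-Horspool search in Python. This is a sketch for discussion only, not the C implementation that strpos actually uses:

```python
def bmh_find(text, pattern):
    """Return the index of the first occurrence of pattern in text,
    or -1 if absent, using Boyer-Moore-Horspool."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Skip table: for each character in the pattern (except the last),
    # how far the window may shift when that character is under the
    # window's final position. Characters absent from the pattern
    # allow a full shift of m -- the best case described above.
    skip = {ch: m - i - 1 for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        # Compare right-to-left conceptually; BMH inspects the window's
        # last character to decide the shift, which is why suffix choice
        # matters in the benchmark.
        if text[pos:pos + m] == pattern:
            return pos
        pos += skip.get(text[pos + m - 1], m)
    return -1
```

Building `skip` on every call is the setup overhead that makes BMH roughly comparable or slower for the short substrings in the benchmark; the long substrings amortize it and benefit from the large shifts.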