Re: Patch to change murmurhash implementation slightly

2023-08-26 Thread Thomas Dullien
Hey Marcus, Thank you for your reply. I'll work on Lucenebench and report back, I'll also check if adding a benchmark for a long-utf-8-word language makes the effect more visible. I will update this thread next week then :-) Have a good weekend y'all :-) Cheers, Thomas Marcus Eagan schrieb

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Yonik Seeley
On Fri, Aug 25, 2023 at 6:34 PM Thomas Dullien wrote: > apologies if the chart is incorrect. The chart isn't necessarily incorrect, but it probably isn't the most relevant statistic here. "Lies, damn lies, and statistics" ;-) The average length of unique English words is not the same as the

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Marcus Eagan
Thomas, Also, is it possible to open this patch as a pull request in GitHub? I guess it does not matter for a lot of the people here. It would make it easier for more people to collaborate in that medium given the shift to GitHub recently. - Marcus On Fri, Aug 25, 2023 at 7:03 PM Marcus

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Marcus Eagan
Hi Thomas, Thank you for the hard work thus far. I'm excited to see if the community can benefit from the work. The best way to use the lucene bench is to run the baseline and candidate branches as described here . I

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Thomas Dullien
Hey all, apologies if the chart is incorrect. Anyhow, I think the more important questions are: 1) Which benchmarks does the Lucene community have that y'all would like to see an improvement on before accepting this (or any other future) performance patches? I'm guessing the reason why the

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Robert Muir
chart is wrong, average word length for english is like 5. On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien wrote: > > Hey all, > > another data point: There's a diagram with the relevant distributions of word > lengths in various languages here: > >

Re: Patch to change murmurhash implementation slightly

2023-08-25 Thread Thomas Dullien
Hey all, another data point: There's a diagram with the relevant distributions of word lengths in various languages here: https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/ While English is close to the 8-byte limit, average word length

Re: Patch to change murmurhash implementation slightly

2023-08-24 Thread Thomas Dullien
Hey there, reviving this thread. To clarify: In order to show this patch is worth doing, I should index a bunch of natural-language documents (whichever language that is) and show that the patch brings a performance benefit? (Just clarifying, because at least inside ElasticSearch for the logs

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Walter Underwood
I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, Hebrew) will have longer byte strings because of UTF8. German has large compound words. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 25, 2023, at 10:57 AM, Thomas

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Thomas Dullien
Hey all, ok, attached is a second patch that adds some unit tests; I am happy to add more. This brings me back to my original question: I'd like to run some pretty thorough benchmarking on Lucene, both for this change and for possible other future changes, largely focused on indexing

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Thomas Dullien
Hey, ok, I've done some digging: Unfortunately, MurmurHash3 does not publish official test vectors, see the following URLs: https://github.com/aappleby/smhasher/issues/6 https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958 There is a link to a pastebin entry in the first

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
i dont think we need a ton of random strings. But if you want to optimize for strings of length 8, at a minimum there should be very simple tests ensuring correctness for some boundary conditions (e.g. string of length exactly 8). i would also strongly recommend testing non-ascii since java is a

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Thomas Dullien
Hey, I offered to run a large number of random-string-hashes to ensure that the output is the same pre- and post-change. I can add an arbitrary number of such tests to TestStringHelper.java, just specify the number you wish. If your worry is that my change breaches the inlining bytecode limit:

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
There is literally one string, all-ascii. This won't fail if all the shifts and masks are wrong. About the inlining, i'm not talking about cpu stuff, i'm talking about java. There are limits to the size of methods that get inlined (e.g. -XX:MaxInlineSize). If we make this method enormous like

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Thomas Dullien
Hey, so there are unit tests in TestStringHelper.java that test strings of length greater than 8, and my change passes them. Could you explain what you want tested? Cheers, Thomas On Tue, Apr 25, 2023 at 4:21 PM Robert Muir wrote: > sure, but "if length > 8 return 1" might pass these same

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
sure, but "if length > 8 return 1" might pass these same tests too, yet cause a ton of hash collisions. I just think if you want to optimize for super-long strings, there should be a unit test. On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien wrote: > > Hey, > > I am pretty confident about

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Thomas Dullien
Hey, I am pretty confident about correctness. The change passes both Lucene and ES regression tests and my careful reading of the code is pretty certain that the output is the same. If you want me to randomly test the result for a few hundred million random strings, I'm happy to do that, too, if

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
well there is some cost, as it must add additional checks to see if its longer than 8. in your patch, additional loops. it increases the method size and may impact inlining and other things. also we can't forget about correctness, if the hash function does the wrong thing it could slow everything

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Thomas Dullien
Ah, I see what you mean. You are correct -- the change will not speed up a 5-byte word, but it *will* speed up all 8+-byte words, at no cost to the shorter words. On Tue, Apr 25, 2023 at 3:20 PM Robert Muir wrote: > if a word is of length 5, processing 8 bytes at a time isn't going to > speed

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
if a word is of length 5, processing 8 bytes at a time isn't going to speed anything up. there aren't 8 bytes to process. On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien wrote: > > Is average word length <= 4 realistic though? I mean, even the english wiki > corpus has ~5, which would require

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Thomas Dullien
Is average word length <= 4 realistic though? I mean, even the english wiki corpus has ~5, which would require two calls to the lucene layer instead of one; e.g. multiple layers of virtual dispatch that are unnecessary? You're not going to pay any cycles for reading 8 bytes instead of 4 bytes, so

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
i think from my perspective it has nothing to do with cpus being 32-bit or 64-bit and more to do with the average length of terms in most languages being smaller than 8. for the languages with longer word length, its usually because of complex morphology that most users would stem away. so doing 4

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Michael McCandless
For a truly "pure" indexing test I usually use a single thread for indexing, and SerialMergeScheduler (using that single thread to also do single-threaded merging). It makes the indexing take forever lol but it produces "comparable" results. But ... this sounds like a great change anyway? Do we

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Robert Muir
I think the results of the benchmark will depend on the properties of the indexed terms. For english wikipedia (luceneutil) the average word length is around 5 bytes so this optimization may not do much. On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai wrote: > > I did a quick run with your patch,

Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Thomas Dullien
Hey, I know nothing about what comparisons are fair or not :-). Could you share a command line for running indexing benchmarks? That'd already get me started... Cheers, Thomas On Tue, 25 Apr 2023, 07:58 Patrick Zhai, wrote: > I did a quick run with your patch, but since I turned on the CMS

Re: Patch to change murmurhash implementation slightly

2023-04-24 Thread Patrick Zhai
I did a quick run with your patch, but since I turned on the CMS as well as TieredMergePolicy I'm not sure how fair the comparison is. Here's the result: Candidate: Indexer: indexing done (890209 msec); total 2620 docs Indexer: waitForMerges done (71622 msec) Indexer: finished (961877 msec)

Patch to change murmurhash implementation slightly

2023-04-24 Thread Thomas Dullien
Hey all, I've been experimenting with fixing some low-hanging performance fruit in the ElasticSearch codebase, and came across the fact that the MurmurHash implementation that is used by ByteRef.hashCode() is reading 4 bytes per loop iteration (which is likely an artifact from 32-bit