Hey,
I know nothing about what comparisons are fair or not :-).
Could you share a command line for running indexing benchmarks? That'd
already get me started...
Cheers,
Thomas
On Tue, 25 Apr 2023, 07:58 Patrick Zhai, wrote:
> I did a quick run with your patch, but since I turned on the CMS
Hey all,
ok, attached is a second patch that adds some unit tests; I am happy to add
more.
This brings me back to my original question: I'd like to run some pretty
thorough benchmarking on Lucene, both for this change and for possible
other future changes, largely focused on indexing
I would recommend some non-English tests. Terms in non-Latin scripts (CJK, Arabic,
Hebrew) will be longer byte strings because of UTF-8. German has long
compound words.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
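To make the point about UTF-8 byte lengths concrete, here is a minimal sketch. The class name, the helper, and the sample terms are illustrative assumptions and do not come from the thread or from Lucene; the non-ASCII terms are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8TermLengths {

    // Number of bytes the term occupies once encoded as UTF-8.
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample terms, chosen only for illustration.
        String[] terms = {
            "index",                          // English, ASCII only
            "Donaudampfschifffahrt",          // German compound word
            "\u691C\u7D22",                   // Japanese, 2 characters
            "\u0628\u062D\u062B",             // Arabic, 3 characters
            "\u05D7\u05D9\u05E4\u05D5\u05E9"  // Hebrew, 5 characters
        };
        for (String term : terms) {
            System.out.println(term + ": chars=" + term.length()
                + ", utf8 bytes=" + utf8Length(term));
        }
    }
}
```

Even short CJK, Arabic, or Hebrew terms can reach or exceed 8 bytes after encoding, so a benchmark corpus in those scripts would exercise a long-input code path much more often than English text does.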
> On Apr 25, 2023, at 10:57 AM, Thomas
Hey,
ok, I've done some digging: Unfortunately, MurmurHash3 does not publish
official test vectors, see the following URLs:
https://github.com/aappleby/smhasher/issues/6
https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
There is a link to a pastebin entry in the first
On Sun, Apr 23, 2023 at 6:19 AM Uwe Schindler wrote:
Having the sequence number public in API does not bring any benefit, as
> you cannot use it for anything.
>
Actually there are some interesting use cases for sequence numbers:
They enable the caller to know the effective order of operations
sure, but "if length > 8 return 1" might pass these same tests too,
yet cause a ton of hash collisions.
I just think if you want to optimize for super-long strings, there
should be a unit test.
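One way to get the kind of test asked for here is a differential check against a simple reference implementation over every input length around the block boundaries. The sketch below is only an illustration under stated assumptions: `reference` is a plain transcription of the commonly published MurmurHash3 x86_32 algorithm, `candidate` is a hypothetical variant that consumes 8 bytes per loop iteration, and `broken` is the "if length > 8 return 1" strawman. None of this is the patch discussed in the thread or Lucene's own code.

```java
import java.util.Random;

public class MurmurBoundaryCheck {

    interface Hash { int apply(byte[] data, int offset, int len, int seed); }

    // Reference: one 4-byte little-endian block per iteration.
    static int reference(byte[] data, int offset, int len, int seed) {
        int h1 = seed;
        int roundedEnd = offset + (len & 0xfffffffc);
        for (int i = offset; i < roundedEnd; i += 4) {
            h1 = mixBlock(h1, readIntLE(data, i));
        }
        return finish(h1, data, roundedEnd, len);
    }

    // Hypothetical candidate: two 4-byte blocks (8 bytes) per iteration.
    static int candidate(byte[] data, int offset, int len, int seed) {
        int h1 = seed;
        int i = offset;
        int end8 = offset + (len & 0xfffffff8);
        for (; i < end8; i += 8) {
            h1 = mixBlock(h1, readIntLE(data, i));
            h1 = mixBlock(h1, readIntLE(data, i + 4));
        }
        int roundedEnd = offset + (len & 0xfffffffc);
        if (i < roundedEnd) { // at most one leftover 4-byte block
            h1 = mixBlock(h1, readIntLE(data, i));
        }
        return finish(h1, data, roundedEnd, len);
    }

    // Strawman: correct up to 8 bytes, constant afterwards.
    static int broken(byte[] data, int offset, int len, int seed) {
        return len > 8 ? 1 : reference(data, offset, len, seed);
    }

    static int readIntLE(byte[] d, int i) {
        return (d[i] & 0xff) | ((d[i + 1] & 0xff) << 8)
            | ((d[i + 2] & 0xff) << 16) | (d[i + 3] << 24);
    }

    static int mixKey(int k1) {
        k1 *= 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        return k1 * 0x1b873593;
    }

    static int mixBlock(int h1, int k1) {
        h1 ^= mixKey(k1);
        h1 = Integer.rotateLeft(h1, 13);
        return h1 * 5 + 0xe6546b64;
    }

    // Mixes in the 0-3 tail bytes, then applies the final avalanche.
    static int finish(int h1, byte[] data, int tailStart, int len) {
        int k1 = 0;
        switch (len & 3) {
            case 3: k1 = (data[tailStart + 2] & 0xff) << 16; // fall through
            case 2: k1 |= (data[tailStart + 1] & 0xff) << 8; // fall through
            case 1: k1 |= (data[tailStart] & 0xff);
                    h1 ^= mixKey(k1);
        }
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // True if the hash matches the reference for every length 0..maxLen,
    // using random bytes (including values >= 0x80) and a non-zero offset.
    static boolean agreesWithReference(Hash hash, int maxLen, long randomSeed) {
        Random random = new Random(randomSeed);
        for (int len = 0; len <= maxLen; len++) {
            byte[] data = new byte[len + 3];
            random.nextBytes(data);
            int seed = random.nextInt();
            if (hash.apply(data, 3, len, seed) != reference(data, 3, len, seed)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("candidate agrees: "
            + agreesWithReference(MurmurBoundaryCheck::candidate, 64, 42L));
        System.out.println("broken agrees: "
            + agreesWithReference(MurmurBoundaryCheck::broken, 64, 42L));
    }
}
```

Sweeping every length from 0 to 64 covers the boundaries at 7, 8, 9, 15, 16, and 17 bytes, and random bytes include non-ASCII values, so a sign-extension or shift mistake in the long-input path would show up as a mismatch rather than slipping past a single all-ASCII test string.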
On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
wrote:
>
> Hey,
>
> I am pretty confident about
Ah, I see what you mean.
You are correct -- the change will not speed up a 5-byte word, but it
*will* speed up all 8+-byte words, at no cost to the shorter words.
On Tue, Apr 25, 2023 at 3:20 PM Robert Muir wrote:
> if a word is of length 5, processing 8 bytes at a time isn't going to
> speed
I think the results of the benchmark will depend on the properties of
the indexed terms. For English Wikipedia (luceneutil) the average word
length is around 5 bytes, so this optimization may not do much.
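A rough way to gauge this for a given corpus is to count how many terms are long enough to reach an 8-byte code path at all. The following is a minimal sketch; the class name, helper, and sample terms are assumptions made for illustration and are not part of luceneutil.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class LongTermShare {

    // Counts terms whose UTF-8 encoding is at least minBytes long.
    static long countAtLeast(List<String> terms, int minBytes) {
        return terms.stream()
            .filter(t -> t.getBytes(StandardCharsets.UTF_8).length >= minBytes)
            .count();
    }

    public static void main(String[] args) {
        // Hypothetical sample; a real measurement would read the corpus terms.
        List<String> sample =
            List.of("the", "index", "search", "benchmark", "tokenizer");
        long longTerms = countAtLeast(sample, 8);
        System.out.println(longTerms + " of " + sample.size()
            + " terms have at least 8 UTF-8 bytes");
    }
}
```

Running the same count over the actual indexed terms would show what fraction of hash calls could benefit, which is more informative than the average length alone.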
On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai wrote:
>
> I did a quick run with your patch,
Is average word length <= 4 realistic though? I mean, even the English wiki
corpus has ~5, which would require two calls to the Lucene layer instead of
one; e.g. multiple layers of virtual dispatch that are unnecessary?
You're not going to pay any cycles for reading 8 bytes instead of 4 bytes,
so
Well, there is some cost, as it must add additional checks to see if
it's longer than 8. In your patch, additional loops. It increases the
method size and may impact inlining and other things. Also, we can't
forget about correctness: if the hash function does the wrong thing it
could slow everything
There is literally one string, all-ascii. This won't fail if all the
shifts and masks are wrong.
About the inlining, I'm not talking about CPU stuff, I'm talking about
Java. There are limits to the size of methods that get inlined (e.g.
-XX:MaxInlineSize). If we make this method enormous like
Hey,
I am pretty confident about correctness. The change passes both Lucene and
ES regression tests, and from my careful reading of the code I am pretty
certain that the output is the same. If you want me to randomly test the result for
a few hundred million random strings, I'm happy to do that, too, if
Hey,
I offered to run a large number of random-string-hashes to ensure that the
output is the same pre- and post-change. I can add an arbitrary number of
such tests to TestStringHelper.java, just specify the number you wish.
If your worry is that my change breaches the inlining bytecode limit:
For a truly "pure" indexing test I usually use a single thread for
indexing, and SerialMergeScheduler (using that single thread to also do
single-threaded merging). It makes the indexing take forever lol but it
produces "comparable" results.
But ... this sounds like a great change anyway? Do we
If a word is of length 5, processing 8 bytes at a time isn't going to
speed anything up. There aren't 8 bytes to process.
On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
wrote:
>
> Is average word length <= 4 realistic though? I mean, even the english wiki
> corpus has ~5, which would require
I don't think we need a ton of random strings. But if you want to
optimize for strings of length 8, at a minimum there should be very
simple tests ensuring correctness for some boundary conditions (e.g.
string of length exactly 8). I would also strongly recommend testing
non-ASCII since Java is a
I think from my perspective it has nothing to do with CPUs being
32-bit or 64-bit and more to do with the average length of terms in
most languages being smaller than 8. For the languages with longer
word length, it's usually because of complex morphology that most users
would stem away. So doing 4
Hey,
so there are unit tests in TestStringHelper.java that test strings of
length greater than 8, and my change passes them. Could you explain what
you want tested?
Cheers,
Thomas
On Tue, Apr 25, 2023 at 4:21 PM Robert Muir wrote:
> sure, but "if length > 8 return 1" might pass these same
I think Apache Solr could explore leveraging the returned sequence number
for its transaction logs.
On Tue, 25 Apr 2023 at 18:36, Michael McCandless
wrote:
> On Sun, Apr 23, 2023 at 6:19 AM Uwe Schindler wrote:
>
> Having the sequence number public in API does not bring any benefit, as
>> you