Re: Hunspell performance

2021-02-10 Thread Dawid Weiss
I didn't mean for Peter to write both backends, but perhaps, if he's
experimenting already anyway, he could make it possible to extract an interface
which could be substituted externally with different implementations. That would
make it easier to tinker with various options, even for us.

D.
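
(For illustration only, a minimal sketch of what such an extracted backend
interface could look like. All names below are hypothetical, not existing
Lucene API; the point is just that the FST-based lookup and a memory-hungrier
map-based lookup could sit behind the same seam.)

  import java.util.HashMap;
  import java.util.Map;

  /** Hypothetical extraction point: how flag data is looked up for a word form. */
  interface DictionaryBackend {
    /** Returns the encoded flag data for the exact word form, or null if absent. */
    char[] lookup(char[] word, int offset, int length);
  }

  /** A simple map-based backend: faster lookups, noticeably more memory than the FST. */
  final class MapDictionaryBackend implements DictionaryBackend {
    private final Map<String, char[]> entries = new HashMap<>();

    void add(String wordForm, char[] flags) {
      entries.put(wordForm, flags);
    }

    @Override
    public char[] lookup(char[] word, int offset, int length) {
      // allocates a String per query; a CharArrayMap-style structure would avoid this
      return entries.get(new String(word, offset, length));
    }
  }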

On Thu, Feb 11, 2021 at 1:16 AM Robert Muir  wrote:

> On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss  wrote:
> > Maybe the "backend" could be configurable somehow so that you could
> change the strategy depending on your needs?... I haven't looked at how
> FSTs are used but if it can be hidden behind a facade then an alternative
> implementation could be provided depending on one's need?
> >
> > D.
> >
>
> I don't have any confidence that solr would default to the "smaller"
> option or fix how they manage different solr cores or thousands of
> threads or any of the analyzer issues. And who would maintain this
> separate hunspell backend? I don't think it is fair to Peter to have
> to cope with 2 implementations of hunspell; 1 is certainly enough...
> :). It's all Apache license; at the end of the day, if someone wants to
> step up, let 'em; otherwise let's get out of their way.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Hunspell performance

2021-02-10 Thread Robert Muir
On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss  wrote:
> Maybe the "backend" could be configurable somehow so that you could change 
> the strategy depending on your needs?... I haven't looked at how FSTs are 
> used but if it can be hidden behind a facade then an alternative implementation 
> could be provided depending on one's need?
>
> D.
>

I don't have any confidence that solr would default to the "smaller"
option or fix how they manage different solr cores or thousands of
threads or any of the analyzer issues. And who would maintain this
separate hunspell backend? I don't think it is fair to Peter to have
to cope with 2 implementations of hunspell; 1 is certainly enough...
:). It's all Apache license; at the end of the day, if someone wants to
step up, let 'em; otherwise let's get out of their way.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Trouble with building PyLucene on Mac

2021-02-10 Thread Andi Vajda



 Hi Clem,

Lots of replies inline...

On Wed, 10 Feb 2021, Wang, Clem wrote:

(My msg was originally posted here: 
https://issues.apache.org/jira/projects/PYLUCENE/issues/PYLUCENE-10 but 
Andreas Vajda said I should send it to the mailing list. I missed whatever 
he had posted to the mailing list previously.)


For earlier posts, you may refer to the mailing list archives:
  https://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/

I had a lot of trouble with building PyLucene.  I finally got it built 
(under Python 2.7) with much gnashing of teeth but I'm not sure I can get 
it to run under 2.7 (I'm not going to bother describing that since I don't 
think it will be useful.) I'd rather try to get PyLucene built and running 
under Python 3.x.


Please don't complain or gnash your teeth; PyLucene is a C++/Java/Python 
extension, and knowledge of how to operate a C++ compiler and linker is 
required. No excuses. The PyLucene build uses a Makefile with example 
configurations that you need to modify to satisfy your environment's 
constraints. Before building PyLucene, you need to build JCC, which uses a 
setup.py file where, again, depending on your setup (Java, in particular), 
you need to set things that fit your installation.


I suspect some of this has to do with lack of support from Apple and 
Oracle for Java as well as half-hearted support between gcc & clang 
headers, libraries, and flags, so I'm not even sure how much 
responsibility falls onto PyLucene.


No amount of blaming others is going to spare you the need to know 
how to operate a C++ compiler and linker competently.


Last I checked, Apple didn't include a version of Python 3 in their OS, so, 
if you wish to run PyLucene with Python 3, you need to start by installing 
Python 3, either from sources or from a binary distribution of your choice.

I suggest starting with sources at python.org.

For me, it would be preferable if I could just download a binary (although 
I don't know how difficult that would be).


I can't distribute binaries. There are too many combinations of binaries 
to build based on all the OS, Python, Java and Lucene versions. I also 
refuse to take responsibility in shipping binaries, as I can't vouch for 
them. What if they contain a virus? You can inspect all the sources, 
however.
PyLucene is open source and you are expected to be able to build it from 
sources without too much trouble and debug issues related to your 
environment choices yourself. You are also welcome to ask questions here and 
we can help you, but complaining about things or gnashing teeth is going to 
waste the goodwill of readers of this list.



My configuration:

 *   pylucene-8.6.1 gotten from 
https://mirrors.ocf.berkeley.edu/apache/lucene/pylucene/

 *   Mac OSX 10.15.7
 *   Macbook Pro, Intel Core i7
 *   gcc --version   gcc (Homebrew GCC 10.2.0_3) 10.2.0
 *   clang --version
*   Apple clang version 12.0.0 (clang-1200.0.32.29)
*   Target: x86_64-apple-darwin19.6.0
*   Thread model: posix


You must use the same compiler that was used to build the Python version of 
your choice. If you installed binaries from python.org, that is likely to 
be clang from Apple's command line dev tools.
If it's from homebrew, then probably a homebrew compiler. But, by all means, 
do not mix homebrew and non-homebrew things.


I recommend you take the homebrew gcc compiler off your PATH and ensure you 
have the Apple Clang from the command line dev tools installed and 
accessible on your PATH by default (or from Xcode) and build Python 3 from 
sources. See xcode-select for more information.

  https://developer.apple.com/library/archive/technotes/tn2339/_index.html

These are the steps I used to configure and build Python 3.9.1 a couple of 
weeks ago after downloading the sources from python.org:


  step 0: build and install libressl from sources
$ curl -O 
https://ftp.openbsd.org/pub/OpenBSD/LibreSSL/libressl-.tar.gz
$ tar -xvzf ~/tmp/downloads/libressl-.tar.gz
$ cd libressl-
$ ./configure --prefix=/usr/local/libressl
$ make
$ sudo make install

  step 1:
 download python 3 sources from python.org
 unpack the archive
 $ ./configure --prefix=`pwd`/_install --enable-framework=`pwd`/_framework 
--with-openssl=/usr/local/libressl
 $ make
 $ make install
   (python is now installed in `pwd`/_install; you may, of course, choose
to install it anywhere)
 I also strongly recommend you set up a virtual environment for PyLucene
 at this time.

Now that you have python 3 installed, on to JCC:
 step 0: ensure you have a Java JDK installed
 step 1: ensure JCC's setup.py is properly configured for your setup
 $ `pwd`/_install/bin/python setup.py build install
(or use the path to python in your virtual env, if you set one up)

Once JCC is built and installed, edit PyLucene's Makefile to fit your environment
as well.
 $ make
 $ make test
 $ make install


 *   Python 3.8.2
 *   

Re: Help needed with fixing lucene-site GitHub repo

2021-02-10 Thread Anshum Gupta
This has been resolved.

Thanks to everyone who helped :)

On Wed, Feb 10, 2021 at 12:36 PM Anshum Gupta 
wrote:

> Can you elaborate more on this? I was also trying to see if I could
> just create a PR to merge production -> master, but that would just mess
> up the history. It will bring the code in sync but I'm also not sure if
> that would fix the larger problem.
>
> On Wed, Feb 10, 2021 at 12:01 PM Michael Sokolov 
> wrote:
>
>> Have you considered using a merge commit for this? That won't require
>> force pushing
>>
>> On Wed, Feb 10, 2021 at 2:51 PM Anshum Gupta 
>> wrote:
>> >
>> > Hi All,
>> >
>> > Seems like during the last release, we directly committed the website
>> changes to the production branch, bypassing the master. This is now causing
>> issues with merging updates from master into prod using the simple 'create
>> PR' -> 'merge master to prod' workflow.
>> >
>> > I was working with Cassandra to clean this up but we'd need help from
>> someone who's more confident and experienced with such GitHub issues.
>> >
>> > I tried rebasing production to master in hopes that we'll get the
>> missing commits correctly into master, but that seems to warn of the
>> diverging branches and requires a force push, something I wasn't
>> comfortable doing without another set of eyes :)
>> >
>> > If you have suggestions or know what to do here, please help with
>> fixing the branch.
>> >
>> > --
>> > Anshum Gupta
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> Anshum Gupta
>


-- 
Anshum Gupta


Re: Hunspell performance

2021-02-10 Thread Gus Heck
+1 to configurability that is well documented, and reasonably actionable
downstream in Solr... Some folks struggle with the costs of buying machines
with lots of memory.

On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss  wrote:

>
>
>> To me the challenge with such a change is just trying to prevent
>> strange dictionaries from blowing up to 30x the space :)
>
> Maybe the "backend" could be configurable somehow so that you could change
> the strategy depending on your needs?... I haven't looked at how FSTs are
> used but if can be hidden behind a facade then an alternative
> implementation could be provided depending on one's need?
>
> D.
>
>
>>
>> On Wed, Feb 10, 2021 at 12:53 PM Peter Gromov
>>  wrote:
>> >
>> > I was hoping for some numbers :) In the meantime, I've got some of my
>> own. I loaded 90 dictionaries from https://github.com/wooorm/dictionaries
>> (there's more, but I ignored dialects of the same base language). Together
>> they currently consume a humble 166MB. With one of my less memory-hungry
>> approaches, they'd take ~500MB (maybe less if I optimize, but probably not
>> significantly). Is this very bad or tolerable for, say, 50% speedup?
>> >
>> > I've seen huge *.aff files, and I'm planning to do something with affix
>> FSTs, too. They take some noticeable time, too, but much less than *.dic-s
>> one, so for now I concentrate on *.dic.
>> >
>> > > Sure, but 20% of those linear scans are maybe 7x slower
>> >
>> > Checked that. The distribution appears to be decreasing monotonically.
>> No linear scans are longer than 8, and ~85% of all linear scans end after
>> no more than 1 miss.
>> >
>> > I'll try BYTE1 if I manage to do it. It turned out to be surprisingly
>> complicated :(
>> >
>> > On Wed, Feb 10, 2021 at 5:04 PM Robert Muir  wrote:
>> >>
>> >> Peter, looks like you are way ahead of me :) Thanks for all the work
>> >> you have been doing here, and thanks to Dawid for helping!
>> >>
>> >> You probably know a lot of this code better than me at this point, but
>> >> I remember a couple of these pain points, inline below:
>> >>
>> >> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
>> >>  wrote:
>> >> >
>> >> > Hi Robert,
>> >> >
>> >> > Yes, having multiple dictionaries in the same process would increase
>> the memory significantly. Do you have any idea about how many of them
>> people are loading, and how much memory they give to Lucene?
>> >>
>> >> Yeah in many cases, the user is using a server such as solr or
>> elasticsearch.
>> >> Let's use solr as an example, as others are here to correct it, if I
>> am wrong.
>> >>
>> >> Example to understand the challenges: user uses one of solr's 3
>> >> mechanisms to detect language and send to different pipeline:
>> >>
>> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
>> >> Now we know these language detectors are imperfect, if the user maps a
>> >> lot of languages to hunspell pipelines, they may load lots of
>> >> dictionaries, even by just one stray miscategorized document.
>> >> So it doesn't have to be some extreme "enterprise" use-case like
>> >> wikipedia.org, it can happen for a little guy faced with a
>> >> multilingual corpus.
>> >>
>> >> Imagine the user decides to go further, and host solr search in this
>> >> way for a couple local businesses or govt agencies.
>> >> They support many languages and possibly use this detection scheme
>> >> above to try to make language a "non-issue".
>> >> The user may assign each customer a solr "core" (separate index) with
>> >> this configuration.
>> >> Does each solr core load its own HunspellStemFactory? I think it might
>> >> (in isolated classloader), I could be wrong.
>> >>
>> >> For the elasticsearch case, maybe the resource usage in the same case
>> >> is lower, because they reuse dictionaries per-node?
>> >> I think this is how it works, but I honestly can't remember.
>> >> Still the problem remains, easy to end up with dozens of these things
>> in memory.
>> >>
>> >> Also we have the problem that memory usage for a specific language can blow up
>> >> in several ways.
>> >> Some languages have a bigger .aff file than .dic!
>> >>
>> >> > Thanks for the idea about root arcs. I've done some quick sampling
>> and tracing (for German). 80% of root arc processing time is spent in
>> direct addressing, and the remainder is linear scan (so root arcs don't
>> seem to present major issues). For non-root arcs, ~50% is directly
>> addressed, ~45% linearly-scanned, and the remainder binary-searched.
>> Overall there's about 60% of direct addressing, both in time and invocation
>> counts, which doesn't seem too bad (or am I mistaken?). Currently BYTE4
>> inputs are used. Reducing that might increase the number of directly
>> addressed arcs, but I'm not sure that'd speed up much given that time and
>> invocation counts seem to correlate.
>> >> >
>> >>
>> >> Sure, but 20% of those linear scans are maybe 7x slower, it's
>> >> O(log2(alphabet_size)) right (assuming alphabet size ~ 128)?

Re: Help needed with fixing lucene-site GitHub repo

2021-02-10 Thread Anshum Gupta
Can you elaborate more on this? I was also trying to see if I could
just create a PR to merge production -> master, but that would just mess
up the history. It will bring the code in sync but I'm also not sure if
that would fix the larger problem.

On Wed, Feb 10, 2021 at 12:01 PM Michael Sokolov  wrote:

> Have you considered using a merge commit for this? That won't require
> force pushing
>
> On Wed, Feb 10, 2021 at 2:51 PM Anshum Gupta 
> wrote:
> >
> > Hi All,
> >
> > Seems like during the last release, we directly committed the website
> changes to the production branch, bypassing the master. This is now causing
> issues with merging updates from master into prod using the simple 'create
> PR' -> 'merge master to prod' workflow.
> >
> > I was working with Cassandra to clean this up but we'd need help from
> someone who's more confident and experienced with such GitHub issues.
> >
> > I tried rebasing production to master in hopes that we'll get the
> missing commits correctly into master, but that seems to warn of the
> diverging branches and requires a force push, something I wasn't
> comfortable doing without another set of eyes :)
> >
> > If you have suggestions or know what to do here, please help with fixing
> the branch.
> >
> > --
> > Anshum Gupta
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
Anshum Gupta


Re: Hunspell performance

2021-02-10 Thread Dawid Weiss
> To me the challenge with such a change is just trying to prevent
> strange dictionaries from blowing up to 30x the space :)

Maybe the "backend" could be configurable somehow so that you could change
the strategy depending on your needs?... I haven't looked at how FSTs are
used but if it can be hidden behind a facade then an alternative
implementation could be provided depending on one's need?

D.


>
> On Wed, Feb 10, 2021 at 12:53 PM Peter Gromov
>  wrote:
> >
> > I was hoping for some numbers :) In the meantime, I've got some of my
> own. I loaded 90 dictionaries from https://github.com/wooorm/dictionaries
> (there's more, but I ignored dialects of the same base language). Together
> they currently consume a humble 166MB. With one of my less memory-hungry
> approaches, they'd take ~500MB (maybe less if I optimize, but probably not
> significantly). Is this very bad or tolerable for, say, 50% speedup?
> >
> > I've seen huge *.aff files, and I'm planning to do something with affix
> FSTs, too. They take some noticeable time, too, but much less than *.dic-s
> one, so for now I concentrate on *.dic.
> >
> > > Sure, but 20% of those linear scans are maybe 7x slower
> >
> > Checked that. The distribution appears to be decreasing monotonically.
> No linear scans are longer than 8, and ~85% of all linear scans end after
> no more than 1 miss.
> >
> > I'll try BYTE1 if I manage to do it. It turned out to be surprisingly
> complicated :(
> >
> > On Wed, Feb 10, 2021 at 5:04 PM Robert Muir  wrote:
> >>
> >> Peter, looks like you are way ahead of me :) Thanks for all the work
> >> you have been doing here, and thanks to Dawid for helping!
> >>
> >> You probably know a lot of this code better than me at this point, but
> >> I remember a couple of these pain points, inline below:
> >>
> >> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
> >>  wrote:
> >> >
> >> > Hi Robert,
> >> >
> >> > Yes, having multiple dictionaries in the same process would increase
> the memory significantly. Do you have any idea about how many of them
> people are loading, and how much memory they give to Lucene?
> >>
> >> Yeah in many cases, the user is using a server such as solr or
> elasticsearch.
> >> Let's use solr as an example, as others are here to correct it, if I am
> wrong.
> >>
> >> Example to understand the challenges: user uses one of solr's 3
> >> mechanisms to detect language and send to different pipeline:
> >>
> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
> >> Now we know these language detectors are imperfect, if the user maps a
> >> lot of languages to hunspell pipelines, they may load lots of
> >> dictionaries, even by just one stray miscategorized document.
> >> So it doesn't have to be some extreme "enterprise" use-case like
> >> wikipedia.org, it can happen for a little guy faced with a
> >> multilingual corpus.
> >>
> >> Imagine the user decides to go further, and host solr search in this
> >> way for a couple local businesses or govt agencies.
> >> They support many languages and possibly use this detection scheme
> >> above to try to make language a "non-issue".
> >> The user may assign each customer a solr "core" (separate index) with
> >> this configuration.
> >> Does each solr core load its own HunspellStemFactory? I think it might
> >> (in isolated classloader), I could be wrong.
> >>
> >> For the elasticsearch case, maybe the resource usage in the same case
> >> is lower, because they reuse dictionaries per-node?
> >> I think this is how it works, but I honestly can't remember.
> >> Still the problem remains, easy to end up with dozens of these things
> in memory.
> >>
> >> Also we have the problem that memory usage for a specific language can blow up
> >> in several ways.
> >> Some languages have a bigger .aff file than .dic!
> >>
> >> > Thanks for the idea about root arcs. I've done some quick sampling
> and tracing (for German). 80% of root arc processing time is spent in
> direct addressing, and the remainder is linear scan (so root arcs don't
> seem to present major issues). For non-root arcs, ~50% is directly
> addressed, ~45% linearly-scanned, and the remainder binary-searched.
> Overall there's about 60% of direct addressing, both in time and invocation
> counts, which doesn't seem too bad (or am I mistaken?). Currently BYTE4
> inputs are used. Reducing that might increase the number of directly
> addressed arcs, but I'm not sure that'd speed up much given that time and
> invocation counts seem to correlate.
> >> >
> >>
> >> Sure, but 20% of those linear scans are maybe 7x slower, it's
> >> O(log2(alphabet_size)) right (assuming alphabet size ~ 128)?
> >> Hard to reason about, but maybe worth testing out. It still helps for
> >> all the other segmenters (japanese, korean) using fst.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Help needed with fixing lucene-site GitHub repo

2021-02-10 Thread Michael Sokolov
Have you considered using a merge commit for this? That won't require
force pushing

On Wed, Feb 10, 2021 at 2:51 PM Anshum Gupta  wrote:
>
> Hi All,
>
> Seems like during the last release, we directly committed the website changes 
> to the production branch, bypassing the master. This is now causing issues 
> with merging updates from master into prod using the simple 'create PR' -> 
> 'merge master to prod' workflow.
>
> I was working with Cassandra to clean this up but we'd need help from someone 
> who's more confident and experienced with such GitHub issues.
>
> I tried rebasing production to master in hopes that we'll get the missing 
> commits correctly into master, but that seems to warn of the diverging 
> branches and requires a force push, something I wasn't comfortable doing 
> without another set of eyes :)
>
> If you have suggestions or know what to do here, please help with fixing the 
> branch.
>
> --
> Anshum Gupta

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Help needed with fixing lucene-site GitHub repo

2021-02-10 Thread Anshum Gupta
Hi All,

Seems like during the last release, we directly committed the website
changes to the production branch, bypassing the master. This is now causing
issues with merging updates from master into prod using the simple 'create
PR' -> 'merge master to prod' workflow.

I was working with Cassandra to clean this up but we'd need help from
someone who's more confident and experienced with such GitHub issues.

I tried rebasing production to master in hopes that we'll get the missing
commits correctly into master, but that seems to warn of the diverging
branches and requires a force push, something I wasn't comfortable doing
without another set of eyes :)

If you have suggestions or know what to do here, please help with fixing
the branch.

-- 
Anshum Gupta


Re: 8.8.1 release soon

2021-02-10 Thread Anshum Gupta
Thanks for taking care of this, Tim.

I've added a note to the 'downloads' page so folks who head there know
about this issue and that a release with the fix is in the works. (thanks
for reviewing that too :) )

-Anshum

On Wed, Feb 10, 2021 at 7:37 AM Timothy Potter  wrote:

> I was a tad bit ambitious with backporting SOLR-12182 to 8.8.0 and it
> seems we have no automated SolrJ back-compat tests in our RC vetting
> process, so unfortunately older SolrJ clients don't work with Solr 8.8
> server, see SOLR-15145.
>
> I'd like to release 8.8.1 ASAP to address this problem and will be the RM.
>
> Let me know if you have any other issues you think need to go into 8.8.1,
> otherwise I'd like to build an RC tomorrow AM US time. It looks like there
> are already a number of updates going in for 8.9 so let's keep the updates
> for 8.8.1 to a minimum please.
>
> Cheers,
> Tim
>


-- 
Anshum Gupta


Re: Hunspell performance

2021-02-10 Thread Robert Muir
50% speedup for the HunspellStemmer use case? for 3x the memory space?

Just my opinion: Seems like the correct tradeoff to me.
Analysis chain is a serious bottleneck for indexing speed: this
hunspell is one of the slower ones.

To me the challenge with such a change is just trying to prevent
strange dictionaries from blowing up to 30x the space :)

On Wed, Feb 10, 2021 at 12:53 PM Peter Gromov
 wrote:
>
> I was hoping for some numbers :) In the meantime, I've got some of my own. I 
> loaded 90 dictionaries from https://github.com/wooorm/dictionaries (there's 
> more, but I ignored dialects of the same base language). Together they 
> currently consume a humble 166MB. With one of my less memory-hungry 
> approaches, they'd take ~500MB (maybe less if I optimize, but probably not 
> significantly). Is this very bad or tolerable for, say, 50% speedup?
>
> I've seen huge *.aff files, and I'm planning to do something with affix FSTs, 
> too. They take some noticeable time, too, but much less than *.dic-s one, so 
> for now I concentrate on *.dic.
>
> > Sure, but 20% of those linear scans are maybe 7x slower
>
> Checked that. The distribution appears to be decreasing monotonically. No 
> linear scans are longer than 8, and ~85% of all linear scans end after no 
> more than 1 miss.
>
> I'll try BYTE1 if I manage to do it. It turned out to be surprisingly 
> complicated :(
>
> On Wed, Feb 10, 2021 at 5:04 PM Robert Muir  wrote:
>>
>> Peter, looks like you are way ahead of me :) Thanks for all the work
>> you have been doing here, and thanks to Dawid for helping!
>>
>> You probably know a lot of this code better than me at this point, but
>> I remember a couple of these pain points, inline below:
>>
>> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
>>  wrote:
>> >
>> > Hi Robert,
>> >
>> > Yes, having multiple dictionaries in the same process would increase the 
>> > memory significantly. Do you have any idea about how many of them people 
>> > are loading, and how much memory they give to Lucene?
>>
>> Yeah in many cases, the user is using a server such as solr or elasticsearch.
>> Let's use solr as an example, as others are here to correct it, if I am 
>> wrong.
>>
>> Example to understand the challenges: user uses one of solr's 3
>> mechanisms to detect language and send to different pipeline:
>> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
>> Now we know these language detectors are imperfect, if the user maps a
>> lot of languages to hunspell pipelines, they may load lots of
>> dictionaries, even by just one stray miscategorized document.
>> So it doesn't have to be some extreme "enterprise" use-case like
>> wikipedia.org, it can happen for a little guy faced with a
>> multilingual corpus.
>>
>> Imagine the user decides to go further, and host solr search in this
>> way for a couple local businesses or govt agencies.
>> They support many languages and possibly use this detection scheme
>> above to try to make language a "non-issue".
>> The user may assign each customer a solr "core" (separate index) with
>> this configuration.
>> Does each solr core load its own HunspellStemFactory? I think it might
>> (in isolated classloader), I could be wrong.
>>
>> For the elasticsearch case, maybe the resource usage in the same case
>> is lower, because they reuse dictionaries per-node?
>> I think this is how it works, but I honestly can't remember.
>> Still the problem remains, easy to end up with dozens of these things in 
>> memory.
>>
>> Also we have the problem that memory usage for a specific language can blow up
>> in several ways.
>> Some languages have a bigger .aff file than .dic!
>>
>> > Thanks for the idea about root arcs. I've done some quick sampling and 
>> > tracing (for German). 80% of root arc processing time is spent in direct 
>> > addressing, and the remainder is linear scan (so root arcs don't seem to 
>> > present major issues). For non-root arcs, ~50% is directly addressed, ~45% 
>> > linearly-scanned, and the remainder binary-searched. Overall there's about 
>> > 60% of direct addressing, both in time and invocation counts, which 
>> > doesn't seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. 
>> > Reducing that might increase the number of directly addressed arcs, but 
>> > I'm not sure that'd speed up much given that time and invocation counts 
>> > seem to correlate.
>> >
>>
>> Sure, but 20% of those linear scans are maybe 7x slower, it's
>> O(log2(alphabet_size)) right (assuming alphabet size ~ 128)?
>> Hard to reason about, but maybe worth testing out. It still helps for
>> all the other segmenters (japanese, korean) using fst.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org

Re: Hunspell performance

2021-02-10 Thread Peter Gromov
I was hoping for some numbers :) In the meantime, I've got some of my own.
I loaded 90 dictionaries from https://github.com/wooorm/dictionaries
(there's more, but I ignored dialects of the same base language). Together
they currently consume a humble 166MB. With one of my less memory-hungry
approaches, they'd take ~500MB (maybe less if I optimize, but probably not
significantly). Is this very bad or tolerable for, say, 50% speedup?

I've seen huge *.aff files, and I'm planning to do something with affix
FSTs, too. They take some noticeable time as well, but much less than the *.dic
ones, so for now I'm concentrating on *.dic.

> Sure, but 20% of those linear scans are maybe 7x slower

Checked that. The distribution appears to be decreasing monotonically. No
linear scans are longer than 8, and ~85% of all linear scans end after no
more than 1 miss.

I'll try BYTE1 if I manage to do it. It turned out to be surprisingly
complicated :(

On Wed, Feb 10, 2021 at 5:04 PM Robert Muir  wrote:

> Peter, looks like you are way ahead of me :) Thanks for all the work
> you have been doing here, and thanks to Dawid for helping!
>
> You probably know a lot of this code better than me at this point, but
> I remember a couple of these pain points, inline below:
>
> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
>  wrote:
> >
> > Hi Robert,
> >
> > Yes, having multiple dictionaries in the same process would increase the
> memory significantly. Do you have any idea about how many of them people
> are loading, and how much memory they give to Lucene?
>
> Yeah in many cases, the user is using a server such as solr or
> elasticsearch.
> Let's use solr as an example, as others are here to correct it, if I am
> wrong.
>
> Example to understand the challenges: user uses one of solr's 3
> mechanisms to detect language and send to different pipeline:
>
> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
> Now we know these language detectors are imperfect, if the user maps a
> lot of languages to hunspell pipelines, they may load lots of
> dictionaries, even by just one stray miscategorized document.
> So it doesn't have to be some extreme "enterprise" use-case like
> wikipedia.org, it can happen for a little guy faced with a
> multilingual corpus.
>
> Imagine the user decides to go further, and host solr search in this
> way for a couple local businesses or govt agencies.
> They support many languages and possibly use this detection scheme
> above to try to make language a "non-issue".
> The user may assign each customer a solr "core" (separate index) with
> this configuration.
> Does each solr core load its own HunspellStemFactory? I think it might
> (in isolated classloader), I could be wrong.
>
> For the elasticsearch case, maybe the resource usage in the same case
> is lower, because they reuse dictionaries per-node?
> I think this is how it works, but I honestly can't remember.
> Still the problem remains, easy to end up with dozens of these things in
> memory.
>
> Also we have the problem that memory usage for a specific language can blow up
> in several ways.
> Some languages have a bigger .aff file than .dic!
>
> > Thanks for the idea about root arcs. I've done some quick sampling and
> tracing (for German). 80% of root arc processing time is spent in direct
> addressing, and the remainder is linear scan (so root acrs don't seem to
> present major issues). For non-root arcs, ~50% is directly addressed, ~45%
> linearly-scanned, and the remainder binary-searched. Overall there's about
> 60% of direct addressing, both in time and invocation counts, which doesn't
> seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing
> that might increase the number of directly addressed arcs, but I'm not sure
> that'd speed up much given that time and invocation counts seem to
> correlate.
> >
>
> Sure, but 20% of those linear scans are maybe 7x slower, it's
> O(log2(alphabet_size)) right (assuming alphabet size ~ 128)?
> Hard to reason about, but maybe worth testing out. It still helps for
> all the other segmenters (japanese, korean) using fst.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: 8.8.1 release soon

2021-02-10 Thread Tomás Fernández Löbbe
I'd like to get SOLR-15114 in. It already has a
patch that I'm testing; I'll try to merge it today.

On Wed, Feb 10, 2021 at 8:23 AM Timothy Potter  wrote:

> Hi Ishan,
>
> Please let me know how SOLR-15138 is looking on Friday and we can make a
> decision then. My hope is for 8.8.1 sooner than later, but a couple more
> days seems fine too.
>
> Cheers,
> Tim
>
> On Wed, Feb 10, 2021 at 8:55 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
>> I'd like for us to include SOLR-15138 please, but the fix is still under
>> review and development. Please let us know if it should be possible for us
>> to wait until that one is done (hopefully quickly), otherwise we can
>> release it later (if you want to proceed with the release before this is
>> ready). Thanks for volunteering!
>>
>> On Wed, 10 Feb, 2021, 9:07 pm Timothy Potter, 
>> wrote:
>>
>>> I was a tad bit ambitious with backporting SOLR-12182 to 8.8.0 and it
>>> seems we have no automated SolrJ back-compat tests in our RC vetting
>>> process, so unfortunately older SolrJ clients don't work with Solr 8.8
>>> server, see SOLR-15145.
>>>
>>> I'd like to release 8.8.1 ASAP to address this problem and will be the
>>> RM.
>>>
>>> Let me know if you have any other issues you think need to go into
>>> 8.8.1, otherwise I'd like to build an RC tomorrow AM US time. It looks like
>>> there are already a number of updates going in for 8.9 so let's keep the
>>> updates for 8.8.1 to a minimum please.
>>>
>>> Cheers,
>>> Tim
>>>
>>


Re: Hunspell performance

2021-02-10 Thread Peter Gromov
>
> at the price of not being able to enumerate all of a node's outgoing arcs.
>

So FSTEnum isn't possible there? Too bad, I need it for suggestions.


Re: 8.8.1 release soon

2021-02-10 Thread Timothy Potter
Hi Ishan,

Please let me know how SOLR-15138 is looking on Friday and we can make a
decision then. My hope is for 8.8.1 sooner than later, but a couple more
days seems fine too.

Cheers,
Tim

On Wed, Feb 10, 2021 at 8:55 AM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> I'd like for us to include SOLR-15138 please, but the fix is still under
> review and development. Please let us know if it should be possible for us
> to wait until that one is done (hopefully quickly), otherwise we can
> release it later (if you want to proceed with the release before this is
> ready). Thanks for volunteering!
>
> On Wed, 10 Feb, 2021, 9:07 pm Timothy Potter, 
> wrote:
>
>> I was a tad bit ambitious with backporting SOLR-12182 to 8.8.0 and it
>> seems we have no automated SolrJ back-compat tests in our RC vetting
>> process, so unfortunately older SolrJ clients don't work with Solr 8.8
>> server, see SOLR-15145.
>>
>> I'd like to release 8.8.1 ASAP to address this problem and will be the RM.
>>
>> Let me know if you have any other issues you think need to go into 8.8.1,
>> otherwise I'd like to build an RC tomorrow AM US time. It looks like there
>> are already a number of updates going in for 8.9 so let's keep the updates
>> for 8.8.1 to a minimum please.
>>
>> Cheers,
>> Tim
>>
>


Re: Hunspell performance

2021-02-10 Thread Robert Muir
Peter, looks like you are way ahead of me :) Thanks for all the work
you have been doing here, and thanks to Dawid for helping!

You probably know a lot of this code better than me at this point, but
I remember a couple of these pain points, inline below:

On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
 wrote:
>
> Hi Robert,
>
> Yes, having multiple dictionaries in the same process would increase the 
> memory significantly. Do you have any idea about how many of them people are 
> loading, and how much memory they give to Lucene?

Yeah in many cases, the user is using a server such as solr or elasticsearch.
Let's use Solr as an example; others here can correct me if I am wrong.

An example to understand the challenges: a user uses one of Solr's 3
mechanisms to detect language and send documents to different pipelines:
https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
Now we know these language detectors are imperfect: if the user maps a
lot of languages to hunspell pipelines, they may load lots of
dictionaries, even by just one stray miscategorized document.
So it doesn't have to be some extreme "enterprise" use-case like
wikipedia.org; it can happen for a little guy faced with a
multilingual corpus.

Imagine the user decides to go further, and host solr search in this
way for a couple local businesses or govt agencies.
They support many languages and possibly use this detection scheme
above to try to make language a "non-issue".
The user may assign each customer a solr "core" (separate index) with
this configuration.
Does each solr core load its own HunspellStemFactory? I think it might
(in an isolated classloader); I could be wrong.

For the elasticsearch case, maybe the resource usage in the same case
is lower, because they reuse dictionaries per-node?
I think this is how it works, but I honestly can't remember.
Still, the problem remains: it's easy to end up with dozens of these things in memory.

Also we have the problem that memory usage for a specific language can blow up
in several ways.
Some languages have a bigger .aff file than .dic!

> Thanks for the idea about root arcs. I've done some quick sampling and 
> tracing (for German). 80% of root arc processing time is spent in direct 
> addressing, and the remainder is linear scan (so root arcs don't seem to 
> present major issues). For non-root arcs, ~50% is directly addressed, ~45% 
> linearly-scanned, and the remainder binary-searched. Overall there's about 
> 60% of direct addressing, both in time and invocation counts, which doesn't 
> seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing 
> that might increase the number of directly addressed arcs, but I'm not sure 
> that'd speed up much given that time and invocation counts seem to correlate.
>

Sure, but 20% of those linear scans are maybe 7x slower, it's
O(log2(alphabet_size)) right (assuming alphabet size ~ 128)?
Hard to reason about, but maybe worth testing out. It still helps for
all the other segmenters (japanese, korean) using fst.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Hunspell performance

2021-02-10 Thread Dawid Weiss
> They just seem to need reading/analyzing too many bytes, doing much more
> work than a typical hashmap access :)

This is a very tough score to beat... Pretty much any trie structure will
have to descend somehow. FSTs are additionally densely packed in Lucene and
outgoing arc lookup is what's causing the slowdown. An alternative
memory-conservative fst representation is a hash table of [fst-node_id,
out-arc-label] -> fst-node_id. A structure like this one can be made
reasonably compact and path traversals (lookups) are constant-time... at
the price of not being able to enumerate all of a node's outgoing arcs. I've
used this approach in the past and it gave a reasonable speedup, although
it's a really specialized data structure.

D.
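
(A rough sketch of that representation, for illustration; this is not Lucene
code, and it assumes node ids and labels fit the packing below.)

  import java.util.HashMap;
  import java.util.Map;

  /**
   * Sketch of the [fst-node_id, out-arc-label] -> fst-node_id idea: every arc is
   * one hash-table entry, so walking a word is a sequence of constant-time probes,
   * but a node's outgoing arcs can no longer be enumerated.
   */
  final class HashedTransitions {
    private final Map<Long, Integer> arcs = new HashMap<>();

    private static long key(int nodeId, char label) {
      return ((long) nodeId << 16) | label;   // assumes labels fit in 16 bits
    }

    void addArc(int fromNode, char label, int toNode) {
      arcs.put(key(fromNode, label), toNode);
    }

    /** Returns the node reached from rootNode by the given word, or -1 if there is no path. */
    int walk(int rootNode, CharSequence word) {
      int node = rootNode;
      for (int i = 0; i < word.length(); i++) {
        Integer next = arcs.get(key(node, word.charAt(i)));
        if (next == null) {
          return -1;
        }
        node = next;
      }
      return node;
    }
  }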

On Wed, Feb 10, 2021 at 3:43 PM Peter Gromov
 wrote:

> Hi Robert,
>
> Yes, having multiple dictionaries in the same process would increase the
> memory significantly. Do you have any idea about how many of them people
> are loading, and how much memory they give to Lucene?
>
> Yes, I've mentioned I've prototyped "using FST in a smarter way" :)
> Namely, it's possible to cache the arcs/outputs used for searching for
> "electrification" and reuse most of them after an affix is stripped and
> we're now faced with "electrify". This allocates a bit more for each token,
> but gives a noticeable speedup. I'm not entirely happy with the resulting
> code complexity and performance, but I can create a PR.
>
> I'm talking only about plain old affix removal. I have no inexact
> matching. Decomposition basically works like "try to break the word in
> various places and stem them separately, looking at some additional flags".
> For the first word part, some arc/outputs could be reused from initial
> analysis, but for the next ones most likely not. And when I tried the
> aforementioned reusing, the code became so unpleasant that I started
> looking for alternatives :)
>
> One thing I don't like about the arc caching approach is that it looks
> like a dead end: the FST invocation count seems to be already close to
> minimal, and yet its traversal is still very visible in the CPU snapshots.
> And I see no low-hanging fruits in FST internals. They just seem to need
> reading/analyzing too many bytes, doing much more work than a typical
> hashmap access :)
>
> Thanks for the idea about root arcs. I've done some quick sampling and
> tracing (for German). 80% of root arc processing time is spent in direct
> addressing, and the remainder is linear scan (so root arcs don't seem to
> present major issues). For non-root arcs, ~50% is directly addressed, ~45%
> linearly-scanned, and the remainder binary-searched. Overall there's about
> 60% of direct addressing, both in time and invocation counts, which doesn't
> seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing
> that might increase the number of directly addressed arcs, but I'm not sure
> that'd speed up much given that time and invocation counts seem to
> correlate.
>
> Peter
>
> On Wed, Feb 10, 2021 at 2:52 PM Robert Muir  wrote:
>
>> Just throwing out another random idea: if you are doing a lot of FST
>> traversals (e.g. for inexact matching or decomposition), you may end
>> out "hammering" the root arcs of the FST heavily, depending on how the
>> algorithm works. Because root arcs are "busy", they end out being
>> O(logN) lookups in the FST and get slow. Japanese and Korean analyzers
>> are doing "decompounding" too, and have hacks to waste some RAM,
>> ensuring the heavy root arc traversals are O(1):
>>
>> https://github.com/apache/lucene-solr/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java
>>
>> Bruno did some FST improvements across the board here, but last time
>> we checked, these hacks were still needed for segmentation usecases
>> like this: see his benchmark here: https://s.apache.org/ffelc
>>
>> For example, maybe it makes sense to cache a few hundred nodes here in
>> a similar way, depending on dictionary's alphabet size, to accelerate
>> segmentation, I don't know if it will help. Maybe also the current FST
>> "INPUT_TYPEs" are inappropriate and it would work better as BYTE1 FST
>> rather than BYTE2 or BYTE4 or whatever it is using now. The current
>> stemming doesn't put much pressure on this, so it isn't optimized.
>>
>> On Wed, Feb 10, 2021 at 7:53 AM Robert Muir  wrote:
>> >
>> > The RAM usage used to be bad as you describe, it blows up way worse
>> > for other languages than German. There were many issues :)
>> >
>> > For Lucene, one common issue was that users wanted to have a lot of
>> > these things in RAM: e.g. supporting many different languages on a
>> > single server (multilingual data) and so forth.
>> > Can we speed up your use-case by using the FST in a smarter way? Why
>> > are there so many traversals... is it the way it is doing inexact
>> > matching? decomposition?
>> >
>> > That was the trick done with 

Re: 8.8.1 release soon

2021-02-10 Thread Ishan Chattopadhyaya
I'd like for us to include SOLR-15138 please, but the fix is still under
review and development. Please let us know if it should be possible for us
to wait until that one is done (hopefully quickly), otherwise we can
release it later (if you want to proceed with the release before this is
ready). Thanks for volunteering!

On Wed, 10 Feb, 2021, 9:07 pm Timothy Potter,  wrote:

> I was a tad bit ambitious with backporting SOLR-12182 to 8.8.0 and it
> seems we have no automated SolrJ back-compat tests in our RC vetting
> process, so unfortunately older SolrJ clients don't work with Solr 8.8
> server, see SOLR-15145.
>
> I'd like to release 8.8.1 ASAP to address this problem and will be the RM.
>
> Let me know if you have any other issues you think need to go into 8.8.1,
> otherwise I'd like to build an RC tomorrow AM US time. It looks like there
> are already a number of updates going in for 8.9 so let's keep the updates
> for 8.8.1 to a minimum please.
>
> Cheers,
> Tim
>


8.8.1 release soon

2021-02-10 Thread Timothy Potter
I was a tad bit ambitious with backporting SOLR-12182 to 8.8.0 and it seems
we have no automated SolrJ back-compat tests in our RC vetting process, so
unfortunately older SolrJ clients don't work with Solr 8.8 server, see
SOLR-15145.

I'd like to release 8.8.1 ASAP to address this problem and will be the RM.

Let me know if you have any other issues you think need to go into 8.8.1,
otherwise I'd like to build an RC tomorrow AM US time. It looks like there
are already a number of updates going in for 8.9 so let's keep the updates
for 8.8.1 to a minimum please.

Cheers,
Tim


Re: Hunspell performance

2021-02-10 Thread Peter Gromov
Hi Robert,

Yes, having multiple dictionaries in the same process would increase the
memory significantly. Do you have any idea about how many of them people
are loading, and how much memory they give to Lucene?

Yes, I've mentioned I've prototyped "using FST in a smarter way" :) Namely,
it's possible to cache the arcs/outputs used for searching for
"electrification" and reuse most of them after an affix is stripped and
we're now faced with "electrify". This allocates a bit more for each token,
but gives a noticeable speedup. I'm not entirely happy with the resulting
code complexity and performance, but I can create a PR.
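
(To make the kind of reuse concrete, here is a simplified sketch, not the actual
patch: remember the nodes visited for the previous word and restart the walk from
the longest shared prefix. The Stepper interface is a hypothetical stand-in for
the real FST arc lookup.)

  import java.util.Arrays;

  final class PrefixCachingWalker {
    interface Stepper {
      int ROOT = 0;
      /** Returns the node reached from {@code node} via {@code label}, or -1 if there is none. */
      int next(int node, char label);
    }

    private final Stepper stepper;
    private char[] lastWord = new char[0];
    private int[] lastNodes = new int[] {Stepper.ROOT};  // lastNodes[i] = node after i chars
    private int lastLen = 0;

    PrefixCachingWalker(Stepper stepper) {
      this.stepper = stepper;
    }

    /** Returns the node for word[0..length), or -1; reuses the path shared with the previous word. */
    int walk(char[] word, int length) {
      int common = 0;
      while (common < length && common < lastLen && word[common] == lastWord[common]) {
        common++;
      }
      if (lastNodes.length < length + 1) {
        lastNodes = Arrays.copyOf(lastNodes, length + 1);
      }
      lastWord = Arrays.copyOf(word, length);
      int node = lastNodes[common];
      int i = common;
      while (i < length && node >= 0) {
        node = stepper.next(node, word[i]);
        lastNodes[++i] = node;
      }
      lastLen = (node >= 0) ? i : i - 1;  // cache only the successfully reached part of the path
      return node;
    }
  }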

I'm talking only about plain old affix removal. I have no inexact matching.
Decomposition basically works like "try to break the word in various places
and stem them separately, looking at some additional flags". For the first
word part, some arc/outputs could be reused from initial analysis, but for
the next ones most likely not. And when I tried the aforementioned reusing,
the code became so unpleasant that I started looking for alternatives :)
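
(As a toy illustration of that "try to break the word" idea; this ignores the
real compound flags, minimum-length settings, and overlap rules, and the names
are made up.)

  import java.util.function.Predicate;

  /** Toy sketch of compound splitting: not the real Hunspell algorithm, just the shape of it. */
  final class NaiveDecompounder {
    private static final int MIN_PART = 3;   // illustrative minimum part length

    /** True if the word can be split into parts that each pass the per-part stem check. */
    static boolean isCompound(String word, Predicate<String> isStemmable) {
      for (int split = MIN_PART; split <= word.length() - MIN_PART; split++) {
        String head = word.substring(0, split);
        String tail = word.substring(split);
        // each part must be a valid word (or, recursively, itself a compound)
        if (isStemmable.test(head)
            && (isStemmable.test(tail) || isCompound(tail, isStemmable))) {
          return true;
        }
      }
      return false;
    }
  }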

One thing I don't like about the arc caching approach is that it looks like
a dead end: the FST invocation count seems to be already close to minimal,
and yet its traversal is still very visible in the CPU snapshots. And I see
no low-hanging fruit in FST internals. They just seem to need
reading/analyzing too many bytes, doing much more work than a typical
hashmap access :)

Thanks for the idea about root arcs. I've done some quick sampling and
tracing (for German). 80% of root arc processing time is spent in direct
addressing, and the remainder is linear scan (so root arcs don't seem to
present major issues). For non-root arcs, ~50% is directly addressed, ~45%
linearly-scanned, and the remainder binary-searched. Overall there's about
60% of direct addressing, both in time and invocation counts, which doesn't
seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing
that might increase the number of directly addressed arcs, but I'm not sure
that'd speed up much given that time and invocation counts seem to
correlate.

Peter

On Wed, Feb 10, 2021 at 2:52 PM Robert Muir  wrote:

> Just throwing out another random idea: if you are doing a lot of FST
> traversals (e.g. for inexact matching or decomposition), you may end
> out "hammering" the root arcs of the FST heavily, depending on how the
> algorithm works. Because root arcs are "busy", they end out being
> O(logN) lookups in the FST and get slow. Japanese and Korean analyzers
> are doing "decompounding" too, and have hacks to waste some RAM,
> ensuring the heavy root arc traversals are O(1):
>
> https://github.com/apache/lucene-solr/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java
>
> Bruno did some FST improvements across the board here, but last time
> we checked, these hacks were still needed for segmentation usecases
> like this: see his benchmark here: https://s.apache.org/ffelc
>
> For example, maybe it makes sense to cache a few hundred nodes here in
> a similar way, depending on dictionary's alphabet size, to accelerate
> segmentation, I don't know if it will help. Maybe also the current FST
> "INPUT_TYPEs" are inappropriate and it would work better as BYTE1 FST
> rather than BYTE2 or BYTE4 or whatever it is using now. The current
> stemming doesn't put much pressure on this, so it isn't optimized.
>
> On Wed, Feb 10, 2021 at 7:53 AM Robert Muir  wrote:
> >
> > The RAM usage used to be bad as you describe, it blows up way worse
> > for other languages than German. There were many issues :)
> >
> > For Lucene, one common issue was that users wanted to have a lot of
> > these things in RAM: e.g. supporting many different languages on a
> > single server (multilingual data) and so forth.
> > Can we speed up your use-case by using the FST in a smarter way? Why
> > are there so many traversals... is it the way it is doing inexact
> > matching? decomposition?
> >
> > That was the trick done with stemming, and the stemming was
> > accelerated with some side data structures. For example "patternIndex"
> > thing which is a scary precomputed list of tableized DFAs... its
> > wasting a "little" space with these tables to speed up hotspot for
> > stemming. In that patternIndex example, some assumptions / limits had
> > to be set, that hopefully no dictionary would ever make: that's all
> > the "please report this to dev@lucene.apache.org" checks in the code.
> > some tests were run against all the crazy OO dictionaries out there to
> > examine the memory usage when looking at changes like this. Some of
> > these are really, really crazy and do surprising things.
> >
> > On Wed, Feb 10, 2021 at 6:16 AM Peter Gromov
> >  wrote:
> > >
> > > Hi there,
> > >
> > > I'm mostly done with supporting major Hunspell features necessary for
> most european languages 

Re: Hunspell performance

2021-02-10 Thread Robert Muir
Just throwing out another random idea: if you are doing a lot of FST
traversals (e.g. for inexact matching or decomposition), you may end
out "hammering" the root arcs of the FST heavily, depending on how the
algorithm works. Because root arcs are "busy", they end out being
O(logN) lookups in the FST and get slow. Japanese and Korean analyzers
are doing "decompounding" too, and have hacks to waste some RAM,
ensuring the heavy root arc traversals are O(1):
https://github.com/apache/lucene-solr/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java

Bruno did some FST improvements across the board here, but last time
we checked, these hacks were still needed for segmentation usecases
like this: see his benchmark here: https://s.apache.org/ffelc

For example, maybe it makes sense to cache a few hundred nodes here in
a similar way, depending on dictionary's alphabet size, to accelerate
segmentation, I don't know if it will help. Maybe also the current FST
"INPUT_TYPEs" are inappropriate and it would work better as BYTE1 FST
rather than BYTE2 or BYTE4 or whatever it is using now. The current
stemming doesn't put much pressure on this, so it isn't optimized.
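
(For reference, a stripped-down sketch of that root-arc caching trick. It is
modeled on what TokenInfoFST does, but written against a hypothetical ArcSource
interface rather than the real FST API, and the cache size is just an example.)

  /**
   * Precomputes the root's outgoing arcs for the first CACHE_SIZE labels so the
   * hottest lookups skip the per-call root-arc search entirely.
   */
  final class RootArcCache<A> {
    interface ArcSource<T> {
      /** Finds the root arc for {@code label}, or null if the root has no arc for it. */
      T findRootArc(int label);
    }

    private static final int CACHE_SIZE = 0x300;  // e.g. Latin plus common diacritics

    private final ArcSource<A> source;
    private final Object[] cached = new Object[CACHE_SIZE];

    RootArcCache(ArcSource<A> source) {
      this.source = source;
      for (int label = 0; label < CACHE_SIZE; label++) {
        cached[label] = source.findRootArc(label);  // paid once, when the dictionary is loaded
      }
    }

    @SuppressWarnings("unchecked")
    A rootArc(int label) {
      return label < CACHE_SIZE ? (A) cached[label] : source.findRootArc(label);
    }
  }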

On Wed, Feb 10, 2021 at 7:53 AM Robert Muir  wrote:
>
> The RAM usage used to be bad as you describe, it blows up way worse
> for other languages than German. There were many issues :)
>
> For Lucene, one common issue was that users wanted to have a lot of
> these things in RAM: e.g. supporting many different languages on a
> single server (multilingual data) and so forth.
> Can we speed up your use-case by using the FST in a smarter way? Why
> are there so many traversals... is it the way it is doing inexact
> matching? decomposition?
>
> That was the trick done with stemming, and the stemming was
> accelerated with some side data structures. For example "patternIndex"
> thing which is a scary precomputed list of tableized DFAs... its
> wasting a "little" space with these tables to speed up hotspot for
> stemming. In that patternIndex example, some assumptions / limits had
> to be set, that hopefully no dictionary would ever make: that's all
> the "please report this to dev@lucene.apache.org" checks in the code.
> some tests were run against all the crazy OO dictionaries out there to
> examine the memory usage when looking at changes like this. Some of
> these are really, really crazy and do surprising things.
>
> On Wed, Feb 10, 2021 at 6:16 AM Peter Gromov
>  wrote:
> >
> > Hi there,
> >
> > I'm mostly done with supporting major Hunspell features necessary for most 
> > european languages (https://issues.apache.org/jira/browse/LUCENE-9687) (but 
> > of course I anticipate more minor fixes to come). Thanks Dawid Weiss for 
> > thorough reviews and promptly accepting my PRs so far!
> >
> > Now I'd like to make this Hunspell implementation at least as fast as the 
> > native Hunspell called via JNI, ideally faster. Now it seems 1.5-3 times 
> > slower for me, depending on the language (I've checked en/de/fr so far). 
> > I've profiled it, done some minor optimizations, and now it appears that 
> > most time is taken by FST traversals. I've prototyped decreasing the number 
> > of these traversals, and the execution time goes down noticeably (e.g. 
> > 30%), but it's still not enough, and the code becomes complicated.
> >
> > So I'm considering other data structures instead of FSTs (Hunspell/C++ 
> > itself doesn't bother with tries: it uses hash tables and linear searches 
> > instead). The problem is, FST is very well space-optimized, and other data 
> > structures consume more memory.
> >
> > So my question is: what's the relative importance of speed and memory in 
> > Lucene's stemmer? E.g. now the FST for German takes 2.2MB. Would it be OK 
> > to use a CharArrayMap taking 20-25MB, but be much faster on lookup (45% 
> > improvement in stemming)? Or, with a BytesRefHash plus an array I can make 
> > it ~9MB, with almost the same speedup (but more complex code).
> >
> > How much memory usage is acceptable at all?
> >
> > Maybe there are other suitable data structures in Lucene core that I'm not 
> > aware of? I basically need a Map, which'd be better queried 
> > with a char[]+offset+length keys (like CharArrayMap does).
> >
> > Peter

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Hunspell performance

2021-02-10 Thread Robert Muir
The RAM usage used to be bad as you describe; it blows up way worse
for other languages than German. There were many issues :)

For Lucene, one common issue was that users wanted to have a lot of
these things in RAM: e.g. supporting many different languages on a
single server (multilingual data) and so forth.
Can we speed up your use-case by using the FST in a smarter way? Why
are there so many traversals... is it the way it is doing inexact
matching? decomposition?

That was the trick done with stemming, and the stemming was
accelerated with some side data structures. For example, the "patternIndex"
thing, which is a scary precomputed list of tableized DFAs... it's
wasting a "little" space with these tables to speed up the hotspot for
stemming. In that patternIndex example, some assumptions / limits had
to be set, that hopefully no dictionary would ever make: that's all
the "please report this to dev@lucene.apache.org" checks in the code.
Some tests were run against all the crazy OO dictionaries out there to
examine the memory usage when looking at changes like this. Some of
these are really, really crazy and do surprising things.

On Wed, Feb 10, 2021 at 6:16 AM Peter Gromov
 wrote:
>
> Hi there,
>
> I'm mostly done with supporting major Hunspell features necessary for most 
> european languages (https://issues.apache.org/jira/browse/LUCENE-9687) (but 
> of course I anticipate more minor fixes to come). Thanks Dawid Weiss for 
> thorough reviews and promptly accepting my PRs so far!
>
> Now I'd like to make this Hunspell implementation at least as fast as the 
> native Hunspell called via JNI, ideally faster. Now it seems 1.5-3 times 
> slower for me, depending on the language (I've checked en/de/fr so far). I've 
> profiled it, done some minor optimizations, and now it appears that most time 
> is taken by FST traversals. I've prototyped decreasing the number of these 
> traversals, and the execution time goes down noticeably (e.g. 30%), but it's 
> still not enough, and the code becomes complicated.
>
> So I'm considering other data structures instead of FSTs (Hunspell/C++ itself 
> doesn't bother with tries: it uses hash tables and linear searches instead). 
> The problem is, FST is very well space-optimized, and other data structures 
> consume more memory.
>
> So my question is: what's the relative importance of speed and memory in 
> Lucene's stemmer? E.g. now the FST for German takes 2.2MB. Would it be OK to 
> use a CharArrayMap taking 20-25MB, but be much faster on lookup (45% 
> improvement in stemming)? Or, with a BytesRefHash plus an array I can make it 
> ~9MB, with almost the same speedup (but more complex code).
>
> How much memory usage is acceptable at all?
>
> Maybe there are other suitable data structures in Lucene core that I'm not 
> aware of? I basically need a Map, which'd be better queried 
> with a char[]+offset+length keys (like CharArrayMap does).
>
> Peter

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Hunspell performance

2021-02-10 Thread Peter Gromov
Hi there,

I'm mostly done with supporting major Hunspell features necessary for most
european languages (https://issues.apache.org/jira/browse/LUCENE-9687) (but
of course I anticipate more minor fixes to come). Thanks Dawid Weiss for
thorough reviews and promptly accepting my PRs so far!

Now I'd like to make this Hunspell implementation at least as fast as the
native Hunspell called via JNI, ideally faster. Currently it seems 1.5-3 times
slower for me, depending on the language (I've checked en/de/fr so far).
I've profiled it, done some minor optimizations, and now it appears that
most time is taken by FST traversals. I've prototyped decreasing the number
of these traversals, and the execution time goes down noticeably (e.g.
30%), but it's still not enough, and the code becomes complicated.

So I'm considering other data structures instead of FSTs (Hunspell/C++
itself doesn't bother with tries: it uses hash tables and linear searches
instead). The problem is, FST is very well space-optimized, and other data
structures consume more memory.

So my question is: what's the relative importance of speed and memory in
Lucene's stemmer? E.g. now the FST for German takes 2.2MB. Would it be OK
to use a CharArrayMap taking 20-25MB, but be much faster on lookup (45%
improvement in stemming)? Or, with a BytesRefHash plus an array I can make
it ~9MB, with almost the same speedup (but more complex code).

How much memory usage is acceptable at all?

Maybe there are other suitable data structures in Lucene core that I'm not
aware of? I basically need a Map, which'd be better queried
with char[]+offset+length keys (like CharArrayMap does).
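
(To make that kind of lookup concrete: a minimal open-addressing sketch of a
char[]+offset+length keyed map, i.e. the allocation-free query style that
CharArrayMap offers. This is not CharArrayMap itself; it assumes a power-of-two
capacity, never resizes, and supports no removal.)

  final class CharRangeMap<V> {
    private final char[][] keys;
    private final Object[] values;
    private final int mask;

    CharRangeMap(int capacityPowerOfTwo) {
      keys = new char[capacityPowerOfTwo][];
      values = new Object[capacityPowerOfTwo];
      mask = capacityPowerOfTwo - 1;
    }

    void put(String key, V value) {
      char[] k = key.toCharArray();
      int slot = hash(k, 0, k.length) & mask;
      while (keys[slot] != null && !rangeEquals(keys[slot], k, 0, k.length)) {
        slot = (slot + 1) & mask;  // linear probing; assumes the table never fills up
      }
      keys[slot] = k;
      values[slot] = value;
    }

    @SuppressWarnings("unchecked")
    V get(char[] text, int offset, int length) {
      int slot = hash(text, offset, length) & mask;
      while (keys[slot] != null) {
        if (rangeEquals(keys[slot], text, offset, length)) {
          return (V) values[slot];  // no String allocated on the query path
        }
        slot = (slot + 1) & mask;
      }
      return null;
    }

    private static int hash(char[] text, int offset, int length) {
      int h = 0;
      for (int i = 0; i < length; i++) {
        h = 31 * h + text[offset + i];
      }
      return h;
    }

    private static boolean rangeEquals(char[] stored, char[] text, int offset, int length) {
      if (stored.length != length) return false;
      for (int i = 0; i < length; i++) {
        if (stored[i] != text[offset + i]) return false;
      }
      return true;
    }
  }

Querying with get(buffer, tokenStart, tokenLength) avoids allocating a String per
token, which is the main point of a CharArrayMap-style API.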

Peter


Re: Seeking an adventurous individual that has decent SolrCloud experience and works well independently and as part of a team.

2021-02-10 Thread Mark Miller
Thanks to those that volunteered for this. We are almost ready for kick
off. I’ve just got a couple more tests to run and a few help docs to finish.

Also, thanks David Smiley for helping to recruit. Sorry about that outburst
a while back: I literally woke up at 4 am or whenever on the couch, took a
look at my phone, lashed out in a half-asleep state, and rolled back over to
sleep.

This is about 100x the challenge of anything I've applied myself to, and it
pushed me in ways I have not been pushed. It pushed those close to me similarly.
I am an ADD shortcut machine, and this endeavor required the opposite of
that, and for a very, very long time (with some breaks, thankfully). It's a
culmination of my entire life experience of programming, and the first
thing since Lucene that I am pleased to associate my name with. Hopefully
the sausage tastes better than the process.

MRM

*"What, so everyone’s supposed to sleep every single night now? You realize
that nighttime makes up half of all time?"*

On Thu, Feb 4, 2021 at 12:54 PM Mark Miller  wrote:

> Thanks, I should be all set now.
>
> MRM
>
> “I’ll tell you how I feel about school, Jerry: it’s a waste of time. Bunch
> of people runnin’ around bumpin’ into each other, got a guy up front says,
> ‘2 + 2,’ and the people in the back say, ‘4.’ Then the bell rings and they
> give you a carton of milk and a piece of paper that says you can go take a
> dump or somethin’. I mean, it’s not a place for smart people, Jerry. I know
> that’s not a popular opinion, but that’s my two cents on the issue.”
>
> On Wed, Feb 3, 2021 at 8:18 PM Mark Miller  wrote:
>
>> Hey there.
>>
>> Do you have SolrCloud experience? Do you build clusters and smash updates
>> into them? Would you prefer Solr and SolrCloud to be faster? More stable?
>>
>> I'm looking for someone that is interested in engaging closely for a bit
>> on a project I've been working on. I'm seeking someone that wants to help
>> out with a fast, stable SolrCloud, or someone that has nagging SolrCloud
>> ills that they would like to not experience.
>>
>> I have a version of Solr, let's just call it Stellar ... Stellar Solr
>> sounds good, and a good OPs type that likes solid stuff and has some
>> (limited) time to play would be ideal.
>>
>> Please email me directly at markrmil...@gmail.com if you think you might
>> be interested and have some experience with SolrCloud clusters as well as
>> some limited availability in the near future.
>>
>> - MRM
>>
>> "I'm a scientist; because I invent, transform, create, and destroy for a
>> living, and when I don't like something about the world, I change it."
>>
> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller