Re: Best strategy migrate indexes

2022-10-29 Thread Baris Kazar
It is always good practice to retain the original, non-indexed
data: whenever Lucene changes version,
even a minor version, I always reindex.

Best regards

From: Gus Heck 
Sent: Saturday, October 29, 2022 2:17 PM
To: java-user@lucene.apache.org 
Subject: Re: Best strategy migrate indexes

Hi Pablo,

The deafening silence is probably nobody wanting to give you the bad news.
You are on a mission that may not be feasible, and even if you can get it
to "work", the end result won't likely be equivalent to indexing the
original data with Lucene 9.x. The indexing process is fundamentally lossy
and information originally used to produce non-stored fields will have been
thrown out. A simple example is things like stopwords or anything analyzed
with subclasses of FilteringTokenFilter. If the stop word list changed, or
the details of one of these filters changed (bugfix?), you will end up with
a different result than indexing with 9.x. This is just one
example; another would be stemming, where the index likely only contains the
stem, not the whole word. Other folks who are more interested in the
details of our codecs than I am can probably provide further examples on a
more fundamental level. Lucene is not a database, and the source documents
should always be retained in a form that can be reindexed. If you have
inherited a system where source material has not been retained, you have a
difficult project and may have some potentially painful expectation setting
to perform.
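The lossiness described above is easy to demonstrate. Below is a plain-Java sketch (no Lucene involved; the `analyze` method is a hypothetical stand-in for a real analysis chain): after stop-word removal and a crude stemming step, two different source sentences yield the identical term list, so the original text cannot be recovered from the indexed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LossyAnalysis {
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "is", "are");

    // Hypothetical analyzer: lowercase, drop stop words, strip a trailing "s".
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue; // lossy: stop words vanish
            // lossy: only the crude "stem" survives, not the original word form
            terms.add(token.endsWith("s") ? token.substring(0, token.length() - 1) : token);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Two different inputs, identical indexed terms:
        System.out.println(analyze("The dogs are running")); // [dog, running]
        System.out.println(analyze("A dog is running"));     // [dog, running]
    }
}
```

If the stop-word list or the stemming rule changes between versions, terms produced from the same source text change too, which is exactly why reindexing from the original documents is the only reliable path.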

Best,
Gus



On Fri, Oct 28, 2022 at 8:01 AM Pablo Vázquez Blázquez 
wrote:

> Hi all,
>
> I have some indices indexed with lucene 5.5.0. I have updated my
> dependencies and code to Lucene 7 (but my final goal is to use Lucene 9)
> and when trying to work with them I am having the exception:
> org.apache.lucene.index.IndexFormatTooOldException: Format version is not
> supported (resource
>
> BufferedChecksumIndexInput(MMapIndexInput(path="...\tests\segments_b"))):
> this index is too old (version: 5.5.0). This version of Lucene only
> supports indexes created with release 6.0 and later.
>
> I want to migrate from Lucene 5.x to Lucene 9.x. Which is the best
> strategy? Is there any tool to migrate the indices? Is it mandatory to
> reindex? In this case, how can I deal with this when I do not have the
> sources of documents that generated my current indices (I mean, I just have
> the indices themselves)?
>
> Thanks,
>
> --
> Pablo Vázquez
> (pabl...@gmail.com)
>
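For completeness: Lucene ships an `org.apache.lucene.index.IndexUpgrader` tool, but it can only upgrade an index written by the previous major version, so moving from 5.x to 9.x means chaining upgrades through each major release, and it cannot restore anything the analysis chain discarded at indexing time. A sketch of the invocation (jar names and the index path are illustrative; recent versions may also need the matching lucene-backward-codecs jar on the classpath):

```shell
# Upgrade one major version at a time, each run with that version's jars.
java -cp lucene-core-6.6.6.jar \
  org.apache.lucene.index.IndexUpgrader -delete-prior-commits /path/to/index
java -cp lucene-core-7.7.3.jar \
  org.apache.lucene.index.IndexUpgrader -delete-prior-commits /path/to/index
# ...repeat with an 8.x jar, then a 9.x jar.
```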


--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)


Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Baris Kazar
Thank You Thank You
Best regards

From: Michael McCandless 
Sent: Saturday, August 6, 2022 11:29:25 AM
To: Baris Kazar 
Cc: java-user@lucene.apache.org 
Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before 
Thursday August 4 midnight (in your local time)

OK done: 
https://github.com/apache/lucene-jira-archive/commit/13fa4cb46a1a6d609448240e4f66c263da8b3fd1

Mike McCandless

http://blog.mikemccandless.com


On Sat, Aug 6, 2022 at 10:29 AM Baris Kazar <baris.ka...@oracle.com> wrote:
I think so.
Best regards

From: Michael McCandless <luc...@mikemccandless.com>
Sent: Saturday, August 6, 2022 10:12 AM
To: java-user@lucene.apache.org
Cc: Baris Kazar <baris.ka...@oracle.com>
Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before 
Thursday August 4 midnight (in your local time)

Thanks Baris,

And your Jira ID is bkazar right?

Mike

On Sat, Aug 6, 2022 at 10:05 AM Baris Kazar <baris.ka...@oracle.com> wrote:
My GitHub username is bmkazar.
Can you please register me?
Best regards

From: Michael McCandless <luc...@mikemccandless.com>
Sent: Saturday, August 6, 2022 6:05:51 AM
To: d...@lucene.apache.org
Cc: Lucene Users <java-user@lucene.apache.org>; java-dev <java-...@lucene.apache.org>
Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before 
Thursday August 4 midnight (in your local time)

Hi Adam, I added your linked accounts here:
https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c

And Tomoko added Rushabh's linked accounts here:

https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb

Keep the linked accounts coming!

Mike

On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah
<rushabh.s...@salesforce.com.invalid> wrote:

> Hi,
> My mapping is:
> JiraName,GitHubAccount,JiraDispName
> shahrs87, shahrs87, Rushabh Shah
>
> Thank you Tomoko and Mike for all of your hard work.
>
>
>
>
> On Sun, Jul 31, 2022 at 3:08 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hello Lucene users, contributors and developers,
>>
>> If you have used Lucene's Jira and you have a GitHub account as well,
>> please check whether your user id mapping is in this file:
>> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
>>
>> If not, please reply to this email and we will try to add you.
>>
>> Please forward this email to anyone you know might be impacted and who
>> might not be tracking the Lucene lists.
>>
>>
>> Full details:
>>
>> The Lucene project will soon migrate from Jira to GitHub for issue
>> tracking.
>>
>> There have been discussions, votes, a migration tool created / iterated
>> (thanks to Tomoko Uchida's incredibly hard work), all iterating on Lucene's
>> dev list.
>>
>> When we run the migration, we would like to map Jira users to the right
>> GitHub users to properly @-mention the right person and make it easier for
>> you to find issues you have engaged with.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
> --
Mike McCandless

http://blog.mikemccandless.com
--
Mike McCandless

http://blog.mikemccandless.com


Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Baris Kazar
I think so.
Best regards

From: Michael McCandless 
Sent: Saturday, August 6, 2022 10:12 AM
To: java-user@lucene.apache.org 
Cc: Baris Kazar 
Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before 
Thursday August 4 midnight (in your local time)

Thanks Baris,

And your Jira ID is bkazar right?

Mike

On Sat, Aug 6, 2022 at 10:05 AM Baris Kazar <baris.ka...@oracle.com> wrote:
My GitHub username is bmkazar.
Can you please register me?
Best regards

From: Michael McCandless <luc...@mikemccandless.com>
Sent: Saturday, August 6, 2022 6:05:51 AM
To: d...@lucene.apache.org
Cc: Lucene Users <java-user@lucene.apache.org>; java-dev <java-...@lucene.apache.org>
Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before 
Thursday August 4 midnight (in your local time)

Hi Adam, I added your linked accounts here:
https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c

And Tomoko added Rushabh's linked accounts here:

https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb

Keep the linked accounts coming!

Mike

On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah
<rushabh.s...@salesforce.com.invalid> wrote:

> Hi,
> My mapping is:
> JiraName,GitHubAccount,JiraDispName
> shahrs87, shahrs87, Rushabh Shah
>
> Thank you Tomoko and Mike for all of your hard work.
>
>
>
>
> On Sun, Jul 31, 2022 at 3:08 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hello Lucene users, contributors and developers,
>>
>> If you have used Lucene's Jira and you have a GitHub account as well,
>> please check whether your user id mapping is in this file:
>> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
>>
>> If not, please reply to this email and we will try to add you.
>>
>> Please forward this email to anyone you know might be impacted and who
>> might not be tracking the Lucene lists.
>>
>>
>> Full details:
>>
>> The Lucene project will soon migrate from Jira to GitHub for issue
>> tracking.
>>
>> There have been discussions, votes, a migration tool created / iterated
>> (thanks to Tomoko Uchida's incredibly hard work), all iterating on Lucene's
>> dev list.
>>
>> When we run the migration, we would like to map Jira users to the right
>> GitHub users to properly @-mention the right person and make it easier for
>> you to find issues you have engaged with.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
> --
Mike McCandless

http://blog.mikemccandless.com
--
Mike McCandless

http://blog.mikemccandless.com


Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Baris Kazar
My GitHub username is bmkazar.
Can you please register me?
Best regards

From: Michael McCandless 
Sent: Saturday, August 6, 2022 6:05:51 AM
To: d...@lucene.apache.org 
Cc: Lucene Users ; java-dev 

Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before 
Thursday August 4 midnight (in your local time)

Hi Adam, I added your linked accounts here:
https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c

And Tomoko added Rushabh's linked accounts here:

https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb

Keep the linked accounts coming!

Mike

On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah
 wrote:

> Hi,
> My mapping is:
> JiraName,GitHubAccount,JiraDispName
> shahrs87, shahrs87, Rushabh Shah
>
> Thank you Tomoko and Mike for all of your hard work.
>
>
>
>
> On Sun, Jul 31, 2022 at 3:08 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hello Lucene users, contributors and developers,
>>
>> If you have used Lucene's Jira and you have a GitHub account as well,
>> please check whether your user id mapping is in this file:
>> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
>>
>> If not, please reply to this email and we will try to add you.
>>
>> Please forward this email to anyone you know might be impacted and who
>> might not be tracking the Lucene lists.
>>
>>
>> Full details:
>>
>> The Lucene project will soon migrate from Jira to GitHub for issue
>> tracking.
>>
>> There have been discussions, votes, a migration tool created / iterated
>> (thanks to Tomoko Uchida's incredibly hard work), all iterating on Lucene's
>> dev list.
>>
>> When we run the migration, we would like to map Jira users to the right
>> GitHub users to properly @-mention the right person and make it easier for
>> you to find issues you have engaged with.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
> --
Mike McCandless

http://blog.mikemccandless.com


Re: Using Lucene 8.5.1 vs 8.5.2

2022-07-26 Thread Baris Kazar
Great, 8.11 has gone further.
Thanks Mike
Best regards

From: Mike Drob 
Sent: Tuesday, July 26, 2022 5:18 PM
To: java-user@lucene.apache.org 
Cc: Baris Kazar 
Subject: Re: Using Lucene 8.5.1 vs 8.5.2

I would use 8.5.2 if possible when considering fuzzy queries. The automaton
can be very large, but if you're not caching the query then the extra footprint
is not significant, since it needs to be computed at some point anyway to
evaluate the query.
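Background on why those structures get large: a fuzzy query matches all terms within a bounded Levenshtein edit distance of the query term, which Lucene compiles into an automaton over the term dictionary. The plain-Java sketch below computes that edit distance directly; it is only an illustration of the matching criterion, not Lucene's automaton implementation:

```java
public class EditDistance {
    // Classic dynamic-programming Levenshtein distance with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // A fuzzy query like "lucene~2" matches terms within distance 2:
        System.out.println(levenshtein("lucene", "lucine")); // 1 edit
        System.out.println(levenshtein("lucene", "search")); // far apart
    }
}
```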

Really though, I would use 8.11 over either of those.

Mike Drob

On Tue, Jul 26, 2022 at 1:03 PM Baris Kazar <baris.ka...@oracle.com> wrote:
Dear Folks,-
 May I please ask if using 8.5.1 is OK compared to 8.5.2?
The only change was the following, where fuzzy queries were fixed for a major bug
(?).
How much does this affect fuzzy query performance? Has the dev team done a
study comparing the LUCENE-9350 bug vs. the LUCENE-9068 bug?
https://lucene.apache.org/core/8_5_2/changes/Changes.html
https://issues.apache.org/jira/browse/LUCENE-9350
Best regards




Lucene 9.1.0 has changed name of lucene-analysis-common-9.1.0.jar

2022-07-26 Thread Baris Kazar
Dear Folks,-
 I see that Lucene has changed the name of one of its JAR files to
lucene-analysis-common-9.1.0.jar in Lucene version 9.1.0.
It used to be named lucene-analyzers-common. Can someone please confirm?
Best regards
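To confirm: in Lucene 9.x the analysis modules were renamed from `lucene-analyzers-*` to `lucene-analysis-*`. Assuming Maven is used, the dependency change looks roughly like this (a sketch; verify the exact coordinates against Maven Central):

```xml
<!-- Lucene 8.x -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>8.11.2</version>
</dependency>

<!-- Lucene 9.x: artifact renamed -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analysis-common</artifactId>
  <version>9.1.0</version>
</dependency>
```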



Re: Performance Comparison of Benchmarks by using Lucene 9.1.0 vs 8.5.1

2022-07-26 Thread Baris Kazar
Great, this was very helpful.
This gives a rough idea, using the dates of the Lucene bugs/features added on
those graphs.
Best regards

From: Michael Sokolov 
Sent: Tuesday, July 26, 2022 3:55 PM
To: java-user@lucene.apache.org 
Cc: Baris Kazar 
Subject: Re: Performance Comparison of Benchmarks by using Lucene 9.1.0 vs 8.5.1

https://home.apache.org/~mikemccand/lucenebench/ shows how various
benchmarks have evolved over time *on the main branch*. There is no
direct comparison of every version against every other version that I
have seen though.

On Tue, Jul 26, 2022 at 2:12 PM Baris Kazar  wrote:
>
> Dear Folks,-
>  Similar question to my previous post: this time I wonder if there is a Lucene
> web site where benchmarks are run against these two versions of Lucene.
> I see many (44+16) API changes, (48+9) improvements, and (16+15) bug fixes,
> which sounds great.
> Best regards
>


Performance Comparison of Benchmarks by using Lucene 9.1.0 vs 8.5.1

2022-07-26 Thread Baris Kazar
Dear Folks,-
 Similar question to my previous post: this time I wonder if there is a Lucene
web site where benchmarks are run against these two versions of Lucene.
I see many (44+16) API changes, (48+9) improvements, and (16+15) bug fixes,
which sounds great.
Best regards



Using Lucene 8.5.1 vs 8.5.2

2022-07-26 Thread Baris Kazar
Dear Folks,-
 May I please ask if using 8.5.1 is OK compared to 8.5.2?
The only change was the following, where fuzzy queries were fixed for a major bug
(?).
How much does this affect fuzzy query performance? Has the dev team done a
study comparing the LUCENE-9350 bug vs. the LUCENE-9068 bug?
https://lucene.apache.org/core/8_5_2/changes/Changes.html
https://issues.apache.org/jira/browse/LUCENE-9350
Best regards




Re: How to handle corrupt Lucene index

2022-04-13 Thread Baris Kazar
Yes, that is a great point to look at first, and it would eliminate any
JDBC-related issues that may lead to such problems.
Best regards

From: Tim Whittington 
Sent: Wednesday, April 13, 2022 9:17:44 PM
To: java-user@lucene.apache.org 
Subject: Re: How to handle corrupt Lucene index

Thanks for this - I'll have a look at the database server code that is
managing the Lucene indexes and see if I can track it down.

Tim

On Thu, 14 Apr 2022 at 12:41, Robert Muir  wrote:

> On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington
>  wrote:
> >
> > I'm working with/on a database system that uses Lucene for full text
> > indexes (currently using 7.3.0).
> > We're encountering occasional problems that occur after unclean shutdowns
> > of the database , resulting in
> > "org.apache.lucene.index.CorruptIndexException: file mismatch" errors
> when
> > the IndexWriter is constructed.
> >
> > In all of the cases this has occurred, CheckIndex finds no issues with
> the
> > Lucene index.
> >
> > The database has write-ahead-log and recovery facilities, so making the
> > Lucene indexes durable wrt database operations is doable, but in this
> case
> > the IndexWriter itself is failing to initialise, so it looks like there
> > needs to be a lower-level validation/recovery operation before
> reconciling
> > transactions can take place.
> >
> > Can anyone provide any advice about how the database can detect and
> recover
> > from this situation?
> >
>
> File mismatch means files are getting mixed up. It is the equivalent
> of swapping say, /etc/hosts and /etc/passwd on your computer.
>
> In your case you have a .si file (lets say it is named _79.si) that
> really belongs to another segment (e.g. _42).
>
> This isn't a lucene issue, this is something else you must be using
> that is "transporting files around", and it is mixing the files up.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: How to handle corrupt Lucene index

2022-04-13 Thread Baris Kazar
That is a good practice, and I pointed that out since I saw Lucene 7.0 in the
stack trace.


Best regards

From: Tim Whittington 
Sent: Wednesday, April 13, 2022 9:15 PM
To: java-user@lucene.apache.org 
Subject: Re: How to handle corrupt Lucene index

To be clear, these indexes are created and read with the same Lucene
version (7.3.0).

Tim

On Thu, 14 Apr 2022 at 12:45, Baris Kazar  wrote:

> In my experience, if you built the index with version x, then use the index
> with version x as well.
> I never encountered any problems this way with Lucene.
>
> Can you maybe recreate the Lucene index on 7.3.0?
>
> Also, how do you use the database in your scenario?
> Are you using JDBC-like operations, as in the Oracle database? Lucene
> operations are independent of database operations.
>
> Best regards
> 
> From: Tim Whittington 
> Sent: Wednesday, April 13, 2022 8:24 PM
> To: java-user@lucene.apache.org 
> Subject: How to handle corrupt Lucene index
>
> I'm working with/on a database system that uses Lucene for full text
> indexes (currently using 7.3.0).
> We're encountering occasional problems that occur after unclean shutdowns
> of the database , resulting in
> "org.apache.lucene.index.CorruptIndexException: file mismatch" errors when
> the IndexWriter is constructed.
>
> In all of the cases this has occurred, CheckIndex finds no issues with the
> Lucene index.
>
> The database has write-ahead-log and recovery facilities, so making the
> Lucene indexes durable wrt database operations is doable, but in this case
> the IndexWriter itself is failing to initialise, so it looks like there
> needs to be a lower-level validation/recovery operation before reconciling
> transactions can take place.
>
> Can anyone provide any advice about how the database can detect and recover
> from this situation?
>
> thanks
> Tim
> ---
>
> Relevant parts of the exception:
>
> org.apache.lucene.index.CorruptIndexException: file mismatch, expected
> id=e673n8syolqg0phzxvw8d7czu, got=dwpa40yzwp7gf06xibrsx1pn2
>
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/databases//luceneIndexes/SearchNameIx/_
> 8x.si")))
> at
> org.apache.lucene.codecs.CodecUtil.checkIndexHeaderID(CodecUtil.java:351)
> at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:256)
> at
>
> org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.read(Lucene70SegmentInfoFormat.java:95)
> at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:360)
> at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
> at
> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1121)
> -- 8< --
>


Re: How to handle corrupt Lucene index

2022-04-13 Thread Baris Kazar
In my experience, if you built the index with version x, then use the index with
version x as well.
I never encountered any problems this way with Lucene.

Can you maybe recreate the Lucene index on 7.3.0?

Also, how do you use the database in your scenario?
Are you using JDBC-like operations, as in the Oracle database? Lucene operations
are independent of database operations.

Best regards

From: Tim Whittington 
Sent: Wednesday, April 13, 2022 8:24 PM
To: java-user@lucene.apache.org 
Subject: How to handle corrupt Lucene index

I'm working with/on a database system that uses Lucene for full text
indexes (currently using 7.3.0).
We're encountering occasional problems that occur after unclean shutdowns
of the database , resulting in
"org.apache.lucene.index.CorruptIndexException: file mismatch" errors when
the IndexWriter is constructed.

In all of the cases this has occurred, CheckIndex finds no issues with the
Lucene index.

The database has write-ahead-log and recovery facilities, so making the
Lucene indexes durable wrt database operations is doable, but in this case
the IndexWriter itself is failing to initialise, so it looks like there
needs to be a lower-level validation/recovery operation before reconciling
transactions can take place.

Can anyone provide any advice about how the database can detect and recover
from this situation?

thanks
Tim
---

Relevant parts of the exception:

org.apache.lucene.index.CorruptIndexException: file mismatch, expected
id=e673n8syolqg0phzxvw8d7czu, got=dwpa40yzwp7gf06xibrsx1pn2
(resource=BufferedChecksumIndexInput(MMapIndexInput(path="/databases//luceneIndexes/SearchNameIx/_
8x.si")))
at org.apache.lucene.codecs.CodecUtil.checkIndexHeaderID(CodecUtil.java:351)
at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:256)
at
org.apache.lucene.codecs.lucene70.Lucene70SegmentInfoFormat.read(Lucene70SegmentInfoFormat.java:95)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:360)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
at
org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1121)
-- 8< --
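As an aside, the CheckIndex verification mentioned in this thread can be run from the command line. A sketch (jar name and index path are illustrative; note that the repair option rewrites the index and drops unreadable segments, so run it only against a copy):

```shell
# Read-only integrity check:
java -cp lucene-core-7.3.0.jar \
  org.apache.lucene.index.CheckIndex /path/to/luceneIndexes/SearchNameIx
```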


Re: How to propose a new feature

2022-04-01 Thread Baris Kazar
This cache could work on different indexable fields, or maybe even stored fields,
but indexable fields are better, I think.
It could be configured with which fields to cache, too. Probably most people would
choose all indexable fields.
Thanks

From: Baris Kazar 
Sent: Friday, April 1, 2022 1:03 PM
To: Adrien Grand ; Lucene Users Mailing List 
; Baris Kazar 
Subject: Re: How to propose a new feature

I am proposing to add a prefetch cache to the architecture of the Lucene core
engine.
I think there was some mechanism before, like fetching 100 documents from the Hits
constructor.
I want to expand this into a cache structure such that the cache will hold the most
frequently hit results for a while, with some well-known cache strategies.
Or we can come up, as the Lucene community, with a new caching design.

Maybe this is already available in the code, and I will be happy if you can
point me to it.
If not, what are the thoughts on this?
Thanks
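To make the proposal concrete, here is a minimal plain-Java sketch of the kind of bounded, most-recently-used result cache being suggested (all names are hypothetical; a real design would also need invalidation when the index changes):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical query-result cache: keeps the N most recently used entries.
public class QueryResultCache {
    private final Map<String, List<Integer>> cache;

    public QueryResultCache(int maxEntries) {
        // accessOrder=true turns LinkedHashMap into an LRU structure.
        this.cache = new LinkedHashMap<String, List<Integer>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<Integer>> eldest) {
                return size() > maxEntries; // evict least recently used entry
            }
        };
    }

    public List<Integer> get(String query) { return cache.get(query); }

    public void put(String query, List<Integer> topDocIds) { cache.put(query, topDocIds); }

    public int size() { return cache.size(); }
}
```

A real integration would key on something stabler than the raw query string (for example, the rewritten query plus the index generation) and evict on reader reopen, so stale hits are never served after a commit.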


From: Adrien Grand 
Sent: Friday, April 1, 2022 12:58 PM
To: Lucene Users Mailing List 
Cc: Baris Kazar 
Subject: Re: How to propose a new feature

Just send an email with the problem that you want to solve and the
approach that you are suggesting.

On Fri, Apr 1, 2022 at 6:56 PM Baris Kazar  wrote:
>
> Resent due to need for help.
> Thanks
> ____
> From: Baris Kazar
> Sent: Wednesday, March 30, 2022 2:30 PM
> To: java-user@lucene.apache.org 
> Cc: Baris Kazar 
> Subject: How to propose a new feature
>
> Hi Everyone,-
> What is the process to propose a new feature for Core Lucene engine?
> Best regards



--
Adrien


Re: How to propose a new feature

2022-04-01 Thread Baris Kazar
I am proposing to add a prefetch cache to the architecture of the Lucene core
engine.
I think there was some mechanism before, like fetching 100 documents from the Hits
constructor.
I want to expand this into a cache structure such that the cache will hold the most
frequently hit results for a while, with some well-known cache strategies.
Or we can come up, as the Lucene community, with a new caching design.

Maybe this is already available in the code, and I will be happy if you can
point me to it.
If not, what are the thoughts on this?
Thanks


From: Adrien Grand 
Sent: Friday, April 1, 2022 12:58 PM
To: Lucene Users Mailing List 
Cc: Baris Kazar 
Subject: Re: How to propose a new feature

Just send an email with the problem that you want to solve and the
approach that you are suggesting.

On Fri, Apr 1, 2022 at 6:56 PM Baris Kazar  wrote:
>
> Resent due to need for help.
> Thanks
> ____
> From: Baris Kazar
> Sent: Wednesday, March 30, 2022 2:30 PM
> To: java-user@lucene.apache.org 
> Cc: Baris Kazar 
> Subject: How to propose a new feature
>
> Hi Everyone,-
> What is the process to propose a new feature for Core Lucene engine?
> Best regards



--
Adrien


Re: How to propose a new feature

2022-04-01 Thread Baris Kazar
Resent due to need for help.
Thanks

From: Baris Kazar
Sent: Wednesday, March 30, 2022 2:30 PM
To: java-user@lucene.apache.org 
Cc: Baris Kazar 
Subject: How to propose a new feature

Hi Everyone,-
What is the process to propose a new feature for Core Lucene engine?
Best regards


How to propose a new feature

2022-03-30 Thread Baris Kazar
Hi Everyone,-
What is the process to propose a new feature for Core Lucene engine?
Best regards


Re: test

2022-02-20 Thread Baris Kazar
Yes, please. Welcome.

Best regards

From: Claude Lepere 
Sent: Sunday, February 20, 2022 1:32 PM
To: java-user@lucene.apache.org
Subject: test

Am I subscribed, please?

Claude Lepère
claudelep...@gmail.com




Re: Log4j

2021-12-15 Thread Baris Kazar
Ok these are good to know.
thanks

From: Uwe Schindler 
Sent: Wednesday, December 15, 2021 5:18 PM
To: java-user@lucene.apache.org ; Ali Akhtar 

Cc: Baris Kazar 
Subject: Re: Log4j

Hi,

It only has an abstract logging interface inside IndexWriter to track actions 
done during indexing. But implementation of that is up to the application. By 
default you can only redirect to a file or stdout, if needed.

All other Apis log nothing.

Uwe

On 15 December 2021, 21:58:59 UTC, Ali Akhtar wrote:

Does Lucene not have any internal logging at all, e.g for debugging?

On Thu, Dec 16, 2021 at 2:49 AM Uwe Schindler  wrote:

 Hi,

 Lucene is an API and does not log with log4j.

 Only the user interface Luke uses log4j, but this one does not do any
 networking. So unless a user of Luke enters JNDI expressions, nothing can
 happen. 😂

 Uwe

On 15 December 2021, 21:41:37 UTC, Baris Kazar <baris.ka...@oracle.com> wrote:
Hi Folks,-
 Lucene is not affected by the latest bug, right?
I saw on Solr News page there are some fixes already made to Solr.
Best regards

 --
 Uwe Schindler
 Achterdiek 19, 28357 Bremen
 
https://www.thetaphi.de

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de


Log4j

2021-12-15 Thread Baris Kazar
Hi Folks,-
 Lucene is not affected by the latest bug, right?
I saw on Solr News page there are some fixes already made to Solr.
Best regards


org.apache.lucene.index.memory.MemoryIndex

2021-10-06 Thread Baris Kazar
Hi,-
 Is there a project within Apache Lucene to extend this class to allow multiple 
results?
Best regards


Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-05 Thread Baris Kazar
Hi Adrien,-
 Is there a best-practice paper or Lucene document that shows the
benefit of the IndexWriter.forceMerge() and merge() methods, since you mentioned
too many segments,
and maybe shows this concept on a toy dataset as a best-practice example?
Best regards
baris


From: Baris Kazar 
Sent: Tuesday, October 5, 2021 3:56 PM
To: Adrien Grand ; Lucene Users Mailing List 
; Baris Kazar 
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Hi Adrien,-
Thanks for taking a look at it, and sure, it would be very nice to fix those
accessors.
It is OK in terms of speed, but I want it even faster.
Is there anything else I should look at to help make it faster?
Best regards


From: Adrien Grand 
Sent: Tuesday, October 5, 2021 3:18 PM
To: Lucene Users Mailing List 
Cc: Baris Kazar 
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Hmm we should fix these access$ accessors by fixing the visibility of some 
fields.

These breakdowns do not necessarily signal that something is wrong. Is the 
query executing fast overall?

On Mon, Oct 4, 2021 at 11:57 PM Baris Kazar <baris.ka...@oracle.com> wrote:
Hi, -
I did more experiments and this time i looked into these methods:
org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()


Let's start with BooleanWeight.bulkScorer() with its call tree and time spent:


BooleanWeight.bulkScorer()
-->> Weight.bulkScorer()
-->>-->> BooleanWeight.scorer()
-->>-->>-->>BooleanWeight.scorerSupplier()
-->>-->>-->>-->> Weight.scorerSupplier()
-->>-->>-->>-->>-->> TermQuery$Termweight.scorer()
-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.impacts()
-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.lucene84.Lucene84PostingsReader.impacts()
-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$BlockImpactsDocEnums.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>  
org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.MultiLevelSkipListReader.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.store.DataInput.readVLong() (constitutes 100% of 
BooleanWeight.bulkScorer() time here)



Next: BulkScorer.score() with its call tree and time spent:



BulkScorer.score()
-->> Weight$DefaultBulkScorer.score()
-->>-->> Weight$DefaultBulkScorer.scoreAll()
-->>-->>-->> WANDScorer$1.nextDoc()
-->>-->>-->>-->> WANDScorer$1.advance()
-->>-->>-->>-->>-->> WANDScorer.access$300() (constitutes 65% of 
BulkScorer.score() time here)
-->>-->>-->>-->>-->> WANDScorer.access$100() (constitutes 30% of 
BulkScorer.score() time here)
-->>-->>-->>-->>-->> WANDScorer.access$400() (constitutes 5% of 
BulkScorer.score() time here)

Best regards


From: Baris Kazar mailto:baris.ka...@oracle.com>>
Sent: Saturday, October 2, 2021 3:14 PM
To: Adrien Grand mailto:jpou...@gmail.com>>; Lucene Users 
Mailing List mailto:java-user@lucene.apache.org>>
Cc: Baris Kazar mailto:baris.ka...@oracle.com>>
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Hi Adrien,-
Thanks. Let me see next week the components (units, methods) within 
BulkScorer#score to see what takes most time among its called methods.

JVisualVM reports, for a method, the whole time, including the time spent in the 
called methods, and when you go down the execution tree it goes until the very 
last called method.

Regarding the second paragraph above:
when will there be too many segments in the Lucene index? i have 1 text field 
and 1 stored (non indexed) field.

Most of the time I get a couple of thousand hits and I ask for the top 20 of 
them. Could this be leading to
BooleanWeight#bulkScorer spending time?

Both of these units:
BooleanWeight#bulkScorer and BulkScorer#score spend equal amounts of time and 
totally make up
75% of IndexSearcher#search as i mentioned before.

Thanks for the swift reply
I appreciate very much


Best regards

From: Adrien Grand mailto:jpou...@gmail.com>>
Sent:

Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-05 Thread Baris Kazar
Hi Adrien,-
Thanks for taking a look at it and sure, that will be very nice to fix those 
accessors.
It is OK in terms of speed, but I want it even faster.
Is there anything else i should look at to help make it faster?
Best regards


From: Adrien Grand 
Sent: Tuesday, October 5, 2021 3:18 PM
To: Lucene Users Mailing List 
Cc: Baris Kazar 
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Hmm we should fix these access$ accessors by fixing the visibility of some 
fields.

These breakdowns do not necessarily signal that something is wrong. Is the 
query executing fast overall?

On Mon, Oct 4, 2021 at 11:57 PM Baris Kazar 
mailto:baris.ka...@oracle.com>> wrote:
Hi, -
I did more experiments and this time i looked into these methods:
org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()


Let's start with BooleanWeight.bulkScorer() with its call tree and time spent:


BooleanWeight.bulkScorer()
-->> Weight.bulkScorer()
-->>-->> BooleanWeight.scorer()
-->>-->>-->>BooleanWeight.scorerSupplier()
-->>-->>-->>-->> Weight.scorerSupplier()
-->>-->>-->>-->>-->> TermQuery$Termweight.scorer()
-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.impacts()
-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.lucene84.Lucene84PostingsReader.impacts()
-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$BlockImpactsDocEnums.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>  
org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.MultiLevelSkipListReader.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.store.DataInput.readVLong() (constitutes 100% of 
BooleanWeight.bulkScorer() time here)



Next: BulkScorer.score() with its call tree and time spent:



BulkScorer.score()
-->> Weight$DefaultBulkScorer.score()
-->>-->> Weight$DefaultBulkScorer.scoreAll()
-->>-->>-->> WANDScorer$1.nextDoc()
-->>-->>-->>-->> WANDScorer$1.advance()
-->>-->>-->>-->>-->> WANDScorer.access$300() (constitutes 65% of 
BulkScorer.score() time here)
-->>-->>-->>-->>-->> WANDScorer.access$100() (constitutes 30% of 
BulkScorer.score() time here)
-->>-->>-->>-->>-->> WANDScorer.access$400() (constitutes 5% of 
BulkScorer.score() time here)

Best regards


From: Baris Kazar mailto:baris.ka...@oracle.com>>
Sent: Saturday, October 2, 2021 3:14 PM
To: Adrien Grand mailto:jpou...@gmail.com>>; Lucene Users 
Mailing List mailto:java-user@lucene.apache.org>>
Cc: Baris Kazar mailto:baris.ka...@oracle.com>>
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Hi Adrien,-
Thanks. Let me see next week the components (units, methods) within 
BulkScorer#score to see what takes most time among its called methods.

JVisualVM reports, for a method, the whole time, including the time spent in the 
called methods, and when you go down the execution tree it goes until the very 
last called method.

Regarding the second paragraph above:
when will there be too many segments in the Lucene index? i have 1 text field 
and 1 stored (non indexed) field.

Most of the time I get a couple of thousand hits and I ask for the top 20 of 
them. Could this be leading to
BooleanWeight#bulkScorer spending time?

Both of these units:
BooleanWeight#bulkScorer and BulkScorer#score spend equal amounts of time and 
totally make up
75% of IndexSearcher#search as i mentioned before.

Thanks for the swift reply
I appreciate very much


Best regards

From: Adrien Grand mailto:jpou...@gmail.com>>
Sent: Saturday, October 2, 2021 1:44:40 AM
To: Lucene Users Mailing List 
mailto:java-user@lucene.apache.org>>
Cc: Baris Kazar mailto:baris.ka...@oracle.com>>
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Is your profiler reporting inclusive or exclusive costs for each function? Ie. 
does it exclude time spent in functions that are called within a function? I'm 
asking because it makes total sense for IndexSearcher#search to spend most of 
its time in BulkScore

Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-04 Thread Baris Kazar
Hi, -
I did more experiments and this time i looked into these methods:
org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()


Let's start with BooleanWeight.bulkScorer() with its call tree and time spent:


BooleanWeight.bulkScorer()
-->> Weight.bulkScorer()
-->>-->> BooleanWeight.scorer()
-->>-->>-->>BooleanWeight.scorerSupplier()
-->>-->>-->>-->> Weight.scorerSupplier()
-->>-->>-->>-->>-->> TermQuery$Termweight.scorer()
-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.impacts()
-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.lucene84.Lucene84PostingsReader.impacts()
-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$BlockImpactsDocEnums.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>  
org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.MultiLevelSkipListReader.init()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels()
-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->>-->> 
org.apache.lucene.store.DataInput.readVLong() (constitutes 100% of 
BooleanWeight.bulkScorer() time here)



Next: BulkScorer.score() with its call tree and time spent:



BulkScorer.score()
-->> Weight$DefaultBulkScorer.score()
-->>-->> Weight$DefaultBulkScorer.scoreAll()
-->>-->>-->> WANDScorer$1.nextDoc()
-->>-->>-->>-->> WANDScorer$1.advance()
-->>-->>-->>-->>-->> WANDScorer.access$300() (constitutes 65% of 
BulkScorer.score() time here)
-->>-->>-->>-->>-->> WANDScorer.access$100() (constitutes 30% of 
BulkScorer.score() time here)
-->>-->>-->>-->>-->> WANDScorer.access$400() (constitutes 5% of 
BulkScorer.score() time here)

Best regards


From: Baris Kazar 
Sent: Saturday, October 2, 2021 3:14 PM
To: Adrien Grand ; Lucene Users Mailing List 

Cc: Baris Kazar 
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Hi Adrien,-
Thanks. Let me see next week the components (units, methods) within 
BulkScorer#score to see what takes most time among its called methods.

JVisualVM reports, for a method, the whole time, including the time spent in the 
called methods, and when you go down the execution tree it goes until the very 
last called method.

Regarding the second paragraph above:
when will there be too many segments in the Lucene index? i have 1 text field 
and 1 stored (non indexed) field.

Most of the time I get a couple of thousand hits and I ask for the top 20 of 
them. Could this be leading to
BooleanWeight#bulkScorer spending time?

Both of these units:
BooleanWeight#bulkScorer and BulkScorer#score spend equal amounts of time and 
totally make up
75% of IndexSearcher#search as i mentioned before.

Thanks for the swift reply
I appreciate very much


Best regards

From: Adrien Grand 
Sent: Saturday, October 2, 2021 1:44:40 AM
To: Lucene Users Mailing List 
Cc: Baris Kazar 
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Is your profiler reporting inclusive or exclusive costs for each function? Ie. 
does it exclude time spent in functions that are called within a function? I'm 
asking because it makes total sense for IndexSearcher#search to spend most of 
its time in BulkScorer#score, which coordinates the whole matching+scoring 
process.
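Adrien's inclusive-vs-exclusive distinction can be illustrated with a toy, Lucene-free timing harness (purely illustrative; the class name and workload below are made up, not from this thread):

```java
// Inclusive time for a caller includes time spent in its callees;
// exclusive time subtracts it. JVisualVM's call tree shows inclusive
// times, which is why a coordinator like BulkScorer#score dominates.
public class TimingDemo {
  static long innerNanos;

  static void inner() {
    long t0 = System.nanoTime();
    long sum = 0;
    for (int i = 0; i < 1_000_000; i++) sum += i;  // simulated work
    if (sum < 0) throw new AssertionError();        // keep loop from being elided
    innerNanos += System.nanoTime() - t0;
  }

  public static void main(String[] args) {
    long t0 = System.nanoTime();
    inner();
    long inclusive = System.nanoTime() - t0;   // outer's inclusive time
    long exclusive = inclusive - innerNanos;   // outer's exclusive time
    System.out.println(exclusive <= inclusive);
  }
}
```

A profiler reporting inclusive costs will attribute nearly all of main's time to inner, even though main itself does almost no work.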

Having much time spent in BooleanWeight#bulkScorer is a bit surprising however. 
This suggests that you have too many segments in your index (since the bulk 
scorer needs to be recreated for every segment) or that your average query 
matches a very low number of documents (so that Lucene spends more time 
figuring out how best to find the matches versus actually finding these 
matches).

On Sat, Oct 2, 2021 at 5:57 AM Baris Kazar 
mailto:baris.ka...@oracle.com>> wrote:
Hi,-
 I performance profiled my application via jvisualvm on Java
and saw that 75% of the search process from
org.apache.lucene.search.IndexSearcher.search() are spent on
these units:
org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()
Is there any study or project to speed up these please?

Best regards



--
Adrien


Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-02 Thread Baris Kazar
Hi Adrien,-
Thanks. Let me see next week the components (units, methods) within 
BulkScorer#score to see what takes most time among its called methods.

JVisualVM reports, for a method, the whole time, including the time spent in the 
called methods, and when you go down the execution tree it goes until the very 
last called method.

Regarding the second paragraph above:
when will there be too many segments in the Lucene index? i have 1 text field 
and 1 stored (non indexed) field.

Most of the time I get a couple of thousand hits and I ask for the top 20 of 
them. Could this be leading to
BooleanWeight#bulkScorer spending time?

Both of these units:
BooleanWeight#bulkScorer and BulkScorer#score spend equal amounts of time and 
totally make up
75% of IndexSearcher#search as i mentioned before.

Thanks for the swift reply
I appreciate very much


Best regards

From: Adrien Grand 
Sent: Saturday, October 2, 2021 1:44:40 AM
To: Lucene Users Mailing List 
Cc: Baris Kazar 
Subject: Re: org.apache.lucene.search.BooleanWeight.bulkScorer() and 
BulkScorer.score()

Is your profiler reporting inclusive or exclusive costs for each function? Ie. 
does it exclude time spent in functions that are called within a function? I'm 
asking because it makes total sense for IndexSearcher#search to spend most of 
its time in BulkScorer#score, which coordinates the whole matching+scoring 
process.

Having much time spent in BooleanWeight#bulkScorer is a bit surprising however. 
This suggests that you have too many segments in your index (since the bulk 
scorer needs to be recreated for every segment) or that your average query 
matches a very low number of documents (so that Lucene spends more time 
figuring out how best to find the matches versus actually finding these 
matches).

On Sat, Oct 2, 2021 at 5:57 AM Baris Kazar 
mailto:baris.ka...@oracle.com>> wrote:
Hi,-
 I performance profiled my application via jvisualvm on Java
and saw that 75% of the search process from
org.apache.lucene.search.IndexSearcher.search() are spent on
these units:
org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()
Is there any study or project to speed up these please?

Best regards



--
Adrien


org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()

2021-10-01 Thread Baris Kazar
Hi,-
 I performance profiled my application via jvisualvm on Java
and saw that 75% of the search process from
org.apache.lucene.search.IndexSearcher.search() are spent on
these units:
org.apache.lucene.search.BooleanWeight.bulkScorer() and BulkScorer.score()
Is there any study or project to speed up these please?

Best regards



Re: Potential bug

2021-06-14 Thread Baris Kazar
i was clear on what i wanted to do with Lucene experiments in this thread.
(last part of first paragraph below)

Best regards

From: Baris Kazar 
Sent: Monday, June 14, 2021 10:28:47 AM
To: Atri Sharma ; java-user@lucene.apache.org 
; a.benede...@sease.io ; 
Baris Kazar 
Subject: Re: Potential bug

Dear Folks,-
 i have a lot of experience in performance tuning and parallel processing: 17+7 
years. So, when you say "you dont know what you ask for", that does not sound 
good at all besides i was clear on that.

Alessandro, i appreciate the apology and i would like to apologize if i hurt 
feelings and i never mean to hurt anybody's feelings and i still think i was 
not aggressive but i need to re-explain
what was wrong with the email:

I was not trying to be aggressive with my responses.
I write in this forum for a long time and never received an email like Yours.

I revised your email for this list. Because with my expertise, i dont think i 
should get a comment like the X Y problem example.

Moreover code can have bugs and raising is not a good word choice here. I am 
not here to find problems with Lucene and we are all here to use and make 
Lucene better.

And i appreciate the work committers as volunteers are doing and there is no 
doubt there. Lucene 8.y.z is much better with your work. Kudos to that success.

Keeping the tone neutral is what I am looking for here. Yes, respect is 
fundamental,
that is what i have been telling here in my last emails.

Would You please look at my revised email?
I think the email should have been composed
that way.

I would like to focus on my question please.
I hope we keep the tone neutral and professional.
Thanks for understanding.

Best regards

From: Atri Sharma 
Sent: Monday, June 14, 2021 8:46 AM
To: java-user@lucene.apache.org
Cc: Baris Kazar
Subject: Re: Potential bug

+1 to Adrien.

Let's keep the tone neutral.

On Mon, 14 Jun 2021, 16:00 Adrien Grand, 
mailto:jpou...@gmail.com>> wrote:
Baris, you called out an insult from Alessandro and your replies suggest
anger, but I couldn't see an insult from Alessandro actually.

+1 to Alessandro's call to make the tone softer on this discussion.

On Mon, Jun 14, 2021 at 11:28 AM Alessandro Benedetti 
mailto:a.benede...@sease.io>>
wrote:

> Hi Baris,
> first of all apologies for having misspelled your name, definitely, it was
> not meant as an insult.
> Secondly, your tone is not acceptable on this mailing list (or anywhere
> else).
> You must remember that we, committers, are operating on a volunteering
> basis, contributing code and helping people in our free time purely driven
> by passion.
> Respect is fundamental, we are not here to be treated aggressively.
>
> Regards
>
> --
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Fri, 11 Jun 2021 at 17:10, 
> mailto:baris.ka...@oracle.com>> wrote:
>
> > Let me guide to a professional answer to the below email:
> >
> >
> > Hi Baris,
> >
> > Since You mentioned You did all the performance study on your
> > application and still believe that
> >
> > the bottleneck is the fuzzy search api from Lucene, it would be best to
> > time the application for:
> >
> >   * matching phase (identifying candidates from the corpus of documents)
> >   * or in the ranking phase (scoring them by relevance)?
> >
> > Maybe this will help speedup further.
> >
> > Also, what do You mean by "what if the user needs to limit the search
> > process"? Can you elaborate?
> >
> > Cheers
> >
> >
> >
> > My answer would be :
> >
> > i can't access the Lucene code, so how can I time these two cases please?
> >
> > i mean by that sentence that when i see the hits are good i would like
> > to limit the number of hits.
> >
> >
> >
> > this is more like a professional conversation please. Thanks.
> >
> > Best regards
> >
> >
> > On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> > > Hi Bazir,
> > > this feels like an X Y problem [1 <https://xyproblem.info>].
> > > Can you express what is your original user requirement?
> > > Most of the time, at the cost of indexing time/space you may get
> quicker
> > > query times.
> > > Also, you should 

Re: Potential bug

2021-06-14 Thread Baris Kazar
Dear Folks,-
 i have a lot of experience in performance tuning and parallel processing: 17+7 
years. So, when you say "you dont know what you ask for", that does not sound 
good at all besides i was clear on that.

Alessandro, i appreciate the apology and i would like to apologize if i hurt 
feelings and i never mean to hurt anybody's feelings and i still think i was 
not aggressive but i need to re-explain
what was wrong with the email:

I was not trying to be aggressive with my responses.
I write in this forum for a long time and never received an email like Yours.

I revised your email for this list. Because with my expertise, i dont think i 
should get a comment like the X Y problem example.

Moreover code can have bugs and raising is not a good word choice here. I am 
not here to find problems with Lucene and we are all here to use and make 
Lucene better.

And i appreciate the work committers as volunteers are doing and there is no 
doubt there. Lucene 8.y.z is much better with your work. Kudos to that success.

Keeping the tone neutral is what I am looking for here. Yes, respect is 
fundamental,
that is what i have been telling here in my last emails.

Would You please look at my revised email?
I think the email should have been composed
that way.

I would like to focus on my question please.
I hope we keep the tone neutral and professional.
Thanks for understanding.

Best regards

From: Atri Sharma 
Sent: Monday, June 14, 2021 8:46 AM
To: java-user@lucene.apache.org
Cc: Baris Kazar
Subject: Re: Potential bug

+1 to Adrien.

Let's keep the tone neutral.

On Mon, 14 Jun 2021, 16:00 Adrien Grand, 
mailto:jpou...@gmail.com>> wrote:
Baris, you called out an insult from Alessandro and your replies suggest
anger, but I couldn't see an insult from Alessandro actually.

+1 to Alessandro's call to make the tone softer on this discussion.

On Mon, Jun 14, 2021 at 11:28 AM Alessandro Benedetti 
mailto:a.benede...@sease.io>>
wrote:

> Hi Baris,
> first of all apologies for having misspelled your name, definitely, it was
> not meant as an insult.
> Secondly, your tone is not acceptable on this mailing list (or anywhere
> else).
> You must remember that we, committers, are operating on a volunteering
> basis, contributing code and helping people in our free time purely driven
> by passion.
> Respect is fundamental, we are not here to be treated aggressively.
>
> Regards
>
> --
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Fri, 11 Jun 2021 at 17:10, 
> mailto:baris.ka...@oracle.com>> wrote:
>
> > Let me guide to a professional answer to the below email:
> >
> >
> > Hi Baris,
> >
> > Since You mentioned You did all the performance study on your
> > application and still believe that
> >
> > the bottleneck is the fuzzy search api from Lucene, it would be best to
> > time the application for:
> >
> >   * matching phase (identifying candidates from the corpus of documents)
> >   * or in the ranking phase (scoring them by relevance)?
> >
> > Maybe this will help speedup further.
> >
> > Also, what do You mean by "what if the user needs to limit the search
> > process"? Can you elaborate?
> >
> > Cheers
> >
> >
> >
> > My answer would be :
> >
> > i can't access the Lucene code, so how can I time these two cases please?
> >
> > i mean by that sentence that when i see the hits are good i would like
> > to limit the number of hits.
> >
> >
> >
> > this is more like a professional conversation please. Thanks.
> >
> > Best regards
> >
> >
> > On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> > > Hi Bazir,
> > > this feels like an X Y problem [1 <https://xyproblem.info>].
> > > Can you express what is your original user requirement?
> > > Most of the time, at the cost of indexing time/space you may get
> quicker
> > > query times.
> > > Also, you should identify where are you wasting most of your time, in
> the
> > > matching phase (identifying candidates from the corpus of documents) or
> > in
> > > the ranking phase (scoring them by relevance)?
> > >
> > > TopScoreDocCollector is quite a solid class, there's a ton to study,
> > &g

Re: Potential bug

2021-06-11 Thread baris . kazar

Let me guide to a professional answer to the below email:


Hi Baris,

Since You mentioned You did all the performance study on your 
application and still believe that


the bottleneck is the fuzzy search api from Lucene, it would be best to 
time the application for:


 * matching phase (identifying candidates from the corpus of documents)
 * or in the ranking phase (scoring them by relevance)?

Maybe this will help speedup further.

Also, what do You mean by "what if the user needs to limit the search 
process"? Can you elaborate?


Cheers



My answer would be :

i can't access the Lucene code, so how can I time these two cases please?

i mean by that sentence that when i see the hits are good i would like 
to limit the number of hits.




this is more like a professional conversation please. Thanks.

Best regards


On 6/11/21 11:57 AM, Alessandro Benedetti wrote:

Hi Bazir,
this feels like an X Y problem [1].
Can you express what is your original user requirement?
Most of the time, at the cost of indexing time/space you may get quicker
query times.
Also, you should identify where are you wasting most of your time, in the
matching phase (identifying candidates from the corpus of documents) or in
the ranking phase (scoring them by relevance)?

TopScoreDocCollector is quite a solid class, there's a ton to study,
analyze and experiment before raising the alarm of a bug :)

Also didn't understand this :
"what if the user needs to limit the search process?"
Can you elaborate?

Cheers



[1] 
https://xyproblem.info
--
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

http://www.sease.io


On Wed, 9 Jun 2021 at 19:08,  wrote:


Yes, i did those and i believe i am at the best level of performance now
and it is not bad at all but i want to make it much better.

i see like a linear drop in timings when i go lower number of words but
let me do that quick study again.

Fuzzy search  is always expensive but that seems to suit best to my needs.


Thanks Diego for these great questions and i already explored them. But
thanks again.

Best regards


On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

I have never used fuzzy search but from the documentation it seems very
expensive, and if you do it on 10 terms and 1M documents it seems very very
very expensive.

Are you using the default 'fuzzyness' parameter? (0.5) - It might end up
exploring a lot of documents, did you try to play with that parameter?

Have you tried to see how the performance changes if you do not use fuzzy
(just to see if it is fuzzy that introduces the slowdown)?

Or what happens to performance if you do fuzzy with 1, 2, 5 terms
instead of 10?


From: java-user@lucene.apache.org At: 06/09/21 18:56:31To:

java-user@lucene.apache.org,  baris.ka...@oracle.com

Subject: Re: Potential bug

I can't reveal those details, I am very sorry, but it is more than 1
million.

Let me tell you that I have a lot of code that processes results from Lucene,
but the bottleneck is Lucene fuzzy search.

Best regards


On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

How many documents do you have in the index?
and can you show an example of query?


From: java-user@lucene.apache.org At: 06/09/21 18:33:25To:

java-user@lucene.apache.org,  baris.ka...@oracle.com

Subject: Re: Potential bug

I have only two fields, one string, the other a number (stored as
string); I guess you can't go simpler than this.

I retrieve the hits and my major bottleneck is Lucene fuzzy search.

I take each word from the string, which is usually at most around 10
words, and I build a fuzzy boolean query out of them.

A simple query is like this 10-word query.

Limit means I want to stop the Lucene search around 20 hits; I don't want
thousands of hits.


Best regards
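The query shape described above can be sketched roughly as follows, assuming a Lucene 8.x classpath; the field name "text" and the maximum edit distance of 2 (FuzzyQuery's default) are assumptions, not details given in this thread:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyQuerySketch {
  // Build one SHOULD-ed FuzzyQuery per word, as described in the thread.
  static BooleanQuery build(String[] words) {
    BooleanQuery.Builder b = new BooleanQuery.Builder();
    for (String w : words) {
      // maxEdits = 2 allows up to two character edits per term.
      b.add(new FuzzyQuery(new Term("text", w), 2),
            BooleanClause.Occur.SHOULD);
    }
    return b.build();
  }
}
```

Each SHOULD clause is expanded against the term dictionary independently, so with ~10 fuzzy clauses over a million-plus documents, most of the cost sits in term expansion and scoring, consistent with the bottleneck described here.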


On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:


Hi Baris,


what if the user needs to limit the search process?

What do you mean by 'limit'?


there should be a way to speed up Lucene then if this is not possible,
since for some simple queries it takes half a second, which is too long.

What do you mean by 'simple' query? There might be multiple reasons behind
slowness of a query that are unrelated to the search (for example, if you
retrieve many documents and for each document you are extracting the content
of many fields) - would you like to tell us a bit more about your use case?

Regards,
Diego

From: java-user@lucene.apache.org At: 06/09/21 18:18:01To:

java-user@lucene.apache.org

Cc:  baris.ka...@oracle.com
Subject: Re: Potential bug

Thanks Adrien, but

Re: Potential bug

2021-06-11 Thread baris . kazar



i expect the answers from this list to be more professional please.

You dont have to answer to this list if you intend to insult.

Best regards


On 6/11/21 11:57 AM, Alessandro Benedetti wrote:


Hi Bazir,
this feels like an X Y problem [1].
Can you express what is your original user requirement?
Most of the time, at the cost of indexing time/space you may get quicker
query times.
Also, you should identify where are you wasting most of your time, in the
matching phase (identifying candidates from the corpus of documents) or in
the ranking phase (scoring them by relevance)?

TopScoreDocCollector is quite a solid class, there's a ton to study,
analyze and experiment before raising the alarm of a bug :)

Also didn't understand this :
"what if the user needs to limit the search process?"
Can you elaborate?

Cheers



[1] 
https://xyproblem.info
--
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

http://www.sease.io


On Wed, 9 Jun 2021 at 19:08,  wrote:


Yes, i did those and i believe i am at the best level of performance now
and it is not bad at all but i want to make it much better.

i see like a linear drop in timings when i go lower number of words but
let me do that quick study again.

Fuzzy search  is always expensive but that seems to suit best to my needs.


Thanks Diego for these great questions and i already explored them. But
thanks again.

Best regards


On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

I have never used fuzzy search but from the documentation it seems very
expensive, and if you do it on 10 terms and 1M documents it seems very very
very expensive.

Are you using the default 'fuzzyness' parameter? (0.5) - It might end up
exploring a lot of documents, did you try to play with that parameter?

Have you tried to see how the performance changes if you do not use fuzzy
(just to see if it is fuzzy that introduces the slowdown)?

Or what happens to performance if you do fuzzy with 1, 2, 5 terms
instead of 10?


From: java-user@lucene.apache.org At: 06/09/21 18:56:31To:

java-user@lucene.apache.org,  baris.ka...@oracle.com

Subject: Re: Potential bug

I can't reveal those details, I am very sorry, but it is more than 1
million.

Let me tell you that I have a lot of code that processes results from Lucene,
but the bottleneck is Lucene fuzzy search.

Best regards


On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

How many documents do you have in the index?
and can you show an example of query?


From: java-user@lucene.apache.org At: 06/09/21 18:33:25To:

java-user@lucene.apache.org,  baris.ka...@oracle.com

Subject: Re: Potential bug

I have only two fields, one string, the other a number (stored as
string); I guess you can't go simpler than this.

I retrieve the hits and my major bottleneck is Lucene fuzzy search.

I take each word from the string, which is usually at most around 10
words, and I build a fuzzy boolean query out of them.

A simple query is like this 10-word query.

Limit means I want to stop the Lucene search around 20 hits; I don't want
thousands of hits.


Best regards


On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:


Hi Baris,


what if the user needs to limit the search process?

What do you mean by 'limit'?


there should be a way to speed up Lucene then if this is not possible,
since for some simple queries it takes half a second, which is too long.

What do you mean by 'simple' query? there might be multiple reasons

behind

slowness of a query that are unrelated to the search (for example, if

you

retrieve many documents and for each document you are extracting the

content

of

many fields) - would you like to tell us a bit more about your use case?

Regards,
Diego

From: java-user@lucene.apache.org At: 06/09/21 18:18:01To:

java-user@lucene.apache.org

Cc:  baris.ka...@oracle.com
Subject: Re: Potential bug

Thanks Adrien, but the differences is too far apart.

I think the algorithm needs to be revised.


what if the user needs to limit the search process?

that leaves no control.

there should be a way to speedup lucene then if this is not possible,

since for some simple queries it takes half a second which is too long.

Best regards


On 6/9/21 1:13 PM, Adrien Grand wrote:

Hi Baris,

totalhitsThreshold is actually a minimum threshold, not a maximum

threshold.

The problem is that Lucene cannot directly identify the top matching
documents for a given query. The strategy it adopts is to start

collecting

hits naively in doc ID order and to progressively raise the bar about

the

minimum sco

Re: Potential bug

2021-06-11 Thread baris . kazar

Let's start with writing my name correctly.

Then we can talk.

Best regards


On 6/11/21 11:57 AM, Alessandro Benedetti wrote:

Hi Bazir,
this feels like an X Y problem [1].
Can you express what is your original user requirement?
Most of the time, at the cost of indexing time/space you may get quicker
query times.
Also, you should identify where you are wasting most of your time: in the
matching phase (identifying candidates from the corpus of documents) or in
the ranking phase (scoring them by relevance).

TopScoreDocCollector is quite a solid class; there's a ton to study,
analyze, and experiment with before raising the alarm of a bug :)

Also, I didn't understand this:
"what if the user needs to limit the search process?"
Can you elaborate?

Cheers



[1] https://xyproblem.info
--
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

http://www.sease.io



Re: Potential bug

2021-06-09 Thread baris . kazar
Yes, i did those, and i believe i am at the best level of performance now,
and it is not bad at all, but i want to make it much better.

i see something like a linear drop in timings when i go to a lower number
of words, but let me do that quick study again.

Fuzzy search is always expensive, but it seems to suit my needs best.

Thanks Diego for these great questions; i already explored them. But
thanks again.


Best regards


On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

I have never used fuzzy search but from the documentation it seems very 
expensive, and if you do it on 10 terms and 1M documents it seems very very 
very expensive.

Are you using the default 'fuzziness' parameter? (0.5) - It might end up
exploring a lot of documents; did you try to play with that parameter?

Have you tried to see how the performance changes if you do not use fuzzy
(just to see whether it is fuzzy that introduces the slowdown)?
Or what happens to performance if you do fuzzy with 1, 2, or 5 terms instead of 10?


From: java-user@lucene.apache.org At: 06/09/21 18:56:31To:  
java-user@lucene.apache.org,  baris.ka...@oracle.com
Subject: Re: Potential bug

i can't reveal those details, i am very sorry, but it is more than 1 million.

let me tell you that i have a lot of code that processes results from lucene,
but the bottleneck is lucene fuzzy search.

Best regards


On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

How many documents do you have in the index?
and can you show an example of query?


From: java-user@lucene.apache.org At: 06/09/21 18:33:25To:

java-user@lucene.apache.org,  baris.ka...@oracle.com

Subject: Re: Potential bug

i have only two fields, one a string and the other a number (stored as a
string); i guess you can't go simpler than this.

i retrieve the hits, and my major bottleneck is lucene fuzzy search.


i take each word from the string, which usually has at most around 10 words,
and i build a fuzzy boolean query out of them.

a simple query is like this 10-word query.

limit means i want to stop the lucene search at around 20 hits; i don't want
thousands of hits.


Best regards


On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:


Hi Baris,


what if the user needs to limit the search process?

What do you mean by 'limit'?


there should be a way to speedup lucene then if this is not possible,
since for some simple queries it takes half a second which is too long.

What do you mean by 'simple' query? there might be multiple reasons behind
the slowness of a query that are unrelated to the search (for example, if you
retrieve many documents and for each document you are extracting the content of
many fields) - would you like to tell us a bit more about your use case?

Regards,
Diego

From: java-user@lucene.apache.org At: 06/09/21 18:18:01To:

java-user@lucene.apache.org

Cc:  baris.ka...@oracle.com
Subject: Re: Potential bug

Thanks Adrien, but the difference is too far apart.

I think the algorithm needs to be revised.

what if the user needs to limit the search process?

that leaves no control.

there should be a way to speed up lucene if this is not possible,
since for some simple queries it takes half a second, which is too long.

Best regards


On 6/9/21 1:13 PM, Adrien Grand wrote:

Hi Baris,

totalHitsThreshold is actually a minimum threshold, not a maximum threshold.

The problem is that Lucene cannot directly identify the top matching
documents for a given query. The strategy it adopts is to start collecting
hits naively in doc ID order and to progressively raise the bar about the
minimum score that is required for a hit to be competitive in order to skip
non-competitive documents. So it's expected that Lucene still collects 100s
or 1000s of hits, even though the collector is configured to only compute
the top 10 hits.
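The collection strategy described above - visit hits in doc ID order, raise a minimum competitive score once the threshold is met - can be sketched in plain Java. This is a toy model, not Lucene's actual implementation (which also uses per-block score upper bounds to skip whole ranges of documents); the class name `TopKSketch`, its `collect` method, and the scores array are invented for illustration:

```java
import java.util.PriorityQueue;

/**
 * Toy model of a top-k collector with a totalHitsThreshold:
 * hits are visited in doc ID order; once the threshold is reached,
 * hits that cannot beat the current k-th best score are skipped,
 * so the reported count becomes a lower bound, not an exact total.
 */
public class TopKSketch {
    /** Returns {hitsCounted, topQueueSize} for the given scores. */
    static int[] collect(float[] scores, int numHits, int totalHitsThreshold) {
        PriorityQueue<Float> top = new PriorityQueue<>(); // min-heap of the best numHits scores
        int hitCount = 0;
        float minCompetitive = Float.NEGATIVE_INFINITY;
        for (float score : scores) { // doc ID order
            if (hitCount >= totalHitsThreshold && score <= minCompetitive) {
                continue; // non-competitive: safe to skip once the threshold is met
            }
            hitCount++;
            top.offer(score);
            if (top.size() > numHits) {
                top.poll(); // drop the weakest of the retained top hits
            }
            if (top.size() == numHits) {
                minCompetitive = top.peek(); // raise the bar
            }
        }
        return new int[] { hitCount, top.size() };
    }

    public static void main(String[] args) {
        float[] scores = new float[2000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = i % 100; // many ties, scores between 0 and 99
        }
        int[] r = collect(scores, 10, 10);
        // More hits get counted than the 10 that are returned, but far
        // fewer than all 2000 matches are counted.
        System.out.println("counted=" + r[0] + ", returned=" + r[1]);
    }
}
```

This mirrors why a totalHitsThreshold of 10 can still report 1655 total hits while returning only 10 ScoreDocs: the count keeps growing for every hit that beats the current bar, even after the queue is full.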

On Wed, Jun 9, 2021 at 7:07 PM  wrote:


Hi,-

  i think this is a potential bug


this time i set totalHitsThreshold to 10, and i get totalHits reported as
1655, but i get 10 results in total.

I think this suggests that there might be a bug in the
TopScoreDocCollector algorithm.


Best regards



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Potential bug

2021-06-09 Thread baris . kazar

i can't reveal those details, i am very sorry, but it is more than 1 million.

let me tell you that i have a lot of code that processes results from lucene,
but the bottleneck is lucene fuzzy search.


Best regards



Re: Potential bug

2021-06-09 Thread baris . kazar
i have only two fields, one a string and the other a number (stored as a
string); i guess you can't go simpler than this.

i retrieve the hits, and my major bottleneck is lucene fuzzy search.

i take each word from the string, which usually has at most around 10 words,
and i build a fuzzy boolean query out of them.

a simple query is like this 10-word query.

limit means i want to stop the lucene search at around 20 hits; i don't want
thousands of hits.
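For a sense of why the fuzzy clauses dominate the cost of such a query: each fuzzy term must be matched within an edit-distance budget (Lucene's FuzzyQuery caps maxEdits at 2). The sketch below is not Lucene's approach - Lucene compiles each fuzzy term into a Levenshtein automaton and intersects it with the term dictionary rather than comparing terms one by one - but a plain dynamic-programming edit distance with an invented `withinEdits` helper shows the per-term work involved:

```java
/**
 * Plain dynamic-programming Levenshtein distance with a cheap length
 * pre-filter, illustrating the per-term cost behind fuzzy matching.
 * Each distance computation is O(|a| * |b|), which is why naive fuzzy
 * matching over a large term dictionary gets expensive.
 */
public class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse rows
        }
        return prev[b.length()];
    }

    /** Hypothetical helper: is b within maxEdits edits of a? */
    static boolean withinEdits(String a, String b, int maxEdits) {
        if (Math.abs(a.length() - b.length()) > maxEdits) return false; // cheap length filter
        return levenshtein(a, b) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // classic example: 3
        System.out.println(withinEdits("lucene", "lucent", 2));
    }
}
```

With ~10 fuzzy clauses OR-ed together over a million documents, this per-term budget multiplies quickly, which matches the observation that the fuzzy BooleanQuery is the bottleneck.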



Best regards







Re: Potential bug

2021-06-09 Thread baris . kazar

Thanks Adrien, but the difference is too far apart.

I think the algorithm needs to be revised.

what if the user needs to limit the search process?

that leaves no control.

there should be a way to speed up lucene if this is not possible,
since for some simple queries it takes half a second, which is too long.

Best regards





Potential bug

2021-06-09 Thread baris . kazar

Hi,-

 i think this is a potential bug


this time i set totalHitsThreshold to 10, and i get totalHits reported as
1655, but i get 10 results in total.

I think this suggests that there might be a bug in the
TopScoreDocCollector algorithm.



Best regards






Re: TopScoreDocCollector class usage

2021-06-09 Thread baris . kazar

Ok, i found it:

300 times the number of words in the search string. But this needs to be
precisely documented in the Javadocs.

i don't want to go by trial and error, and i guess nobody else wants
that either, please.



Best regards




TopScoreDocCollector class usage

2021-06-09 Thread baris . kazar

Hi,-

i used this class now before the IndexSearcher.search api (with a collector
as the 2nd arg) (please see the "an interesting case" thread before this
question)



but this time i have a very weird behavior:


i used to get 4000+ hits with the default TopScoreDocCollector.create(int
numHits, ScoreDoc after, int totalHitsThreshold)

internal usage in the IndexSearcher.search api, which is 1000, and i set
after to null here.

Now when i set totalHitsThreshold and numHits in
TopScoreDocCollector.create to 300,

i get 12200+ hits from the totalHits object.

Something is not right here, right?

How can it jump to 3 times as many when i set totalHitsThreshold to ~ 1/3
of the default value of totalHitsThreshold and numHits?



Best regards



ps.

NOTE: The search(org.apache.lucene.search.Query, int) and 
searchAfter(org.apache.lucene.search.ScoreDoc, 
org.apache.lucene.search.Query, int) methods are configured to only 
count top hits accurately up to 1,000 and may return a lower bound of 
the hit count if the hit count is greater than or equal to 1,000. On 
queries that match lots of documents, counting the number of hits may 
take much longer than computing the top hits so this trade-off allows to 
get some minimal information about the hit count without slowing down 
search too much. The TopDocs.scoreDocs array is always accurate however. 
If this behavior doesn't suit your needs, you should create collectors 
manually with either TopScoreDocCollector.create(int, int) or 
TopFieldCollector.create(org.apache.lucene.search.Sort, int, int) and 
call search(Query, Collector).



at


https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/search/IndexSearcher.html#searchAfter-org.apache.lucene.search.ScoreDoc-org.apache.lucene.search.Query-int-





Re: An interesting case

2021-06-08 Thread baris . kazar

>> energy as possible computing the number of matches of the query.
>>
>> On Tue, Jun 8, 2021 at 6:28 PM <baris.ka...@oracle.com> wrote:
>>
>>> i am currently happy with Lucene performance, but i want to understand
>>> and speed it up further by limiting the results concretely. So i still
>>> do not know why totalHits and scoreDocs report different numbers of hits.
>>>
>>> Best regards
>>>
>>> On 6/8/21 2:52 AM, Baris Kazar wrote:
>>>> my worry is actually about lucene's performance.
>>>> if lucene collects thousands of hits instead of actually n (<<< a
>>>> couple of 1000s) hits, then this creates a performance issue.
>>>> The ScoreDoc array is ok as i mentioned, ie, it has size n.
>>>> i will check the count api.
>>>> Best regards
>>>>
>>>> *From:* Adrien Grand <jpou...@gmail.com>
>>>> *Sent:* Tuesday, June 8, 2021 2:46 AM
>>>> *To:* Lucene Users Mailing List
>>>> *Cc:* Baris Kazar
>>>> *Subject:* Re: An interesting case
>>>> When you call IndexSearcher#search(Query query, int n), there are two
>>>> cases:
>>>>   - either your query matches n hits or more, and the TopDocs object
>>>> will have a ScoreDoc[] array that contains the n best scoring hits
>>>> sorted by descending score,
>>>>   - or your query matches fewer than n hits, and then the TopDocs object
>>>> will have all matches in the ScoreDoc[] array, sorted by descending score.
>>>> In both cases, TopDocs#totalHits gives information about the total
>>>> number of matches of the query. On older versions of Lucene (< 7.0)
>>>> this is an integer that is always accurate, while on more recent
>>>> versions of Lucene (>= 8.0) it is a lower bound of the total number of
>>>> matches. It typically returns the number of collected documents
>>>> indeed, though this is an implementation detail that might change in
>>>> the future.
>>>>
>>>> If you want to count the number of matches of a Query precisely, you
>>>> can use IndexSearcher#count.
>>>>
>>>> On Tue, Jun 8, 2021 at 7:51 AM <baris.ka...@oracle.com> wrote:
>>>>
>>>> https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search
>>>>
>>>> looks like someone else also had this problem, too.
>>>>
>>>> Any suggestions please?
>>>>
>>>> Best regards
>>>>
>>>> On 6/8/21 1:36 AM, baris.ka...@oracle.com wrote:
>>>> > Hi,-
>>>> >
>>>> > I use IndexSearcher.search API with two parameters like Query
>>>> > and int number (i set as 20).
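The exact-below-threshold, lower-bound-above behavior of TopDocs#totalHits can be modeled in a few lines. The class below is an illustrative stand-in, not Lucene's real org.apache.lucene.search.TotalHits, though the real one also pairs a value with a relation:

```java
/**
 * Toy model of Lucene 8+'s TotalHits: a count plus a relation that says
 * whether the count is exact or only a lower bound. It mirrors why a
 * search can report totalHits of "1655+" while returning only 10 ScoreDocs.
 */
public class TotalHitsModel {
    enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    final long value;
    final Relation relation;

    TotalHitsModel(long value, Relation relation) {
        this.value = value;
        this.relation = relation;
    }

    /** How a collector that counts accurately only up to a threshold reports. */
    static TotalHitsModel report(long collected, long threshold) {
        return collected <= threshold
            ? new TotalHitsModel(collected, Relation.EQUAL_TO)
            : new TotalHitsModel(collected, Relation.GREATER_THAN_OR_EQUAL_TO);
    }

    @Override public String toString() {
        // Lower bounds are conventionally rendered with a trailing "+".
        return relation == Relation.EQUAL_TO ? String.valueOf(value) : value + "+";
    }

    public static void main(String[] args) {
        System.out.println(report(42, 1000));   // fewer matches than the threshold: exact
        System.out.println(report(1655, 1000)); // above the threshold: lower bound
    }
}
```

In other words, comparing totalHits against ScoreDoc[] length is comparing a (possibly approximate) match count against the requested page size; for an exact count, IndexSearcher#count is the dedicated API.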

Re: An interesting case

2021-06-08 Thread baris . kazar

May i please suggest again?

the Javadocs need to be enhanced for Lucene.

There needs to be more info explaining the parameters and,

more importantly, why these two approaches
(TopScoreDocCollector vs IndexSearcher) differ in performance.



Thanks



Re: An interesting case

2021-06-08 Thread baris . kazar

yes, i see sometimes 4000+, sometimes 3000+ hits from totalHits.

So TopScoreDocCollector is working underneath the IndexSearcher.search
api, right?

in other words, TopScoreDocCollector will be saving time, right?

Thanks


On 6/8/21 1:27 PM, Adrien Grand wrote:
Yes, for instance if you care about the top 10 hits only, you could
call TopScoreDocCollector.create(10, null, 10). By default,
IndexSearcher is configured to count at least 1,000 hits, and creates
its top docs collector with TopScoreDocCollector.create(10, null, 1000).


On Tue, Jun 8, 2021 at 7:19 PM <baris.ka...@oracle.com> wrote:


Ok, i think you meant something else here.

you are not referring to the total number of hits calculation or the
mismatch, right?

so, to make lucene do minimum work to reach the matched docs,

TopScoreDocCollector should be used, right?

Let me check this class.

Thanks


On 6/8/21 1:16 PM, baris.ka...@oracle.com wrote:
> Adrien my concern is not actually the number mismatch
>
> as i mentioned it is the performance.
>
>
> seeing those numbers mismatch, it seems that lucene is still doing the same
>
> amount of work to get results no matter how many results you need in
> the indexsearcher search api.
>
>
> i thought i was clear on that.
>
>
> Lucene should not spend any energy for the count as scoredocs
already
> has that.
>
> But seeing a high totalhits number worries me, as i explained
above.
>
>
> Best regards
>
>
> On 6/8/21 1:12 PM, Adrien Grand wrote:
>> If you don't need any information about the total hit count,
you could
>> create a TopScoreDocCollector that has the same value for numHits
>> and totalHitsThreshold. This way Lucene will spend as little
energy as
>> possible computing the number of matches of the query.
>>
>> On Tue, Jun 8, 2021 at 6:28 PM mailto:baris.ka...@oracle.com>> wrote:
>>
>>> i am currently happy with Lucene performance but i want to
understand
>>> and speedup further
>>>
>>> by limiting the results concretely. So i still do not know why
totalHits
>>> and scoredocs report
>>>
>>> different number of hits.
>>>
>>>
>>> Best regards
>>>
>>>
>>> On 6/8/21 2:52 AM, Baris Kazar wrote:
>>>> my worry is actually about lucene's performance.
>>>>
>>>> if lucene collects thousands of hits instead of actually n (<<< a
>>>> couple of 1000s) hits, then this creates a performance issue.
>>>>
>>>> ScoreDoc array is ok as i mentioned, i.e., it has size n.
>>>> i will check the count api.
>>>>
>>>> Best regards
>>>>


>>>>
>>>> *From:* Adrien Grand mailto:jpou...@gmail.com>>
>>>> *Sent:* Tuesday, June 8, 2021 2:46 AM
>>>> *To:* Lucene Users Mailing List
>>>> *Cc:* Baris Kazar
>>>> *Subject:* Re: An interesting case
>>>> When you call IndexSearcher#search(Query query, int n), there are two
>>>> cases:
>>>>   - either your query matches n hits or more, and the TopDocs object
>>>> will have a ScoreDoc[] array that contains the n best scoring hits
>>>> sorted by descending score,
>>>>   - or your query matches less than n hits and then the TopDocs object
>>>> will have all matches in the ScoreDoc[] array, sorted by descending
>>>> score.
>>>> In both cases, TopDocs#totalHits gives information about the total
>>>> number of matches of the query. On older versions of Lucene (< 8.0)
>>>> this is an integer that is always accurate, while on more recent
>>>> versions of Lucene (>= 8.0) it is a lower bound of the total number of
>>>> matches. It typically returns the number of collected documents
>>>> indeed, though this is an implementation detail that might change in
>>>> the future.
>>>>
>>>> If you want to count the number of matches of a Query
precisely, you
>>>> can use IndexSearcher#count.
>>>>
>>>> On Tue, Jun 8, 2021 at 7:51 AM mailto:baris.ka...@oracle.com>
>>>>

Re: An interesting case

2021-06-08 Thread baris . kazar

Ok i think you meant something else here.

you are not referring to the total number of hits calculation or the 
mismatch, right?




so to make lucene do the minimum work to reach the matched docs


TopScoreDocCollector should be used, right?


Let me check this class.

Thanks


On 6/8/21 1:16 PM, baris.ka...@oracle.com wrote:

Adrien my concern is not actually the number mismatch

as i mentioned it is the performance.


seeing those numbers mismatch, it seems that lucene is still doing the same

amount of work to get results no matter how many results you need in 
the indexsearcher search api.



i thought i was clear on that.


Lucene should not spend any energy for the count as scoredocs already 
has that.


But seeing a high totalhits number worries me, as i explained above.


Best regards


On 6/8/21 1:12 PM, Adrien Grand wrote:

If you don't need any information about the total hit count, you could
create a TopScoreDocCollector that has the same value for numHits
and totalHitsThreshold. This way Lucene will spend as little energy as
possible computing the number of matches of the query.

On Tue, Jun 8, 2021 at 6:28 PM  wrote:


i am currently happy with Lucene performance but i want to understand
and speedup further

by limiting the results concretely. So i still do not know why totalHits
and scoredocs report

different number of hits.


Best regards


On 6/8/21 2:52 AM, Baris Kazar wrote:

my worry is actually about lucene's performance.

if lucene collects thousands of hits instead of actually n (<<< a
couple of 1000s) hits, then this creates a performance issue.

ScoreDoc array is ok as i mentioned, i.e., it has size n.
i will check the count api.

Best regards
 


*From:* Adrien Grand 
*Sent:* Tuesday, June 8, 2021 2:46 AM
*To:* Lucene Users Mailing List
*Cc:* Baris Kazar
*Subject:* Re: An interesting case
When you call IndexSearcher#search(Query query, int n), there are two
cases:
  - either your query matches n hits or more, and the TopDocs object
will have a ScoreDoc[] array that contains the n best scoring hits
sorted by descending score,
  - or your query matches less than n hits and then the TopDocs object
will have all matches in the ScoreDoc[] array, sorted by descending
score.

In both cases, TopDocs#totalHits gives information about the total
number of matches of the query. On older versions of Lucene (< 8.0)
this is an integer that is always accurate, while on more recent
versions of Lucene (>= 8.0) it is a lower bound of the total number of
matches. It typically returns the number of collected documents
indeed, though this is an implementation detail that might change in
the future.

If you want to count the number of matches of a Query precisely, you
can use IndexSearcher#count.

On Tue, Jun 8, 2021 at 7:51 AM mailto:baris.ka...@oracle.com>> wrote:


https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search


 looks like someone else also had this problem, too.

 Any suggestions please?

 Best regards


 On 6/8/21 1:36 AM, baris.ka...@oracle.com
 <mailto:baris.ka...@oracle.com> wrote:
 > Hi,-
 >
 >  I use IndexSearcher.search API with two parameters like Query
 and int
 > number (i set as 20).
 >
 > However, when i look at the TopDocs object which is the result
 of this
 > above API call
 >
 > i see thousands of hits from totalhits. Is this inaccurate, or is
 > Lucene actually doing the search based on that many results?
 >
 > But when i iterate over result of above API call's scoreDocs
 object i
 > get int number of hits (ie, 20 hits).
 >
 >
 > I am trying to find out why org.apache.lucene.search.TopDocs.totalHits
 > reports a higher number of collected results than the actual number of
 > results. I see on the order of a couple of thousands vs 20.
 >
 >
 > Best regards
 >
 >
 >

-
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 <mailto:java-user-unsubscr...@lucene.apache.org>
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 <mailto:java-user-h...@lucene.apache.org>



--
Adrien







Re: An interesting case

2021-06-08 Thread baris . kazar

Adrien my concern is not actually the number mismatch

as i mentioned it is the performance.


seeing those numbers mismatch, it seems that lucene is still doing the same

amount of work to get results no matter how many results you need in the 
indexsearcher search api.



i thought i was clear on that.


Lucene should not spend any energy for the count as scoredocs already 
has that.


But seeing a high totalhits number worries me, as i explained above.


Best regards


On 6/8/21 1:12 PM, Adrien Grand wrote:

If you don't need any information about the total hit count, you could
create a TopScoreDocCollector that has the same value for numHits
and totalHitsThreshold. This way Lucene will spend as little energy as
possible computing the number of matches of the query.

On Tue, Jun 8, 2021 at 6:28 PM  wrote:


i am currently happy with Lucene performance but i want to understand
and speedup further

by limiting the results concretely. So i still do not know why totalHits
and scoredocs report

different number of hits.


Best regards


On 6/8/21 2:52 AM, Baris Kazar wrote:

my worry is actually about lucene's performance.

if lucene collects thousands of hits instead of actually n (<<< a
couple of 1000s) hits, then this creates a performance issue.

ScoreDoc array is ok as i mentioned, i.e., it has size n.
i will check the count api.

Best regards

*From:* Adrien Grand 
*Sent:* Tuesday, June 8, 2021 2:46 AM
*To:* Lucene Users Mailing List
*Cc:* Baris Kazar
*Subject:* Re: An interesting case
When you call IndexSearcher#search(Query query, int n), there are two
cases:
  - either your query matches n hits or more, and the TopDocs object
will have a ScoreDoc[] array that contains the n best scoring hits
sorted by descending score,
  - or your query matches less than n hits and then the TopDocs object
will have all matches in the ScoreDoc[] array, sorted by descending
score.

In both cases, TopDocs#totalHits gives information about the total
number of matches of the query. On older versions of Lucene (< 8.0)
this is an integer that is always accurate, while on more recent
versions of Lucene (>= 8.0) it is a lower bound of the total number of
matches. It typically returns the number of collected documents
indeed, though this is an implementation detail that might change in
the future.

If you want to count the number of matches of a Query precisely, you
can use IndexSearcher#count.

On Tue, Jun 8, 2021 at 7:51 AM mailto:baris.ka...@oracle.com>> wrote:



https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search


 looks like someone else also had this problem, too.

 Any suggestions please?

 Best regards


 On 6/8/21 1:36 AM, baris.ka...@oracle.com
 <mailto:baris.ka...@oracle.com> wrote:
 > Hi,-
 >
 >  I use IndexSearcher.search API with two parameters like Query
 and int
 > number (i set as 20).
 >
 > However, when i look at the TopDocs object which is the result
 of this
 > above API call
 >
 > i see thousands of hits from totalhits. Is this inaccurate, or is
 > Lucene actually doing the search based on that many results?
 >
 > But when i iterate over result of above API call's scoreDocs
 object i
 > get int number of hits (ie, 20 hits).
 >
 >
 > I am trying to find out why org.apache.lucene.search.TopDocs.totalHits
 > reports a higher number of collected results than the actual number of
 > results. I see on the order of a couple of thousands vs 20.
 >
 >
 > Best regards
 >
 >
 >




--
Adrien







Re: On which field document is searched

2021-06-08 Thread baris . kazar

I guess you can setup an experiment like

search your text against each field and then look at the score, but you 
need to normalize the scores in order to compare them, and

normalization will probably need to account for the length of the field, etc.

Maybe there is an API in Lucene for this but i don't know.
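A sketch of that per-field experiment, assuming the lucene-queryparser module is on the classpath; the `searcher` and the field names are placeholders, and note that raw BM25 scores are not normalized across fields, so the numbers are only a rough signal:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

public class PerFieldProbe {
    // Run the same text against each field separately and print the best score per field.
    static void probe(IndexSearcher searcher, String text, String... fields) throws Exception {
        for (String field : fields) {
            QueryParser parser = new QueryParser(field, new StandardAnalyzer());
            TopDocs top = searcher.search(parser.parse(text), 1);
            // 0 when the field produced no match at all.
            float best = top.scoreDocs.length > 0 ? top.scoreDocs[0].score : 0f;
            System.out.println(field + ": " + best);
        }
    }
}
```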

Hope this helps

Best regards


On 6/8/21 4:53 AM, Vivek Gobhil wrote:


Hi,

I am creating a full text search API, and one of my requirements is to 
find out which exact field the input text matched when the 
document has, say, more than 10 fields.


Is there any way I can find out what is the most relevant field in the 
document against the input search text?


Thanks in advance.

*—*
*Vivek Gobhil*
Senior Technology Architect
Precisely.com 











Re: An interesting case

2021-06-08 Thread baris . kazar
i am currently happy with Lucene performance but i want to understand 
and speedup further


by limiting the results concretely. So i still do not know why totalHits 
and scoredocs report


different number of hits.


Best regards


On 6/8/21 2:52 AM, Baris Kazar wrote:

my worry is actually about lucene's performance.

if lucene collects thousands of hits instead of actually n (<<< a 
couple of 1000s) hits, then this creates a performance issue.


ScoreDoc array is ok as i mentioned, i.e., it has size n.
i will check the count api.

Best regards

*From:* Adrien Grand 
*Sent:* Tuesday, June 8, 2021 2:46 AM
*To:* Lucene Users Mailing List
*Cc:* Baris Kazar
*Subject:* Re: An interesting case
When you call IndexSearcher#search(Query query, int n), there are two 
cases:
 - either your query matches n hits or more, and the TopDocs object 
will have a ScoreDoc[] array that contains the n best scoring hits 
sorted by descending score,
 - or your query matches less than n hits and then the TopDocs object 
will have all matches in the ScoreDoc[] array, sorted by descending score.

In both cases, TopDocs#totalHits gives information about the total 
number of matches of the query. On older versions of Lucene (< 8.0) 
this is an integer that is always accurate, while on more recent 
versions of Lucene (>= 8.0) it is a lower bound of the total number of 
matches. It typically returns the number of collected documents 
indeed, though this is an implementation detail that might change in 
the future.


If you want to count the number of matches of a Query precisely, you 
can use IndexSearcher#count.


On Tue, Jun 8, 2021 at 7:51 AM <mailto:baris.ka...@oracle.com>> wrote:



https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search


looks like someone else also had this problem, too.

Any suggestions please?

Best regards


On 6/8/21 1:36 AM, baris.ka...@oracle.com
<mailto:baris.ka...@oracle.com> wrote:
> Hi,-
>
>  I use IndexSearcher.search API with two parameters like Query
and int
> number (i set as 20).
>
> However, when i look at the TopDocs object which is the result
of this
> above API call
>
> i see thousands of hits from totalhits. Is this inaccurate, or is
> Lucene actually doing the search based on that many results?
>
> But when i iterate over result of above API call's scoreDocs
object i
> get int number of hits (ie, 20 hits).
>
>
> I am trying to find out why org.apache.lucene.search.TopDocs.totalHits
> reports a higher number of collected results than the actual number of
> results. I see on the order of a couple of thousands vs 20.
>
>
> Best regards
>
>
>




--
Adrien


Re: An interesting case

2021-06-07 Thread Baris Kazar
my worry is actually about the lucene's performance.

if lucene collects thousands of hits instead of actually n (<<< a couple of 
1000s) hits, then this creates performance issue.

ScoreDoc array is ok as i mentioned ie, it has size n.
i will check count api.

Best regards

From: Adrien Grand 
Sent: Tuesday, June 8, 2021 2:46 AM
To: Lucene Users Mailing List
Cc: Baris Kazar
Subject: Re: An interesting case

When you call IndexSearcher#search(Query query, int n), there are two cases:
 - either your query matches n hits or more, and the TopDocs object will have a 
ScoreDoc[] array that contains the n best scoring hits sorted by descending 
score,
 - or your query matches less than n hits and then the TopDocs object will have 
all matches in the ScoreDoc[] array, sorted by descending score.

In both cases, TopDocs#totalHits gives information about the total number of 
matches of the query. On older versions of Lucene (< 8.0) this is an integer 
that is always accurate, while on more recent versions of Lucene (>= 8.0) it is 
a lower bound of the total number of matches. It typically returns the number 
of collected documents indeed, though this is an implementation detail that 
might change in the future.

If you want to count the number of matches of a Query precisely, you can use 
IndexSearcher#count.
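A sketch contrasting the two, assuming Lucene 8.x and placeholders for an open IndexSearcher named `searcher` and a Query named `query`:

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class HitCounts {
    static void report(IndexSearcher searcher, Query query) throws IOException {
        TopDocs top = searcher.search(query, 20);
        // scoreDocs holds at most 20 entries: min(n, number of matches).
        int returned = top.scoreDocs.length;
        // Since 8.0 this may be a lower bound; check top.totalHits.relation.
        long lowerBound = top.totalHits.value;
        // Exact count, at the cost of visiting every matching document.
        int exact = searcher.count(query);
        System.out.println(returned + " returned, >= " + lowerBound + ", exactly " + exact);
    }
}
```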

On Tue, Jun 8, 2021 at 7:51 AM 
mailto:baris.ka...@oracle.com>> wrote:
https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search

looks like someone else also had this problem, too.

Any suggestions please?

Best regards


On 6/8/21 1:36 AM, baris.ka...@oracle.com<mailto:baris.ka...@oracle.com> wrote:
> Hi,-
>
>  I use IndexSearcher.search API with two parameters like Query and int
> number (i set as 20).
>
> However, when i look at the TopDocs object which is the result of this
> above API call
>
> i see thousands of hits from totalhits. Is this inaccurate, or is Lucene
> actually doing the search based on that many results?
>
> But when i iterate over result of above API call's scoreDocs object i
> get int number of hits (ie, 20 hits).
>
>
> I am trying to find out why org.apache.lucene.search.TopDocs.totalHits
> reports a higher number of collected results than the actual number of
> results. I see on the order of a couple of thousands vs 20.
>
>
> Best regards
>
>
>




--
Adrien


Re: An interesting case

2021-06-07 Thread baris . kazar

https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search

looks like someone else also had this problem, too.

Any suggestions please?

Best regards


On 6/8/21 1:36 AM, baris.ka...@oracle.com wrote:

Hi,-

 I use the IndexSearcher.search API with two parameters, a Query and an int 
number (which i set to 20).


However, when i look at the TopDocs object which is the result of this 
above API call


i see thousands of hits from totalhits. Is this inaccurate, or is Lucene 
actually doing the search based on that many results?


But when i iterate over the above API call's scoreDocs object i 
get the requested number of hits (i.e., 20 hits).



I am trying to find out why org.apache.lucene.search.TopDocs.totalHits 
reports a higher number of collected results than


the actual number of results. I see on the order of a couple of 
thousands vs 20.



Best regards








An interesting case

2021-06-07 Thread baris . kazar

Hi,-

 I use the IndexSearcher.search API with two parameters, a Query and an int 
number (which i set to 20).


However, when i look at the TopDocs object which is the result of this 
above API call


i see thousands of hits from totalhits. Is this inaccurate, or is Lucene 
actually doing the search based on that many results?


But when i iterate over the above API call's scoreDocs object i 
get the requested number of hits (i.e., 20 hits).



I am trying to find out why org.apache.lucene.search.TopDocs.totalHits 
reports a higher number of collected results than


the actual number of results. I see on the order of a couple of 
thousands vs 20.



Best regards







Interface IndexReader.CacheHelper

2021-03-26 Thread baris . kazar

Hi,-

https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/index/IndexReader.CacheHelper.html?is-external=true

 It would be nice to have a more detailed explanation and maybe an 
example for this interesting interface.


Best regards






MemoryIndex class

2021-03-26 Thread baris . kazar

Hi,-

https://lucene.apache.org/core/8_5_2/memory/index.html

what is meant by single document in this sentence?

"High-performance single-document main memory Apache Lucene fulltext 
search index."



The doc for this MemoryIndex still mentions the deprecated class 
RAMDirectory.


https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/RAMDirectory.html


It would be nice if more info is available about Lucene classes.
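Regarding "single-document": a MemoryIndex instance holds exactly one document, entirely in main memory, which makes it useful for matching one document against many queries. A sketch, assuming the lucene-memory and lucene-queryparser modules are on the classpath:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;

public class MemoryIndexDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // The whole index represents exactly one document.
        MemoryIndex doc = new MemoryIndex();
        doc.addField("title", "lucene memory index", analyzer);
        doc.addField("body", "a single document kept entirely in main memory", analyzer);
        // search() scores this one document against the query; > 0 means it matches.
        float score = doc.search(new QueryParser("body", analyzer).parse("single document"));
        System.out.println(score > 0 ? "match" : "no match");
    }
}
```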

Best regards






NRTCachingDirectory class information

2021-03-26 Thread baris . kazar

Hi,-

 Related to my previous thread: Warming up index files via cat to make 
it in memory index



I found out about this class in the Book Lucene 4 Cookbook by Edwood Ng.


May i please ask for any pointers, best-practices papers, or any Lucene 
documentation


comparing this NRTCachingDirectory class for performance?

https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/NRTCachingDirectory.html


In the docs link above, there is a statement like this:

This will cache all newly flushed segments, all merges whose expected 
segment size is <= 5 MB, unless the net cached bytes exceeds 60 MB at 
which point all writes will not be cached (until the net bytes falls 
below 60 MB).
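The numbers in that javadoc sentence map directly onto the constructor arguments; a sketch, assuming a placeholder local index path:

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

public class NrtCachingSetup {
    static Directory open(String indexPath) throws IOException {
        // Cache newly flushed segments and merges <= 5 MB,
        // up to 60 MB of net cached bytes, as in the javadoc quoted above.
        return new NRTCachingDirectory(FSDirectory.open(Paths.get(indexPath)), 5.0, 60.0);
    }
}
```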



Is there also a Lucene doc that describes the indexing that happens during the 
query (search) process?


Best regards





Warming up index files via cat to make it in memory index

2021-03-25 Thread baris . kazar

Hi,-

 This new thread is the continuation of previous thread back in Feb 2021:

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)


May i mention that i cat'ed the *.fdt files (the largest among the 98 
index files generated), redirecting them to new files, so


that these files are cached to create an in-memory lucene index? i have 
598GB available in memory and the total size of the index files is < 16GB.


So, i think the cat'ed fdt files must be in memory.

i also cat'ed smaller files via just cat .


I still see similar performance from lucene index after doing this.

Are there any suggestions please?

Thanks







Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Baris Kazar
So, just cat  will do this.
Thanks

From: Robert Muir 
Sent: Tuesday, February 23, 2021 4:45 PM
To: Baris Kazar 
Cc: java-user 
Subject: Re: MMapDirectory vs In Memory Lucene Index (i.e., 
ByteBuffersDirectory)

The preload isn't magical.
It only "reads in the whole file" to get it cached, same as if you did that 
yourself with 'cat' or 'dd'.
It "warms" the file.

It just does this in an efficient way at the low level to make the warming 
itself efficient. It madvise()s kernel to announce some read-ahead and then 
reads the first byte of every mmap'd page (which is enough to fault it in).

At the end of the day it doesn't matter if you wrote a shitty shell script that 
uses 'dd' to read in each index file and send it to /dev/null, or whether you 
spent lots of time writing fancy java code to call this preload thing: you get 
the same result, same end state.

Maybe the preload takes 18 seconds to "warm" the index, vs. your crappy shell 
script which takes 22 seconds. It is mainly more important for servers and 
portability (e.g. it will work fine on windows, but obviously will not call 
madvise).
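For reference, the preload Robert describes is exposed on MMapDirectory; a sketch, assuming Lucene 8.x and a placeholder index path:

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.store.MMapDirectory;

public class PreloadOpen {
    static MMapDirectory openWarm(String indexPath) throws IOException {
        MMapDirectory dir = new MMapDirectory(Paths.get(indexPath));
        // Touch every mmap'd page up front so the OS page cache is warm:
        // the same end state as cat'ing or dd'ing the index files.
        dir.setPreload(true);
        return dir;
    }
}
```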

On Tue, Feb 23, 2021 at 4:18 PM 
mailto:baris.ka...@oracle.com>> wrote:

Thanks again, Robert. Could you please explain "preload"? Which functionality 
is that? we discussed in this thread before about a preload.

Is there a Lucene url / site that i can look at for preload?

Thanks for the explanations. This thread will be useful for many folks i 
believe.

Best regards


On 2/23/21 4:15 PM, Robert Muir wrote:


On Tue, Feb 23, 2021 at 4:07 PM 
mailto:baris.ka...@oracle.com>> wrote:

What i want to achieve: Problem statement:

base case is disk based Lucene index with FSDirectory

speedup case was supposed to be in memory Lucene index with MMapDirectory

On 64-bit systems, FSDirectory just invokes MMapDirectory already. So you don't 
need to do anything.

Either way MMapDirectory or NIOFSDirectory are doing the same thing: reading 
your index as a normal file and letting the operating system cache it.
The MMapDirectory is just better because it avoids some overheads, such as 
read() system call, copying and buffering into java memory space, etc etc.
Some of these overheads are only getting worse, e.g. spectre/meltdown-type 
fixes make syscalls 8x slower on my computer. So it is good that MMapDirectory 
avoids it.

So I suggest just stop fighting the operating system, don't give your J2EE 
container huge amounts of ram, let the kernel do its job.
If you want to "warm" a cold system because nothing is in kernel's cache, then 
look into preload and so on. It is just "reading files" to get them cached.


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar
Thanks again, Robert. Could you please explain "preload"? Which 
functionality is that? we discussed in this thread before about a preload.


Is there a Lucene url / site that i can look at for preload?

Thanks for the explanations. This thread will be useful for many folks i 
believe.


Best regards


On 2/23/21 4:15 PM, Robert Muir wrote:



On Tue, Feb 23, 2021 at 4:07 PM > wrote:


What i want to achieve: Problem statement:

base case is disk based Lucene index with FSDirectory

speedup case was supposed to be in memory Lucene index with
MMapDirectory

On 64-bit systems, FSDirectory just invokes MMapDirectory already. So 
you don't need to do anything.


Either way MMapDirectory or NIOFSDirectory are doing the same thing: 
reading your index as a normal file and letting the operating system 
cache it.
The MMapDirectory is just better because it avoids some overheads, 
such as read() system call, copying and buffering into java memory 
space, etc etc.
Some of these overheads are only getting worse, e.g. 
spectre/meltdown-type fixes make syscalls 8x slower on my computer. So 
it is good that MMapDirectory avoids it.


So I suggest just stop fighting the operating system, don't give your 
J2EE container huge amounts of ram, let the kernel do its job.
If you want to "warm" a cold system because nothing is in kernel's 
cache, then look into preload and so on. It is just "reading files" to 
get them cached.


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar

(edited previous response)


Thanks, but for each different query, on the first run i see some slowdown 
(not much though) with MMapDirectory and FSDirectory wrt the second and third 
runs (due to cold start).


Cold-start slowdown is a little bit more with FSDirectory. So, 
MMapDirectory is slightly better in that, too: i.e., cold start.



What i want to achieve: Problem statement:

base case is disk based Lucene index with FSDirectory

speedup case was supposed to be in memory Lucene index with MMapDirectory


Uwe mentioned tmpfs will help. i will try that next.


I thought preload was not helping much as we discussed here.

Thanks


On 2/23/21 3:54 PM, Robert Muir wrote:
speedup over what? You are probably already using MMapDirectory (it is 
the default). So I don't know what you are trying to achieve, but 
giving lots of memory to your java process is not going to help.


If you just want to prevent the first few queries to a fresh cold 
machine instance from being slow, you can use the preload for that 
before you make it available. You could also use 'cat' or 'dd'.


On Tue, Feb 23, 2021 at 3:45 PM > wrote:


Thanks but then how will MMapDirectory help gain speedup?

i will try tmpfs and see what happens. i was expecting to get on
order of magnitude of speedup from already very fast on disk
Lucene indexes.

So i was expecting really really really fast response with
MMapDirectory.

Thanks


On 2/23/21 3:40 PM, Robert Muir wrote:

Don't give gobs of memory to your java process, you will just
make things slower. The kernel will cache your index files.

On Tue, Feb 23, 2021 at 1:45 PM mailto:baris.ka...@oracle.com>> wrote:

Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:
>
>
> On Tue, Feb 23, 2021 at 2:30 AM mailto:baris.ka...@oracle.com>
> >> wrote:
>
>     Hi,-
>
>       I tried MMapDirectory and i allocated as big as index
size on my
>     J2EE
>     Container but
>
>
> Don't allocate java heap memory for the index,
MMapDirectory does not
> use java heap memory!



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar
Thanks, but for each different query i see some slowdown (not much though) 
with MMapDirectory and FSDirectory.


The slowdown is a little bit more with FSDirectory. So, MMapDirectory is slightly 
better in that, too: i.e., cold start.



What i want to achieve: Problem statement:

base case is disk based Lucene index with FSDirectory

speedup case was supposed to be in memory Lucene index with MMapDirectory


Uwe mentioned tmpfs will help. i will try that next.

Thanks


On 2/23/21 3:54 PM, Robert Muir wrote:
speedup over what? You are probably already using MMapDirectory (it is 
the default). So I don't know what you are trying to achieve, but 
giving lots of memory to your java process is not going to help.


If you just want to prevent the first few queries to a fresh cold 
machine instance from being slow, you can use the preload for that 
before you make it available. You could also use 'cat' or 'dd'.


On Tue, Feb 23, 2021 at 3:45 PM > wrote:


Thanks but then how will MMapDirectory help gain speedup?

i will try tmpfs and see what happens. i was expecting to get on
order of magnitude of speedup from already very fast on disk
Lucene indexes.

So i was expecting really really really fast response with
MMapDirectory.

Thanks


On 2/23/21 3:40 PM, Robert Muir wrote:

Don't give gobs of memory to your java process, you will just
make things slower. The kernel will cache your index files.

On Tue, Feb 23, 2021 at 1:45 PM baris.ka...@oracle.com wrote:

Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:
>
>
> On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:
>
>     Hi,-
>
>       I tried MMapDirectory and i allocated as big as index
size on my
>     J2EE
>     Container but
>
>
> Don't allocate java heap memory for the index,
MMapDirectory does not
> use java heap memory!



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar

Thanks but then how will MMapDirectory help gain speedup?

i will try tmpfs and see what happens. i was expecting to get on order 
of magnitude of speedup from already very fast on disk Lucene indexes.


So i was expecting really really really fast response with MMapDirectory.

Thanks


On 2/23/21 3:40 PM, Robert Muir wrote:
Don't give gobs of memory to your java process, you will just make 
things slower. The kernel will cache your index files.


On Tue, Feb 23, 2021 at 1:45 PM baris.ka...@oracle.com wrote:


Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:
>
>
> On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:
>
>     Hi,-
>
>       I tried MMapDirectory and i allocated as big as index size
on my
>     J2EE
>     Container but
>
>
> Don't allocate java heap memory for the index, MMapDirectory
does not
> use java heap memory!



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar
As Uwe suggested some time ago, tmpfs file system usage with 
MMapDirectory is


the only way to get high speedup wrt on disk Lucene index, right?

Best regards


On 2/23/21 1:44 PM, baris.ka...@oracle.com wrote:


Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:



On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:


Hi,-

  I tried MMapDirectory and i allocated as big as index size on
my J2EE
Container but


Don't allocate java heap memory for the index, MMapDirectory does not 
use java heap memory!


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread baris . kazar

Ok, but how is this MMapDirectory used then?

Best regards


On 2/23/21 7:03 AM, Robert Muir wrote:



On Tue, Feb 23, 2021 at 2:30 AM baris.ka...@oracle.com wrote:


Hi,-

  I tried MMapDirectory and i allocated as big as index size on my
J2EE
Container but


Don't allocate java heap memory for the index, MMapDirectory does not 
use java heap memory!


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-22 Thread baris . kazar

Hi,-

 I tried MMapDirectory and i allocated as big as index size on my J2EE 
Container but


it only gives me at most 25% speedup and even sometimes a small amount 
of slowdown.


How can i effectively use Lucene indexes in memory?

Best regards


On 12/14/20 6:35 PM, baris.ka...@oracle.com wrote:

Thanks Robert.

I think these valuable comments need to be placed on javadocs for 
future references.


i think i am getting enough info for making a decision:

i will use MMapDirectory without setPreload and i hope my index will 
fit into the RAM.


i plan to post a blog for findings.

Best regards


On 12/14/20 5:52 PM, Robert Muir wrote:

On Mon, Dec 14, 2020 at 1:59 PM Uwe Schindler  wrote:

Hi,

as the writer of the original blog post, here are my comments:

Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post
to load everything into memory - but that does not guarantee anything!
Still, I would not recommend to use that function, because all it does is to
just touch every page of the file, so the linux kernel puts it into OS cache
- nothing more; IMHO very ineffective as it slows down opening the index for a
stupid for-each-page-touch-loop. It will do this with EVERY page, whether it is
later used or not! So this may take some time until it is done. Later on,
still Lucene needs to open index files, initialize its own data
structures,...

In general it is much better to open index, with MMAP directory and execute
some "sample" queries. This will do exactly the same like the preload
function, but it is more "selective". Parts of the index which are not used
won't be touched, and on top, it will also load ALL the required index
structures to heap.


The main purpose of this thing is a fast warming option for random
access files such as "i want to warm all my norms in RAM" or "i want
to warm all my docvalues in RAM"... really it should only be used with
the FileSwitchDirectory for a targeted purpose such as that: it is
definitely a waste to set it for your entire index. It is just
exposing the 
https://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html#load()

which first calls madvise(MADV_WILLNEED) and then touches every page.
If you want to "warm" an ENTIRE very specific file for a reason like
this (e.g. per-doc scoring value, ensuring it will be hot for all
docs), it is hard to be more efficient than that.
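What Robert describes above, setPreload delegating to MappedByteBuffer.load(), can be reproduced with plain JDK code. A sketch, where any regular file stands in for a single Lucene index file; `PreloadDemo` is a hypothetical name:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of what preloading does per file: map it read-only and call
// MappedByteBuffer.load(), which (best effort) touches every page so the
// kernel pulls the whole file into the page cache.
class PreloadDemo {
    static MappedByteBuffer mapAndLoad(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.load(); // best effort only: the OS may still evict these pages later
            return buf; // the mapping stays valid after the channel is closed
        }
    }
}
```

Note that even directly after load(), isLoaded() may return false: the call is a hint to the OS, not a pin, which matches the "best-effort" wording in the setPreload javadoc quoted in this thread.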

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread baris . kazar

Thanks Robert.

I think these valuable comments need to be placed on javadocs for future 
references.


i think i am getting enough info for making a decision:

i will use MMapDirectory without setPreload and i hope my index will fit 
into the RAM.


i plan to post a blog for findings.

Best regards


On 12/14/20 5:52 PM, Robert Muir wrote:

On Mon, Dec 14, 2020 at 1:59 PM Uwe Schindler  wrote:

Hi,

as the writer of the original blog post, here are my comments:

Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post
to load everything into memory - but that does not guarantee anything!
Still, I would not recommend to use that function, because all it does is to
just touch every page of the file, so the linux kernel puts it into OS cache
- nothing more; IMHO very ineffective as it slows down opening the index for a
stupid for-each-page-touch-loop. It will do this with EVERY page, whether it is
later used or not! So this may take some time until it is done. Later on,
still Lucene needs to open index files, initialize its own data
structures,...

In general it is much better to open index, with MMAP directory and execute
some "sample" queries. This will do exactly the same like the preload
function, but it is more "selective". Parts of the index which are not used
won't be touched, and on top, it will also load ALL the required index
structures to heap.


The main purpose of this thing is a fast warming option for random
access files such as "i want to warm all my norms in RAM" or "i want
to warm all my docvalues in RAM"... really it should only be used with
the FileSwitchDirectory for a targeted purpose such as that: it is
definitely a waste to set it for your entire index. It is just
exposing the 
https://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html#load()
which first calls madvise(MADV_WILLNEED) and then touches every page.
If you want to "warm" an ENTIRE very specific file for a reason like
this (e.g. per-doc scoring value, ensuring it will be hot for all
docs), it is hard to be more efficient than that.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread baris . kazar
I see, i think i will use the first way, the constructor with MMap, and i
will not use the setPreload api to avoid slowdowns.


yes, i was expecting a warning from eclipse in the second usage but 
nothing came up.


Thanks for the clarifications.

Best regards


On 12/14/20 2:55 PM, Uwe Schindler wrote:

Hi,

  

Thanks Uwe, i am not insisting on to load everything into memory

but loading into memory might speed up and i would like to see how much
speedup.


but i have one more question and that is still not clear to me:

"it is much better to open index, with MMAP directory"


does this mean i should not use the constructor but instead use the open
api?

No that means, use MMapDirectory, it should fit your needs. If you have enough 
memory outside of heap in your operating system that can be used by Lucene to 
have all pages of the mmaped file in memory then it’s the best you can have.

FSDirectory.open() is fine as it will always use MMapDirectory on 64 bit 
platforms.


in other words: which way should be preferred?

Does not matter. If you want to use setPreload() [beware of slowdowns on 
opening index files for first time!!!], use constructor of MMAPDirectory, 
because the FSDirectoryFactory cannot guarantee which implementation you get.

Calling a static method on a class that does not implement it is generally
considered bad practice (Eclipse should warn you). The static
FSDirectory.open() is a factory method and should be used (on FSDirectory, not
its subclass) if you don't know what you want to have and want to be operating
system independent. If you want MMapDirectory and its features specifically,
use the constructor.


The example is from both during indexing and searching:


/*First way: Using constructor (without setPreload) :*/

MMapDirectory dir = new MMapDirectory(Paths.get(indexDir)); // Uses
FSLockFactory.getDefault() and DEFAULT_MAX_CHUNK_SIZE which is 1GB
if (dir.getPreload() == false)
  dir.setPreload(Constants.PRELOAD_YES); // In-Memory Lucene Index
enabled-> *commented out*
IndexReader reader = DirectoryReader.open(dir);

...


/*Second way: Or using open (without setPreload) :*/

*Directory* dir = MMapDirectory.open(Paths.get(indexDir)); //open is
inherited from FSDirectory
if (dir.getPreload() == false)
  dir.setPreload(Constants.PRELOAD_YES); // In-Memory Lucene Index
enabled-> *here setPreload cannot be used*
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher is = new IndexSearcher(reader);

...


Best regards


On 12/14/20 1:51 PM, Uwe Schindler wrote:

Hi,

as the writer of the original blog post, here are my comments:

Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post
to load everything into memory - but that does not guarantee anything!
Still, I would not recommend to use that function, because all it does is to
just touch every page of the file, so the linux kernel puts it into OS cache
- nothing more; IMHO very ineffective as it slows down opening the index for a
stupid for-each-page-touch-loop. It will do this with EVERY page, whether it is
later used or not! So this may take some time until it is done. Later on,
still Lucene needs to open index files, initialize its own data
structures,...

In general it is much better to open index, with MMAP directory and execute
some "sample" queries. This will do exactly the same like the preload
function, but it is more "selective". Parts of the index which are not used
won't be touched, and on top, it will also load ALL the required index
structures to heap.

As always and as mentioned in my blog post: there's nothing that can ensure
your index will stay in memory. Please trust the kernel to do the right
thing. Why do you care at all?

If you are curious and want to have everything in memory all the time:
- use tmpfs as your filesystem (of course you will lose data when OS shuts
down)
- disable swap and/or reduce swappiness
- use only as much heap as needed, keep everything of free memory for your
index outside heap.
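The tmpfs option above can be sketched in code: copy the read-only index into a tmpfs-backed path and open that copy. `TmpfsCopy` and the target path are assumptions (on most Linux distributions /dev/shm is a tmpfs mount); the authoritative index must stay on persistent disk, since tmpfs contents vanish on reboot:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch, assuming a Linux host: copy a read-only index into a tmpfs
// mount (e.g. /dev/shm) so every index file is RAM-backed.
class TmpfsCopy {
    // Copies the regular files of src into tmpfsRoot/<src-name> and
    // returns the new index directory to open with MMapDirectory.
    static Path copyIndex(Path src, Path tmpfsRoot) throws IOException {
        Path dst = Files.createDirectories(tmpfsRoot.resolve(src.getFileName()));
        try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    Files.copy(f, dst.resolve(f.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return dst;
    }
}
```

A caller might then do `new MMapDirectory(TmpfsCopy.copyIndex(indexDir, Paths.get("/dev/shm")))`; the path and this pattern are illustrative, not something the thread prescribes.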

Fake feelings of "everything in RAM" are misconceptions like:
- use RAMDirectory (deprecated): this may be a disaster, as described in
the blog post
- use ByteBuffersDirectory: a little bit better, but this brings nothing, as
the operating system kernel may still page out your index pages. They still
live in/off heap and are part of usual paging. They are just no longer
backed by a file.

Lucene does most of the stuff outside heap, live with it!

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen


https://www.thetaphi.de

eMail: u...@thetaphi.de


-Original Message-
From: baris.ka...@oracle.com 
Sent: Sunday, December 13, 2020 10:18 PM
To: java-user@lucene.apache.org
Cc: BARIS KAZAR 
Subject: MMapDirectory vs In Memory Lucene Index (i.e.,

ByteBuffersDirectory)

Hi,-

it would be nice to create a Luc

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread baris . kazar

This also brings me another question:

does using MMap over FSDirectory bring any advantage with or without tmpfs?

Best regards


On 12/14/20 2:17 PM, Jigar Shah wrote:

Thanks, Uwe

Yes, recommended, tmpfs/ramfs worked like a charm in our use-case with a
read-only index, giving us very high-throughput and consistent response
time on queries.

We had to have some redundancy to be built around that service to be
high-available, so we can do a rolling update on the read-only index
reducing the risk of downtime.



On Mon, Dec 14, 2020 at 1:51 PM Uwe Schindler  wrote:


Hi,

as the writer of the original blog post, here are my comments:

Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post
to load everything into memory - but that does not guarantee anything!
Still, I would not recommend to use that function, because all it does is to
just touch every page of the file, so the linux kernel puts it into OS cache
- nothing more; IMHO very ineffective as it slows down opening the index for a
stupid for-each-page-touch-loop. It will do this with EVERY page, whether it is
later used or not! So this may take some time until it is done. Later on,
still Lucene needs to open index files, initialize its own data
structures,...

In general it is much better to open index, with MMAP directory and execute
some "sample" queries. This will do exactly the same like the preload
function, but it is more "selective". Parts of the index which are not used
won't be touched, and on top, it will also load ALL the required index
structures to heap.

As always and as mentioned in my blog post: there's nothing that can ensure
your index will stay in memory. Please trust the kernel to do the right
thing. Why do you care at all?

If you are curious and want to have everything in memory all the time:
- use tmpfs as your filesystem (of course you will lose data when OS shuts
down)
- disable swap and/or reduce swappiness
- use only as much heap as needed, keep everything of free memory for your
index outside heap.

Fake feelings of "everything in RAM" are misconceptions like:
- use RAMDirectory (deprecated): this may be a disaster, as described in
the blog post
- use ByteBuffersDirectory: a little bit better, but this brings nothing,
as
the operating system kernel may still page out your index pages. They still
live in/off heap and are part of usual paging. They are just no longer
backed by a file.

Lucene does most of the stuff outside heap, live with it!

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-Original Message-
From: baris.ka...@oracle.com 
Sent: Sunday, December 13, 2020 10:18 PM
To: java-user@lucene.apache.org
Cc: BARIS KAZAR 
Subject: MMapDirectory vs In Memory Lucene Index (i.e.,

ByteBuffersDirectory)

Hi,-

it would be nice to create a Lucene index in files and then effectively
load it into memory once (since i use in read-only mode). I am looking into
if this is doable in Lucene.

i wish there were an option to load the whole Lucene index into memory:

Both of below urls have links to the blog url where i quoted a very nice

section:

https://lucene.apache.org/core/8_5_0/core/org/apache/lucene/store/MMapDirectory.html
https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/MMapDirectory.html

The following blog mentions such an option
to run in memory: (see the underlined sentence below)

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?m=1

MMapDirectory will not load the whole index into physical memory. Why
should it do this? We just ask the operating system to map the file into
address space for easy access, by no means we are requesting more. Java and the
O/S optionally provide the option to try loading the whole file into RAM (if
enough is available), but Lucene does not use that option (we may add this
possibility in a later version).

My question is: is there such an option?
is the method setPreload for this purpose:
to load the whole Lucene index into memory?

I would like to use MMapDirectory and set my
JVM heap to 16G or a bit less (since my index is
around this much).

The Lucene 8.5.2 (8.5.0 as well) javadocs say:
public void setPreload(boolean preload)
Set to true to ask mapped pages to be loaded into physical memory on init.
The behavior is best-effort and operating system dependent.

For example Lucene 4.0.0 does not have the setPreload method.

https://urldefense.com/v3/__https://lucene.apache.org

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread baris . kazar

Thanks Uwe, i am not insisting on to load everything into memory

but loading into memory might speed up and i would like to see how much 
speedup.



but i have one more question and that is still not clear to me:

"it is much better to open index, with MMAP directory"


does this mean i should not use the constructor but instead use the open 
api?



in other words: which way should be preferred?

The example is from both during indexing and searching:


/*First way: Using constructor (without setPreload) :*/

MMapDirectory dir = new MMapDirectory(Paths.get(indexDir)); // Uses 
FSLockFactory.getDefault() and DEFAULT_MAX_CHUNK_SIZE which is 1GB

if (dir.getPreload() == false)
  dir.setPreload(Constants.PRELOAD_YES); // In-Memory Lucene Index 
enabled-> *commented out*

IndexReader reader = DirectoryReader.open(dir);

...


/*Second way: Or using open (without setPreload) :*/

*Directory* dir = MMapDirectory.open(Paths.get(indexDir)); //open is 
inherited from FSDirectory

if (dir.getPreload() == false)
  dir.setPreload(Constants.PRELOAD_YES); // In-Memory Lucene Index 
enabled-> *here setPreload cannot be used*

IndexReader reader = DirectoryReader.open(dir);
IndexSearcher is = new IndexSearcher(reader);

...


Best regards


On 12/14/20 1:51 PM, Uwe Schindler wrote:

Hi,

as the writer of the original blog post, here are my comments:

Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post
to load everything into memory - but that does not guarantee anything!
Still, I would not recommend to use that function, because all it does is to
just touch every page of the file, so the linux kernel puts it into OS cache
- nothing more; IMHO very ineffective as it slows down opening the index for a
stupid for-each-page-touch-loop. It will do this with EVERY page, whether it is
later used or not! So this may take some time until it is done. Later on,
still Lucene needs to open index files, initialize its own data
structures,...

In general it is much better to open index, with MMAP directory and execute
some "sample" queries. This will do exactly the same like the preload
function, but it is more "selective". Parts of the index which are not used
won't be touched, and on top, it will also load ALL the required index
structures to heap.

As always and as mentioned in my blog post: there's nothing that can ensure
your index will stay in memory. Please trust the kernel to do the right
thing. Why do you care at all?

If you are curious and want to have everything in memory all the time:
- use tmpfs as your filesystem (of course you will lose data when OS shuts
down)
- disable swap and/or reduce swappiness
- use only as much heap as needed, keep everything of free memory for your
index outside heap.

Fake feelings of "everything in RAM" are misconceptions like:
- use RAMDirectory (deprecated): this may be a disaster, as described in
the blog post
- use ByteBuffersDirectory: a little bit better, but this brings nothing, as
the operating system kernel may still page out your index pages. They still
live in/off heap and are part of usual paging. They are just no longer
backed by a file.

Lucene does most of the stuff outside heap, live with it!

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


-Original Message-
From: baris.ka...@oracle.com 
Sent: Sunday, December 13, 2020 10:18 PM
To: java-user@lucene.apache.org
Cc: BARIS KAZAR 
Subject: MMapDirectory vs In Memory Lucene Index (i.e.,

ByteBuffersDirectory)

Hi,-

it would be nice to create a Lucene index in files and then effectively
load it into memory once (since i use in read-only mode). I am looking into
if this is doable in Lucene.

i wish there were an option to load the whole Lucene index into memory:

Both of below urls have links to the blog url where i quoted a very nice

section:

https://lucene.apache.org/core/8_5_0/core/org/apache/lucene/store/MMapDirectory.html
https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/MMapDirectory.html

The following blog mentions such an option
to run in memory: (see the underlined sentence below)

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?m=1

MMapDirectory will not load the whole index into physical memory. Why
should it do this? We just ask the operating system to map the file into
address space for easy access, by no means we are requesting more. Java and the
O/S optionally provi

MMapDirectory usage during indexing and search

2020-12-14 Thread baris . kazar

Hi,-

 are there some examples on how to use MMapDirectory during indexing (i 
used the constructor to create it) and search?


what are the best practices?

should i repeat during search what i did during indexing for 
MMapDirectory i.e, use the constructor to create the MMapDirectory 
object by passing path?


or: should i use the open api of MMapDirectory (which is inherited from 
FSDirectory) during search?


Best regards


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread baris . kazar
Thanks Jigar, these are great notes, observations, and experiments to know
about, and they are very valuable,


i also plan to write a blog on this topic to help Lucene advance.

Best regards


On 12/14/20 12:44 PM, Jigar Shah wrote:

I used one of the Linux features (ramfs, basically mounting RAM as a
partition) to guarantee that it's always in RAM (no accidental paging ;)
cost too).

https://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux

WARN: Only use if it's a read-only index and can fit in ram and have a
back-up copy of that index on persistent disk somewhere. You may use any
directory implementation in Lucene. e.g
https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/store/SimpleFSDirectory.html

The search was amazingly quick as the full index was on ram mounted
directory.









On Mon, Dec 14, 2020 at 11:27 AM  wrote:


Thanks Mike, appreciate the reply and the suggestions very much.

And Your article link to concurrent search is amazing.

Together with in memory and concurrent index (especially in read only mode)

these will speed up Lucene queries very much.

Happy Holidays

Best regards


On 12/14/20 10:12 AM, Michael McCandless wrote:

Hello,

Yes, that is exactly what MMapDirectory.setPreload is trying to do, but not
promises (it is best effort).  I think it asks the OS to touch all pages in
the mapped region so they are cached in RAM, if you have enough RAM.

Make your JVM heap as low as possible to let the OS have more RAM to use to
load your index.

Mike McCandless



http://blog.mikemccandless.com


On Sun, Dec 13, 2020 at 4:18 PM  wrote:


Hi,-

it would be nice to create a Lucene index in files and then effectively
load it into memory once (since i use in read-only mode). I am looking into
if this is doable in Lucene.

i wish there were an option to load the whole Lucene index into memory:

Both of below urls have links to the blog url where i quoted a very nice
section:




https://lucene.apache.org/core/8_5_0/core/org/apache/lucene/store/MMapDirectory.html



https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/MMapDirectory.html

The following blog mentions such an option
to run in memory: (see the underlined sentence below)




https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?m=1

MMapDirectory will not load the whole index into physical memory. Why
should it do this? We just ask the operating system to map the file into
address space for easy access, by no means we are requesting more. Java and
the O/S optionally provide the option to try loading the whole file into
RAM (if enough is available), but Lucene does not use that option (we may
add this possibility in a later version).

My question is: is there such an option?
is the method setPreload for this purpose:
to load the whole Lucene index into memory?

I would like to use MMapDirectory and set my
JVM heap to 16G or a bit less (since my index is
around this much).

The Lucene 8.5.2 (8.5.0 as well) javadocs say:
public void setPreload(boolean preload)
Set to true to ask mapped pages to be loaded into physical memory on init.
The behavior is best-effort and operating system dependent.

For example Lucene 4.0.0 does not have the setPreload method.




https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/store/MMapDirectory.html

Happy Holidays
Best regards


Ps. i know there is also the ByteBuffersDirectory class for in memory Lucene
but this requires creating the Lucene index on the fly.

This is great for only such kind of Lucene indexes that can be created
quickly on the fly.

Ekaterina has a nice article on this ByteBuffersDirectory class:




https://medium.com/@ekaterinamihailova/in-memory-search-and-autocomplete-with-lucene-8-5-f2df1bc71c36



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional comm

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread baris . kazar

Thanks Mike, appreciate the reply and the suggestions very much.

And Your article link to concurrent search is amazing.

Together with in memory and concurrent index (especially in read only mode)

these will speed up Lucene queries very much.

Happy Holidays

Best regards


On 12/14/20 10:12 AM, Michael McCandless wrote:

Hello,

Yes, that is exactly what MMapDirectory.setPreload is trying to do, but not
promises (it is best effort).  I think it asks the OS to touch all pages in
the mapped region so they are cached in RAM, if you have enough RAM.

Make your JVM heap as low as possible to let the OS have more RAM to use to
load your index.

Mike McCandless

http://blog.mikemccandless.com


On Sun, Dec 13, 2020 at 4:18 PM  wrote:


Hi,-

it would be nice to create a Lucene index in files and then effectively
load it into memory once (since i use in read-only mode). I am looking into
if this is doable in Lucene.

i wish there were an option to load the whole Lucene index into memory:

Both of below urls have links to the blog url where i quoted a very nice
section:


https://lucene.apache.org/core/8_5_0/core/org/apache/lucene/store/MMapDirectory.html

https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/MMapDirectory.html

The following blog mentions such an option
to run in memory: (see the underlined sentence below)


https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?m=1

MMapDirectory will not load the whole index into physical memory. Why
should it do this? We just ask the operating system to map the file into
address space for easy access, by no means we are requesting more. Java and
the O/S optionally provide the option to try loading the whole file into
RAM (if enough is available), but Lucene does not use that option (we may
add this possibility in a later version).

My question is: is there such an option?
is the method setPreload for this purpose:
to load the whole Lucene index into memory?

I would like to use MMapDirectory and set my
JVM heap to 16G or a bit less (since my index is
around this much).

The Lucene 8.5.2 (8.5.0 as well) javadocs say:
public void setPreload(boolean preload)
Set to true to ask mapped pages to be loaded into physical memory on init.
The behavior is best-effort and operating system dependent.

For example Lucene 4.0.0 does not have the setPreload method.


https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/store/MMapDirectory.html

Happy Holidays
Best regards


Ps. i know there is also the ByteBuffersDirectory class for in memory Lucene
but this requires creating the Lucene index on the fly.

This is great for only such kind of Lucene indexes that can be created
quickly on the fly.

Ekaterina has a nice article on this BytesBuffersDirectory class:


https://urldefense.com/v3/__https://medium.com/@ekaterinamihailova/in-memory-search-and-autocomplete-with-lucene-8-5-f2df1bc71c36__;!!GqivPVa7Brio!LEQH8Tyb_BBN_Kc3fEH2w-yhpvS-VwMrpuB0gctqchp3j7L7V6x9piciHOIosJjRzQ$




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-13 Thread baris . kazar
Hi,-

It would be nice to create a Lucene index on disk and then load it
into memory once (since I use it in read-only mode). I am looking into
whether this is doable in Lucene.

I wish there were an option to load the whole Lucene index into memory.

Both of the URLs below link to the blog post from which I quote a very
relevant section:

https://lucene.apache.org/core/8_5_0/core/org/apache/lucene/store/MMapDirectory.html
https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/MMapDirectory.html

The following blog post mentions such an option to load the index into
memory (see the quoted passage below):

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?m=1

MMapDirectory will not load the whole index into physical memory. Why should it 
do this? We just ask the operating system to map the file into address space 
for easy access, by no means we are requesting more. Java and the O/S 
optionally provide the option to try loading the whole file into RAM (if enough 
is available), but Lucene does not use that option (we may add this possibility 
in a later version).

My question is: is there such an option?
Is the setPreload method for this purpose,
i.e., to load the whole Lucene index into memory?

I would like to use MMapDirectory and set my
JVM heap to 16G or a bit less (since my index is
around this much).

The Lucene 8.5.2 (8.5.0 as well) javadocs say:
public void setPreload(boolean preload)
Set to true to ask mapped pages to be loaded into physical memory on init. The 
behavior is best-effort and operating system dependent.

For example, Lucene 4.0.0 does not have the setPreload method:

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/store/MMapDirectory.html
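
For reference, here is a minimal sketch of what the setPreload usage described above looks like against the 8.5.x API (the index path is hypothetical; setPreload must be called before the index files are opened):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class PreloadSketch {
    public static void main(String[] args) throws Exception {
        // Map the on-disk index into virtual address space.
        MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
        // Best-effort request to touch all mapped pages up front so they are
        // resident; the O/S may still evict them under memory pressure.
        dir.setPreload(true);
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            System.out.println("live docs: " + reader.numDocs());
        }
    }
}
```

Note that the mapped pages live in the O/S page cache, not on the Java heap, so the JVM heap does not need to be sized to the index when using MMapDirectory.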

Happy Holidays
Best regards


P.S. I know there is also the ByteBuffersDirectory class for in-memory Lucene
indexes, but it requires creating the Lucene index on the fly.

This is great only for Lucene indexes that can be created quickly on the fly.

Ekaterina has a nice article on the ByteBuffersDirectory class:

https://medium.com/@ekaterinamihailova/in-memory-search-and-autocomplete-with-lucene-8-5-f2df1bc71c36
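
As a sketch of that on-the-fly, in-memory approach (class and method names per the Lucene 8.x Javadocs; the field name and sample text are made up):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

public class InMemoryIndexSketch {
    public static void main(String[] args) throws Exception {
        // Heap-resident directory: nothing is written to disk.
        ByteBuffersDirectory dir = new ByteBuffersDirectory();
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("body", "lucene in memory index", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            System.out.println("hits: "
                + searcher.count(new TermQuery(new Term("body", "lucene"))));
        }
    }
}
```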



Re: Fwd: org.apache.lucene.index.DirectoryReader Javadocs

2020-12-10 Thread baris . kazar

Thanks for the reply.


Sure, I should have included the URL, since its absence already caused confusion.

Here is the url:

https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/index/DirectoryReader.html

Please see:

    open(Directory directory, Map<String,String> readerAttributes)
Returns an IndexReader reading the index in the given Directory


readerAttributes - the reader attributes passed to the Codec layer of 
the directory reader. This attribute map is forwarded to all leaf 
readers as well as to the readers that are opened subsequently via the 
different flavors of openIfChanged(DirectoryReader)


However, a small example would be perfect for this parameter.
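
For illustration only, a call to that 8.5.x overload might look as follows; the attribute key is invented (the Javadocs do not document the supported keys), and `directory` is assumed to be an already-opened Directory:

```java
// Hypothetical sketch of DirectoryReader.open(Directory, Map<String,String>);
// the attribute key below is made up purely for illustration.
Map<String, String> readerAttributes = new HashMap<>();
readerAttributes.put("example.codec.attribute", "value");
try (DirectoryReader reader = DirectoryReader.open(directory, readerAttributes)) {
    System.out.println("maxDoc: " + reader.maxDoc());
}
```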


It looks like this API is gone in version 8.7.0, and that is why you did
not see it at the URL you sent for 8.7.0.


(I was not using this API; I was just trying to understand it.)




When an API is removed, it would be very helpful if the Javadocs pointed

to whatever new API is functionally closest to it.


Thanks



On 12/10/20 1:41 PM, Trejkaz wrote:


May i request to add more info into Lucene
org.apache.lucene.index.DirectoryReader about readOnly=true attribute and

more info on readerAttributes parameters please?

Referring to the current documentation:
https://javadoc.io/doc/org.apache.lucene/lucene-core/latest/org/apache/lucene/index/DirectoryReader.html

I see no such readerAttributes to which more information should be added.

Perhaps you should provide a URL to the documentation you are talking
about, so that people might know what you're going on about.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



org.apache.lucene.index.DirectoryReader Javadocs

2020-12-10 Thread baris . kazar

Hi,-

May I request that more info be added to the Lucene
org.apache.lucene.index.DirectoryReader Javadocs about the readOnly=true attribute,


and more info on the readerAttributes parameter, please?

I guess the default is read-only, right?

Best regards



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread baris . kazar
Great answer, thanks Michael.

Yes, the difference was too large: more than 1 GB.
Best regards

> On Nov 13, 2020, at 1:49 PM, Michael Sokolov  wrote:
> 
> You can't directly compare disk usage across two indexes, even with
> the same data. Try re-indexing one of your datasets, and you will see
> that the disk size is not the same. Mostly this is due to the way
> segments are merged varying with some randomness from one run to
> another, although the size of the difference you report is pretty
> large, it is not out of the question that could occur, especially if
> you have a large number of deletions or updates to existing documents.
> If you want to get a more accurate idea of the amount of space taken
> up by your index, you could try calling IndexWriter.forceMerge(1);
> this will merge your index to a single segment, eliminating waste. It
> is not generally recommended to do this for indexes you use for
> querying, but it can be a useful tool for analysis.
> 
>> On Fri, Nov 13, 2020 at 1:01 PM  wrote:
>> 
>> Nothing changed between two index generations except the data changed a
>> bit as i described.
>> 
>> When Lucene is done generating index, that is what i am reporting as the
>> size of the directory where all index files are stored.
>> 
>> I dont know about deleted docs? How do you trace that? yes the queries
>> run exactly the same way (same number of results) most of the time the
>> order is just changed which is fine; or some few different entries show
>> up and i dont know why since lowecase filter should normalize even if
>> original data casing changes.
>> 
>> Yes absolutely sure nothing else changed. i kept all those things the
>> same across two runs.
>> 
>> actually does lucene repository have these kinda experiments accross
>> versions (major or minor versions)?
>> 
>> if i were lucene i would do these experiments to see the impact on index
>> end results. this will help find out some potential un-indentified bugs.
>> 
>> Methodology:
>> 
>> have a large dataset like 15 million docs
>> 
>> run index at each time a new version comes out with very common settings.
>> 
>> 
>> i am not using solr, pure lucene 7.7.2. these info were in the other
>> email here. let me copy paste here:
>> 
>> 
>> 
>> = previous email 
>> 
>> On a related issue:
>> 
>> i experience that with Version 7.7.2 i experienced this:
>> 
>> data is all lower case (same amount of docs as next case though)
>> 
>> vs
>> 
>> data is camel case except last word always in capital letters
>> 
>> 
>> but i used in indexer the lowercase filter in both cases so indexing is
>> done with all lower cases and i saw the first case's index size for case
>> is like 9.5GB
>> 
>> but same data size for second case was 11GB.
>> 
>> 
>> what causes such difference and increase in index size? amount of docs
>> are the same in both cases.
>> 
>> 
>> Best regards
>> 
>> 
>> 
>>> On 11/13/20 7:39 AM, Erick Erickson wrote:
>>> What does “final finished sizes” mean? After optimize of just after 
>>> finishing all indexing?
>>> The former is what counts here.
>>> 
>>> And you provided no information on the number of deleted docs in the two 
>>> cases. Is
>>> the number of deletedDocs the same (or close)? And does the q=*:* query
>>> return the same numFound?
>>> 
>>> Finally, are you absolutely and totally sure that no other options changed. 
>>> For instance,
>>> you specified docValues=true for some field in one but not the other. Or 
>>> stored=true
>>> etc. If you’re using the same schema.
>>> 
>>> And you also haven’t provided information on what versions of Solr you’re 
>>> talking about.
>>> You mention 7.7.2, but not the _other_ version of solr. If you’re going 
>>> from one major
>>> version to another, sometimes defaults change for docValues on primitive 
>>> fields
>>> especially. I’d consider firing up Luke and examining the field definitions 
>>> in
>>> detail.
>>> 
>>> Best,
>>> Erick
>>> 
 On Nov 13, 2020, at 12:16 AM, baris.ka...@oracle.com wrote:
 
 Hi,-
 Thanks.
 These are final finished sizes in both cases.
 Best regards
 
 
> On Nov 12, 2020, at 11:12 PM, Erick Erickson  
> wrote:
> 
> Yes, that issue is fixed. The “Resolution” tag is the key, it’s marked 
> “fixed” and the version is 8.0
> 
> As for your other question, index size is a very imprecise number. How 
> many deleted documents are there
> in each case? Deleted documents take up disk space until the segments 
> containing them are merged away.
> 
> Best,
> Erick
> 
>> On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:
>> 
>> https://issues.apache.org/jira/browse/LUCENE-8448
>> 
>> 
>> Hi,-
>> 
>> is this issue fixed please? Could You please help me figure it out?
>> 
>> Best regards
>> 
>> 
>> 
>> ---

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread baris . kazar
Nothing changed between the two index generations except that the data
changed a bit, as I described.


When Lucene is done generating the index, I report the size of the
directory where all the index files are stored.


I don't know about deleted docs; how do you trace that? Yes, the queries
run exactly the same way (same number of results). Most of the time only
the order changes, which is fine; occasionally a few different entries show
up, and I don't know why, since the lowercase filter should normalize terms
even if the casing of the original data changes.
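
On tracing deleted docs: the reader exposes the counts directly, so a sketch like the following (the index path is hypothetical) reports them per index. Comparing numDeletedDocs between the two indexes answers the deleted-docs question, and calling forceMerge(1) on a throwaway copy, as Michael suggested, then yields comparable on-disk sizes.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class DeletedDocsReport {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/path/to/index")))) {
            // maxDoc counts live docs plus deleted docs not yet merged away.
            System.out.println("live docs:    " + reader.numDocs());
            System.out.println("deleted docs: " + reader.numDeletedDocs());
            System.out.println("maxDoc:       " + reader.maxDoc());
        }
    }
}
```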


Yes, I am absolutely sure nothing else changed; I kept all those settings
the same across the two runs.


Actually, does the Lucene project run these kinds of experiments across
versions (major or minor)?


If I were the Lucene project, I would run these experiments to see the impact
on the resulting index; this would help uncover potential unidentified bugs.


Methodology:

take a large dataset, e.g., 15 million docs

rebuild the index with very common settings each time a new version comes out.


I am not using Solr, just pure Lucene 7.7.2. This info was in the other
email here; let me copy-paste it:




= previous email 

On a related issue:

With version 7.7.2 I experienced this:

data is all lowercase (same number of docs as in the next case, though)

vs

data is camel case, except the last word is always in capital letters


But I used the lowercase filter in the indexer in both cases, so indexing is
done entirely in lowercase. The first case's index size was about 9.5 GB,

but the same data in the second case produced an 11 GB index.


What causes such a difference and increase in index size? The number of
docs is the same in both cases.
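
For reference, a lowercasing analysis chain of the kind described is typically wired up like this (a minimal sketch against the 7.x Analyzer API; the actual analyzer in use may differ):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Tokenize, then lowercase every term, so "Main STREET" and
// "main street" produce identical terms in the index.
public class LowercasingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}
```

With such a chain the indexed terms should indeed be identical for both datasets; any size difference would then come from segment and merge layout or deletions rather than from the terms themselves.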



Best regards



On 11/13/20 7:39 AM, Erick Erickson wrote:

What does “final finished sizes” mean? After optimize or just after finishing
all indexing?
The former is what counts here.

And you provided no information on the number of deleted docs in the two cases. 
Is
the number of deletedDocs the same (or close)? And does the q=*:* query
return the same numFound?

Finally, are you absolutely and totally sure that no other options changed? For 
instance,
you specified docValues=true for some field in one but not the other. Or 
stored=true
etc. If you’re using the same schema.

And you also haven’t provided information on what versions of Solr you’re 
talking about.
You mention 7.7.2, but not the _other_ version of solr. If you’re going from 
one major
version to another, sometimes defaults change for docValues on primitive fields
especially. I’d consider firing up Luke and examining the field definitions in
detail.

Best,
Erick


On Nov 13, 2020, at 12:16 AM, baris.ka...@oracle.com wrote:

Hi,-
Thanks.
These are final finished sizes in both cases.
Best regards



On Nov 12, 2020, at 11:12 PM, Erick Erickson  wrote:

Yes, that issue is fixed. The “Resolution” tag is the key, it’s marked “fixed” 
and the version is 8.0

As for your other question, index size is a very imprecise number. How many 
deleted documents are there
in each case? Deleted documents take up disk space until the segments 
containing them are merged away.

Best,
Erick


On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:

https://issues.apache.org/jira/browse/LUCENE-8448


Hi,-

is this issue fixed please? Could You please help me figure it out?

Best regards



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org






Re: Which Lucene 8.5.X is recommended?

2020-11-12 Thread baris . kazar
Thanks, I will use 8.5.2.

I think I saw a minor release switch on (z) without any issues, but I will
double-check this.

However, I will use 8.5.2, since the bug fixes in that release may result in
better performance for the Lucene index.

Best regards

> On Nov 12, 2020, at 11:09 PM, Erick Erickson  wrote:
> 
> Always use the most recent point release. The only time we go from x.y.z to 
> x.y.z+1 is if there are _significant_ problems. This is much different than 
> going from x.y to x.y+1...
> 
>> On Nov 12, 2020, at 5:49 PM, baris.ka...@oracle.com wrote:
>> 
>> Hi,-
>> 
>> is it best to use 8.5.2?
>> 
>> Best regards
>> 
>> 
>> 
>> Release 8.5.2
>> Bug Fixes   (1)
>> LUCENE-9350: Partial reversion of LUCENE-9068; holding levenshtein automata 
>> on FuzzyQuery can end up blowing up query caches which use query objects as 
>> cache keys, so building the automata is now delayed to search time again.
>> (Alan Woodward, Mike Drob)
>> 
>> 
>> Release 8.5.1 [2020-04-16]
>> Bug Fixes   (1)
>> LUCENE-9300: Fix corruption of the new gen field infos when doc values 
>> updates are applied on a segment created externally and added to the index 
>> with IndexWriter#addIndexes(Directory).
>> (Jim Ferenczi, Adrien Grand)
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread baris . kazar
Hi,-
Thanks.
These are final finished sizes in both cases.
Best regards


> On Nov 12, 2020, at 11:12 PM, Erick Erickson  wrote:
> 
> Yes, that issue is fixed. The “Resolution” tag is the key, it’s marked 
> “fixed” and the version is 8.0
> 
> As for your other question, index size is a very imprecise number. How many 
> deleted documents are there
> in each case? Deleted documents take up disk space until the segments 
> containing them are merged away.
> 
> Best,
> Erick
> 
>> On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:
>> 
>> https://issues.apache.org/jira/browse/LUCENE-8448
>>  
>> 
>> 
>> Hi,-
>> 
>> is this issue fixed please? Could You please help me figure it out?
>> 
>> Best regards
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Which Lucene 8.5.X is recommended?

2020-11-12 Thread baris . kazar

Hi,-

 is it best to use 8.5.2?

Best regards



Release 8.5.2
Bug Fixes   (1)
LUCENE-9350: Partial reversion of LUCENE-9068; holding levenshtein 
automata on FuzzyQuery can end up blowing up query caches which use 
query objects as cache keys, so building the automata is now delayed to 
search time again.

(Alan Woodward, Mike Drob)


Release 8.5.1 [2020-04-16]
Bug Fixes   (1)
LUCENE-9300: Fix corruption of the new gen field infos when doc values 
updates are applied on a segment created externally and added to the 
index with IndexWriter#addIndexes(Directory).

(Jim Ferenczi, Adrien Grand)


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread baris . kazar

On a related issue:

With version 7.7.2 I experienced this:

data is all lowercase (same number of docs as in the next case, though)

vs

data is camel case, except the last word is always in capital letters


But I used the lowercase filter in the indexer in both cases, so indexing is
done entirely in lowercase. The first case's index size was about 9.5 GB,

but the same data in the second case produced an 11 GB index.


What causes such a difference and increase in index size? The number of
docs is the same in both cases.



Best regards


On 11/12/20 5:35 PM, baris.ka...@oracle.com wrote:
https://issues.apache.org/jira/browse/LUCENE-8448



Hi,-

 Is this issue fixed, please? Could you help me figure it out?

Best regards



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread baris . kazar

https://issues.apache.org/jira/browse/LUCENE-8448


Hi,-

 Is this issue fixed, please? Could you help me figure it out?

Best regards



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Links to classes missing for BMW

2020-10-12 Thread baris . kazar
Hi Adrien,-
Great, thanks.
Best regards

> On Oct 12, 2020, at 1:13 PM, Adrien Grand  wrote:
> 
> It's not the most visible place, but the paper is referenced in the source
> code of the class that implements BM WAND
> https://github.com/apache/lucene-solr/blob/907d1142fa435451b40c072f1d445ee868044b15/lucene/core/src/java/org/apache/lucene/search/WANDScorer.java#L29-L44
> .
> 
>> On Mon, Oct 12, 2020 at 6:34 PM  wrote:
>> 
>> Hi Uwe,-
>> 
>>  i see, thanks for the info, i wish the documentation mentions this new
>> algorithm by referencing the papers (i have the papers).
>> 
>> Best regards
>> 
>> 
>>> On 10/12/20 12:27 PM, Uwe Schindler wrote:
>>> There's not much new documentation, it works behind scenes, except that
>> IndexSearcher.search and TopDocs class no longer return an absolute count
>> for totalHits and instead this class:
>> https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/TotalHits.html
>>> 
>>> Uwe
>>> 
>>> Am October 12, 2020 4:22:43 PM UTC schrieb baris.ka...@oracle.com:
 Hi Uwe,-
 
  Could You please point me to the class documentation please?
 
 Best regards
 
 
 On 10/12/20 12:16 PM, Uwe Schindler wrote:
> BMW support is in Lucene since version 8.0.
> 
> Uwe
> 
> Am October 12, 2020 4:08:42 PM UTC schrieb baris.ka...@oracle.com:
> 
> Hi,-
> 
>Is BMW (Block Max Wand) support only for Solr?
> 
> 
>> https://lucene.apache.org/solr/guide/8_6/solr-upgrade-notes.html
> This pages says "also" so it implies support for Lucene, too,
 right?
> Best regards
> 
 
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> 
> https://www.thetaphi.de
 
>>> --
>>> Uwe Schindler
>>> Achterdiek 19, 28357 Bremen
>>> 
>> https://www.thetaphi.de
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 
> 
> -- 
> Adrien


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Links to classes missing for BMW

2020-10-12 Thread baris . kazar

Hi Uwe,-

 I see, thanks for the info. I wish the documentation mentioned this new
algorithm by referencing the papers (I have the papers).


Best regards


On 10/12/20 12:27 PM, Uwe Schindler wrote:

There's not much new documentation, it works behind scenes, except that 
IndexSearcher.search and TopDocs class no longer return an absolute count for 
totalHits and instead this class: 
https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/TotalHits.html
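
That API change means callers now inspect TotalHits rather than assume an exact count; a minimal sketch (assumes an already-constructed IndexSearcher):

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

public final class TotalHitsSketch {
    static void printCount(IndexSearcher searcher) throws IOException {
        TopDocs top = searcher.search(new MatchAllDocsQuery(), 10);
        if (top.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
            System.out.println("exactly " + top.totalHits.value + " hits");
        } else { // GREATER_THAN_OR_EQUAL_TO: counting stopped early
            System.out.println("at least " + top.totalHits.value + " hits");
        }
    }
}
```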

Uwe

Am October 12, 2020 4:22:43 PM UTC schrieb baris.ka...@oracle.com:

Hi Uwe,-

  Could You please point me to the class documentation please?

Best regards


On 10/12/20 12:16 PM, Uwe Schindler wrote:

BMW support is in Lucene since version 8.0.

Uwe

Am October 12, 2020 4:08:42 PM UTC schrieb baris.ka...@oracle.com:

 Hi,-

    Is BMW (Block Max Wand) support only for Solr?

 
https://lucene.apache.org/solr/guide/8_6/solr-upgrade-notes.html



 This page says "also", so it implies support for Lucene, too,

right?

 Best regards




 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de





--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Links to classes missing for BMW

2020-10-12 Thread baris . kazar

Hi Uwe,-

 Could You please point me to the class documentation please?

Best regards


On 10/12/20 12:16 PM, Uwe Schindler wrote:

BMW support is in Lucene since version 8.0.

Uwe

Am October 12, 2020 4:08:42 PM UTC schrieb baris.ka...@oracle.com:

Hi,-

   Is BMW (Block Max Wand) support only for Solr?

https://lucene.apache.org/solr/guide/8_6/solr-upgrade-notes.html  


This page says "also", so it implies support for Lucene, too, right?

Best regards

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de 
 




Links to classes missing for BMW

2020-10-12 Thread baris . kazar

Hi,-

 Is BMW (Block Max Wand) support only for Solr?

https://lucene.apache.org/solr/guide/8_6/solr-upgrade-notes.html

This page says "also", so it implies support for Lucene, too, right?

Best regards



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [VOTE] Lucene logo contest, here we go again

2020-09-01 Thread baris . kazar

Hi,-

 I am afraid this vote should be redone, as all of the C variants are the same.

I hope I am not doing something wrong: when I download each C candidate,
I see the same thing.


I wonder why nobody said anything. Am I doing something wrong here?

Best regards


On 8/31/20 8:26 PM, Ryan Ernst wrote:

Dear Lucene and Solr developers!

In February a contest was started to design a new logo for Lucene
[jira-issue]. The initial attempt [first-vote] to call a vote resulted in
some confusion on the rules, as well the request for one additional
submission. I would like to call a new vote, now with more explicit
instructions on how to vote.

*Please read the following rules carefully* before submitting your vote.

*Who can vote?*

Anyone is welcome to cast a vote in support of their favorite
submission(s). Note that only PMC member's votes are binding. If you are a
PMC member, please indicate with your vote that the vote is binding, to
ease collection of votes. In tallying the votes, I will attempt to verify
only those marked as binding.


*How do I vote?*
Votes can be cast simply by replying to this email. It is a ranked-choice
vote [rank-choice-voting]. Multiple selections may be made, where the order
of preference must be specified. If an entry gets more than half the votes,
it is the winner. Otherwise, the entry with the lowest number of votes is
removed, and the votes are retallied, taking into account the next
preferred entry for those whose first entry was removed. This process
repeats until there is a winner.

The entries are broken up by variants, since some entries have multiple
color or style variations. The entry identifiers are first a capital
letter, followed by a variation id (described with each entry below), if
applicable. As an example, if you prefer variant 1 of entry A, followed by
variant 2 of entry A, variant 3 of entry C, entry D, and lastly variant 4e
of entry B, the following should be in your reply:

(binding)
vote: A1, A2, C3, D, B4e
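
The instant-runoff procedure described above can be sketched in code; this is an illustrative tally only, not the script used for the actual count:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Instant-runoff tally: repeatedly drop the entry with the fewest
// first-choice votes and re-tally until one entry holds a strict majority.
public class RankedChoice {

    public static String winner(List<List<String>> ballots) {
        List<List<String>> remaining = new ArrayList<>();
        for (List<String> b : ballots) remaining.add(new ArrayList<>(b));

        while (true) {
            Map<String, Integer> tally = new HashMap<>();
            int active = 0;
            for (List<String> b : remaining) {
                if (b.isEmpty()) continue; // ballot exhausted
                active++;
                tally.merge(b.get(0), 1, Integer::sum);
            }
            if (tally.isEmpty()) return null; // all ballots exhausted
            // A strict majority of the still-active ballots wins.
            for (Map.Entry<String, Integer> e : tally.entrySet()) {
                if (e.getValue() * 2 > active) return e.getKey();
            }
            // Otherwise eliminate the entry with the fewest first-choice votes.
            String loser = null;
            for (Map.Entry<String, Integer> e : tally.entrySet()) {
                if (loser == null || e.getValue() < tally.get(loser)) loser = e.getKey();
            }
            for (List<String> b : remaining) b.remove(loser);
        }
    }

    public static void main(String[] args) {
        List<List<String>> ballots = List.of(
            List.of("A1", "A2", "C3"),
            List.of("B4e", "A1"),
            List.of("A1", "D"),
            List.of("D", "B4e"),
            List.of("B4e", "D"));
        // D has the fewest first choices and is eliminated; its ballot
        // transfers to B4e, which then holds 3 of 5 first choices.
        System.out.println(winner(ballots)); // prints B4e
    }
}
```

Ties at the elimination step are broken arbitrarily here; a real tally would need an explicit tie-break rule.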

*Entries*

The entries are as follows:

A*.* Submitted by Dustin Haver. This entry has two variants, A1 and A2.

[A1]
https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
[A2]
https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png

B. Submitted by Stamatis Zampetakis. This has several variants. Within the
linked entry there are 7 patterns and 7 color palettes. Any vote for B
should contain the pattern number followed by the lowercase letter of the
color palette. For example, B3e or B1a.

[B]
https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf

C. Submitted by Baris Kazar. This entry has 8 variants.

[C1]
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf
[C2]
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo2_full.pdf
[C3]
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo3_full.pdf
[C4]
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo4_full.pdf
[C5]
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo5_full.pdf
[C6]
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo6_full.pdf
[C7]
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo7_full.pdf
[C8]
https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo8_full.pdf

D. The current Lucene logo.

[D] 
https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png

Please vote for one of the above choices. This vote will close one week
from today, Mon, Sept 7, 2020 at 11:59PM.

Thanks!


Re: [VOTE] Lucene logo contest

2020-06-18 Thread baris . kazar

Hi Ryan,-

 I very much appreciate this opportunity to submit my designs.

Best regards


On 6/18/20 1:29 AM, Ryan Ernst wrote:

> IMHO this vote is invalid because...
> it doesn’t include the red / orange variants submitted by Dustin Haver

I considered the latest submission by Dustin Haver to be his 
submission, but I can see how some might like the other better and it 
should have been part of the vote.


> I propose to restart the VOTE to include all submissions.

Given that I omitted the submission above, that seems reasonable. And 
since we are restarting, I guess we can allow Baris to add in an entry.


Baris, please add your entry to the jira issue. I will restart the 
vote next week.


> If we're going to have more options, I suggest we use "ranked voting"

I considered rank voting, but tallying a rank vote by hand can be 
incredibly tedious. I don't think we should use any external tools 
since that prohibits verification on who is voting from the PMC. 
However, given the lastingness of this decision, I guess it is fair to 
do the necessary harder tallying work of rank choice voting over 
email. When I restart the vote, I will give instructions on making 
multiple selections.


So, consider this vote CLOSED and VOID.



On Wed, Jun 17, 2020 at 8:27 AM David Smiley wrote:


If we're going to have more options, I suggest we use "ranked
voting": https://en.wikipedia.org/wiki/Ranked_voting



If you create a Google Form based submission which supports a
ranked choice input, then this should make it probably not hard to
tally the results correctly. A PMC boolean would be helpful too.

~ David


On Wed, Jun 17, 2020 at 11:14 AM Andrzej Białecki <a...@getopt.org> wrote:

IMHO this vote is invalid because it doesn’t include all
submissions linked to that issue. Specifically, it doesn’t
include the red / orange variants submitted by Dustin Haver
(which I personally prefer over the sickly green ones … ;) )

I propose to restart the VOTE to include all submissions.


On 17 Jun 2020, at 17:04, Adrien Grand <jpou...@gmail.com> wrote:

A. (PMC) I like that it retains the same idea as our current
logo with a more modern look.

On Wed, Jun 17, 2020 at 4:58 PM Andi Vajda <o...@ovaltofu.org> wrote:


C. (current logo)

Andi.. (pmc)


On Jun 15, 2020, at 15:08, Ryan Ernst <r...@iernst.net> wrote:


Dear Lucene and Solr developers!

In February a contest was started to design a new logo
for Lucene [1]. That contest concluded, and I am now
(admittedly a little late!) calling a vote.

The entries are labeled as follows:

A. Submitted by Dustin Haver [2]

B. Submitted by Stamatis Zampetakis [3] Note that this
has several variants. Within the linked entry there are
7 patterns and 7 color palettes. Any vote for B should
contain the pattern number, like B1 or B3. If a B
variant wins, we will have a followup vote on the color
palette.

C. The current Lucene logo [4]

Please vote for one of the three (or nine depending on
your perspective!) above choices. Note that anyone in
the Lucene+Solr community is invited to express their
opinion, though only Lucene+Solr PMC cast binding votes
(indicate non-binding votes in your reply, please). This
vote will close one week from today, Mon, June 22, 2020.

Thanks!

[1] https://issues.apache.org/jira/browse/LUCENE-9221


[2]

https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png


[3]

https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf


[4]

https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png



Re: [VOTE] Lucene logo contest

2020-06-18 Thread baris . kazar

Hi Ryan,-

 That sounds awesome, i found my designs and i am so excited. Even if i
don't win, it is amazing to submit designs.


Best regards



On 6/18/20 1:32 AM, Ryan Ernst wrote:

Hi Baris,

Please see my latest reply on this thread. We will be restarting the 
vote next week, so you can submit your entry if you would like it to 
be included.


You can submit the entry as an attachment, with a comment, on 
https://issues.apache.org/jira/browse/LUCENE-9221.
I'd like to restart it a week from now; please let me know if you can 
add your entry by then.


Ryan

On Mon, Jun 15, 2020 at 4:51 PM Ryan Ernst wrote:


Hi Baris,

Unfortunately I think it is already too late. The Jira issue
linked in the original email was where submissions should have
gone. I took your original reply as an indication you had ideas
and would submit them. Opening it up now to additional entries
could be a very slippery slope, which I think could confuse voters.

Sorry about the confusion.

Ryan

On Mon, Jun 15, 2020 at 4:24 PM <baris.ka...@oracle.com> wrote:

Ryan,-

  Since i did not get any replies back, i did not send anything.

Can i still send? But i need to find where they are in my files :)

Best regards


On 6/15/20 6:08 PM, Ryan Ernst wrote:
> Dear Lucene and Solr developers!
>
> In February a contest was started to design a new logo for
Lucene [1]. That
> contest concluded, and I am now (admittedly a little late!)
calling a vote.
>
> The entries are labeled as follows:
>
> A. Submitted by Dustin Haver [2]
>
> B. Submitted by Stamatis Zampetakis [3] Note that this has
several
> variants. Within the linked entry there are 7 patterns and 7
color
> palettes. Any vote for B should contain the pattern number,
like B1 or B3.
> If a B variant wins, we will have a followup vote on the
color palette.
>
> C. The current Lucene logo [4]
>
> Please vote for one of the three (or nine depending on your
perspective!)
> above choices. Note that anyone in the Lucene+Solr community
is invited to
> express their opinion, though only Lucene+Solr PMC cast
binding votes
> (indicate non-binding votes in your reply, please). This
vote will close
> one week from today, Mon, June 22, 2020.
>
> Thanks!
>
> [1]

> https://issues.apache.org/jira/browse/LUCENE-9221
> [2]
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> [3]
> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
> [4]
> https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>



Re: [VOTE] Lucene logo contest

2020-06-16 Thread baris . kazar
Hello,-
i would like to just say that i produced
3 more designs last Feb but forgot to
submit them.

I will need to look for where they are in
my office. I drew them on my post-its and
lot of folks liked them.

Can there be some extension to this voting process please? I know this might
confuse folks voting but i just wanted to share those
designs before the final decision is made.

Best regards

>> On 6/15/20 6:08 PM, Ryan Ernst wrote:
>> Dear Lucene and Solr developers!
>> 
>> In February a contest was started to design a new logo for Lucene [1]. That
>> contest concluded, and I am now (admittedly a little late!) calling a vote.
>> 
>> The entries are labeled as follows:
>> 
>> A. Submitted by Dustin Haver [2]
>> 
>> B. Submitted by Stamatis Zampetakis [3] Note that this has several
>> variants. Within the linked entry there are 7 patterns and 7 color
>> palettes. Any vote for B should contain the pattern number, like B1 or B3.
>> If a B variant wins, we will have a followup vote on the color palette.
>> 
>> C. The current Lucene logo [4]
>> 
>> Please vote for one of the three (or nine depending on your perspective!)
>> above choices. Note that anyone in the Lucene+Solr community is invited to
>> express their opinion, though only Lucene+Solr PMC cast binding votes
>> (indicate non-binding votes in your reply, please). This vote will close
>> one week from today, Mon, June 22, 2020.
>> 
>> Thanks!
>> 
>> [1] 
>> https://issues.apache.org/jira/browse/LUCENE-9221
>> [2]
>> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
>> [3]
>> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
>> [4]
>> https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to tell Lucene index search to stop when it takes too long

2020-02-28 Thread baris . kazar

I have one more question on this, should i use Thread to use this class?

The snippet did not have that.

Best regards


On 2/28/20 11:07 AM, baris.ka...@oracle.com wrote:

Thanks Mikhail. I missed that constructor's first parameter.

Best regards


On 2/28/20 12:53 AM, Mikhail Khludnev wrote:

Pass TopDocsCollector as the first arg into TimeLimitingCollector.

On Thu, Feb 27, 2020 at 2:31 PM  wrote:


Hi,-

Sometimes the search takes too long even with PhraseWildcardQuery, so i
would like to limit the search time via TimeLimitingCollector API.


Thanks to Mikhail and this Forum to inform me about this API.


i checked this IndexSearcher API with Collector parameter but that API
does not have top n results as parameter:


https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/IndexSearcher.html




and the TimeLimitingCollector API:


https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/TimeLimitingCollector.html




Is there a simple example where both IndexSearcher and
TimeLimitingCollector APIs are combined?

I found this snippet and i would like to add the missing pieces here:


Counter clock = ...;
long baseline = clock.get();
// ... prepare search -> i think this means create the query here
TimeLimitingCollector collector = new TimeLimitingCollector(c, clock,
numTicks);
collector.setBaseline(baseline);
indexSearcher.search(query, collector);

// But i also would like to do the following: and how do i get results
from TimeLimitingCollector in that so much allowed time?

int totalHits = collector.totalHits;

TopDocs topdocs = collector.topdocs(); // But i cant specify Top n docs
here, right?


The collector is defined here


https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/Collector.html




https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/TopDocsCollector.html



Best regards



On 2/25/20 1:50 AM, baris.ka...@oracle.com wrote:

Will do, Thanks


On Feb 25, 2020, at 1:34 AM, Mikhail Khludnev  
wrote:


Hello.

Meet org.apache.lucene.search.TimeLimitingCollector.


On Mon, Feb 24, 2020 at 2:51 PM  wrote:

Hi,-

I hope everyone is doing great.


i am trying to find an api to tell Lucene Index Searcher to stop 
after

0.5 seconds (when it takes longer than this).

Is there such an api or plan to implement one?


Best regards






--
Sincerely yours
Mikhail Khludnev








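Putting the answers in this thread together, here is a minimal sketch of combining IndexSearcher with TimeLimitingCollector against the 8.4.x API. The class name, the index path, the "name" field, and the concrete limits are illustrative assumptions, not part of the original thread; the "top n" comes from the wrapped TopScoreDocCollector, which answers the question above.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Counter;

public class TimeLimitedSearch {
  public static void main(String[] args) throws Exception {
    // Hypothetical index location and field name.
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Query query = new TermQuery(new Term("name", "lucene"));

      // The wrapped collector decides *what* is collected (top 10 by score);
      // TimeLimitingCollector only decides *how long* collection may run.
      TopScoreDocCollector topDocsCollector = TopScoreDocCollector.create(10, 1000);
      Counter clock = TimeLimitingCollector.getGlobalCounter(); // ticks in ms
      TimeLimitingCollector collector =
          new TimeLimitingCollector(topDocsCollector, clock, 500); // ~500 ms budget
      collector.setBaseline();

      try {
        searcher.search(query, collector);
      } catch (TimeLimitingCollector.TimeExceededException e) {
        // Time ran out; the hits collected so far are still available below.
      }
      TopDocs topDocs = topDocsCollector.topDocs(); // top n from the wrapped collector
      System.out.println("hits: " + topDocs.totalHits);
    }
  }
}
```

No extra thread is needed in user code: the global counter is advanced by a daemon timer thread inside TimeLimitingCollector.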



Re: What is the Lucene 8.4.1 equivalent for StandardAnalyzer.STOP_WORDS_SET

2020-02-28 Thread baris . kazar

Thanks Michael.

Best regards


On 2/24/20 7:18 PM, Michael Froh wrote:
Those words (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.1/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardAnalyzer.java#L44-L49)
have been moved to EnglishAnalyzer (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.4.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L47-L51).


On Mon, 24 Feb 2020 at 15:56,  wrote:


Hi,-

  I hope everyone is doing great.

What is the Lucene 8.4.1 equivalent for
StandardAnalyzer.STOP_WORDS_SET?


https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html#STOP_WORDS_SET




https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html



Best regards

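For reference, a short sketch of the migration described above (assuming lucene-core and lucene-analyzers-common 8.4.1 on the classpath):

```java
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordsMigration {
  public static void main(String[] args) {
    // 7.x: StandardAnalyzer.STOP_WORDS_SET (removed in 8.x).
    // 8.x: the same English stop word list lives on EnglishAnalyzer.
    CharArraySet stopWords = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
    System.out.println(stopWords.contains("the")); // classic English stop word

    // StandardAnalyzer itself now defaults to an empty stop word set;
    // pass the list explicitly to get the old 7.x behavior back.
    StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);
    analyzer.close();
  }
}
```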




maxMultiTermExpansions parameter of PhraseWildcardQuery class in Lucene 8.4.1 Sandbox

2020-02-28 Thread baris . kazar

Hi,-

i hope everyone is doing great.

i set this parameter to Integer.MAX_VALUE and it is mostly working; only
once did it run into a memory issue.


However, how will reducing this parameter affect the search time and the
quality of search results?


Has anybody done such an experiment?

The explanation for this parameter is as follows:

maxMultiTermExpansions - The maximum number of expansions across all 
multi-terms and across all segments.


It counts expansions for each segments individually, that allows 
optimizations per segment and unused expansions are credited to next 
segments.


This is different from MultiPhraseQuery and SpanMultiTermQueryWrapper 
which have an expansion limit per multi-term.


https://lucene.apache.org/core/8_4_1/sandbox/index.html

It would be awesome to have a toy example here, too.
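A toy sketch of the sandbox class discussed in this thread (the "name" field, the terms, and the expansion cap of 1000 are made-up values; the Builder API is the one published in the 8.4 sandbox, so treat this as an assumption to check against the javadoc):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseWildcardQuery; // lucene-sandbox 8.4.x
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.BytesRef;

public class PhraseWildcardExample {
  public static Query buildQuery() {
    // Phrase "new yo*": exact first token, wildcard last token.
    // 1000 caps the total multi-term expansions across all segments.
    return new PhraseWildcardQuery.Builder("name", 1000)
        .addTerm(new BytesRef("new"))                             // exact term
        .addMultiTerm(new WildcardQuery(new Term("name", "yo*"))) // wildcard term
        .build();
  }
}
```

The resulting Query is passed to IndexSearcher.search like any other query; the per-segment expansion crediting described above happens internally.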

Best regards

baris





Re: How to tell Lucene index search to stop when it takes too long

2020-02-28 Thread baris . kazar

Thanks Mikhail. I missed that constructor's first parameter.

Best regards


On 2/28/20 12:53 AM, Mikhail Khludnev wrote:

Pass TopDocsCollector as the first arg into TimeLimitingCollector.

On Thu, Feb 27, 2020 at 2:31 PM  wrote:


Hi,-

Sometimes the search takes too long even with PhraseWildcardQuery, so i
would like to limit the search time via TimeLimitingCollector API.


Thanks to Mikhail and this Forum to inform me about this API.


i checked this IndexSearcher API with Collector parameter but that API
does not have top n results as parameter:


https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/IndexSearcher.html


and the TimeLimitingCollector API:


https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/TimeLimitingCollector.html


Is there a simple example where both IndexSearcher and
TimeLimitingCollector APIs are combined?

I found this snippet and i would like to add the missing pieces here:


Counter clock = ...;
long baseline = clock.get();
// ... prepare search -> i think this means create the query here
TimeLimitingCollector collector = new TimeLimitingCollector(c, clock,
numTicks);
collector.setBaseline(baseline);
indexSearcher.search(query, collector);

// But i also would like to do the following: and how do i get results
from TimeLimitingCollector in that so much allowed time?

int totalHits = collector.totalHits;

TopDocs topdocs = collector.topdocs(); // But i cant specify Top n docs
here, right?


The collector is defined here


https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/Collector.html


https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/TopDocsCollector.html

Best regards



On 2/25/20 1:50 AM, baris.ka...@oracle.com wrote:

Will do, Thanks



On Feb 25, 2020, at 1:34 AM, Mikhail Khludnev  wrote:

Hello.

Meet org.apache.lucene.search.TimeLimitingCollector.


On Mon, Feb 24, 2020 at 2:51 PM  wrote:

Hi,-

I hope everyone is doing great.


i am trying to find an api to tell Lucene Index Searcher to stop after
0.5 seconds (when it takes longer than this).

Is there such an api or plan to implement one?


Best regards






--
Sincerely yours
Mikhail Khludnev








Re: How to tell Lucene index search to stop when it takes too long

2020-02-27 Thread baris . kazar

Hi,-

Sometimes the search takes too long even with PhraseWildcardQuery, so i 
would like to limit the search time via TimeLimitingCollector API.



Thanks to Mikhail and this Forum to inform me about this API.


i checked this IndexSearcher API with Collector parameter but that API 
does not have top n results as parameter:


https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/IndexSearcher.html


and the TimeLimitingCollector API:

https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/TimeLimitingCollector.html


Is there a simple example where both IndexSearcher and 
TimeLimitingCollector APIs are combined?


I found this snippet and i would like to add the missing pieces here:


Counter clock = ...;
long baseline = clock.get();
// ... prepare search -> i think this means create the query here
TimeLimitingCollector collector = new TimeLimitingCollector(c, clock, 
numTicks);

collector.setBaseline(baseline);
indexSearcher.search(query, collector);

// But i also would like to do the following: and how do i get results 
from TimeLimitingCollector in that so much allowed time?


int totalHits = collector.totalHits;

TopDocs topdocs = collector.topdocs(); // But i cant specify Top n docs 
here, right?



The collector is defined here

https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/Collector.html

https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/search/TopDocsCollector.html

Best regards



On 2/25/20 1:50 AM, baris.ka...@oracle.com wrote:

Will do, Thanks



On Feb 25, 2020, at 1:34 AM, Mikhail Khludnev  wrote:

Hello.

Meet org.apache.lucene.search.TimeLimitingCollector.


On Mon, Feb 24, 2020 at 2:51 PM  wrote:

Hi,-

I hope everyone is doing great.


i am trying to find an api to tell Lucene Index Searcher to stop after
0.5 seconds (when it takes longer than this).

Is there such an api or plan to implement one?


Best regards






--
Sincerely yours
Mikhail Khludnev





Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene

2020-02-26 Thread baris . kazar

Followup on this thread:

i ended up using a WildcardQuery with "*" at the end of the last token for
the PhraseWildcardQuery class from the sandbox,



i tested this class rigorously and i think it is ready to move it from 
sandbox jar to the appropriate release jar.


Is there a plan for that?


PhraseWildcardQuery is on average 3-4 times faster than the
ComplexPhraseQueryParser class and gives the same results.



I did some more naive enhancements on top of the ComplexPhraseQueryParser
results and i plan to do those with this new class, which will bring down
the execution time another 2 to 3 times.



Best regards


On 2/21/20 12:34 PM, baris.ka...@oracle.com wrote:

Hi,-

 Looks like the only way to use and test the new PhraseWildCardQuery 
class in Lucene 8.4.0 sandbox is to switch to Lucene 8.4.0 from Lucene 
7.7.2.


I thought i could adapt it to Lucene 7.7.2 but so far i saw i needed 
to change heavily 20+ classes and it will be way more than this.


So, if anybody wants to use this new amazing class, you need to be on
Lucene 8.4.0.


http://lucene.apache.org/core/8_4_0/sandbox/index.html

Best regards


On 2/19/20 5:41 PM, baris.ka...@oracle.com wrote:

Hi,-

 is there a JAR file for the classes in the 
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search 
and index and analysis directories?


https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search 
does not have PhraseWildcardQuery class, though.


As Michael mentioned, i pulled it from

https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java 




It seems that many classes in these directories are incompatible with 
Lucene Version 7.7.2. Probably these are from Lucene 8.x series.


It will be very nice to have a JAR file to be able to use all these 
classes together with Lucene 7.x versions.



Best regards


On 2/19/20 3:42 PM, baris.ka...@oracle.com wrote:

Hi,-

Thanks again Michael, David and Bruno and the Forum for letting me 
know this repository.


The version of PhraseWildCardQuery on 
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java
uses some classes not available in Lucene version 7.7.2.


There is a bunch of new and modified classes used by 
PhraseWildCardquery class such as QueryVisitor, ScoreMode etc.


I will try to add these classes from 
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene


and i hope it will work with Lucene 7.7.2.

Best regards



On 2/18/20 8:49 PM, baris.ka...@oracle.com wrote:

Michael and Forum,-
This is amazing, thanks.

i will try both cases.

i can also have "term1 term2Char1term2Char2*"
and so on with term2's next chars.

I hope the latest version on github for this
class works with Lucene Version 7.7.2.

Best regards


On Feb 18, 2020, at 8:33 PM, Michael Froh  wrote:


In your example, it looks like you wanted the second term to match 
based on the first character, or prefix, of the term.


While you could use a WildcardQuery with a term value of 
"term2FirstChar*", PrefixQuery seemed like the simpler approach. 
WildcardQuery can handle more general cases, like if you want to 
match on something like "a*b*c".


Technically, the PrefixQuery compiles down to a slightly simpler 
automaton, but I only figured that out by writing a simple unit test:


    public void testAutomata() {
        Automaton prefixAutomaton = PrefixQuery.toAutomaton(new 
BytesRef("a"));
        Automaton wildcardAutomaton = 
WildcardQuery.toAutomaton(new Term("foo", "a*"));


        System.out.println("PrefixQuery(\"a\")");
        System.out.println(prefixAutomaton.toDot());
        System.out.println("WildcardQuery(\"a*\")");
        System.out.println(wildcardAutomaton.toDot());
    }

That produces the following output:

PrefixQuery("a")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 1 [label="\\U0000-\\U00ff"]
}
WildcardQuery("a*")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 2 [label="\\U0000-\\U0010ffff"]
  2 [shape=doublecircle,label="2"]
  2 -> 2 [label="\\U0000-\\U0010ffff"]
}



On Tue, 18 Feb 2020 at 13:52,  wrote:


    Michael and Forum,-
    Thanks for the great explanations.

    one question please:

    why is PrefixQuery used instead of WildCardQuery in the below
 

Re: How to tell Lucene index search to stop when it takes too long

2020-02-24 Thread baris . kazar
Will do, Thanks


> On Feb 25, 2020, at 1:34 AM, Mikhail Khludnev  wrote:
> 
> Hello.
> 
> Meet org.apache.lucene.search.TimeLimitingCollector.
> 
>> On Mon, Feb 24, 2020 at 2:51 PM  wrote:
>> 
>> Hi,-
>> 
>> I hope everyone is doing great.
>> 
>> 
>> i am trying to find an api to tell Lucene Index Searcher to stop after
>> 0.5 seconds (when it takes longer than this).
>> 
>> Is there such an api or plan to implement one?
>> 
>> 
>> Best regards
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev





What is the Lucene 8.4.1 equivalent for StandardAnalyzer.STOP_WORDS_SET

2020-02-24 Thread baris . kazar

Hi,-

 I hope everyone is doing great.

What is the Lucene 8.4.1 equivalent for StandardAnalyzer.STOP_WORDS_SET?

https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html#STOP_WORDS_SET

https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html

Best regards




Lucene 7.7.2 Indexwriter.numDocs() replacement in Lucene 8.4.1

2020-02-24 Thread baris . kazar

Hi,-

 I hope everyone is doing great.


I think the Lucene 7.7.2  Indexwriter.numDocs()

https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#numDocs--

can be replaced by the following in Lucene 8.4.1, right?

https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/index/IndexWriter.html#getDocStats--
--->>> 
https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/index/IndexWriter.DocStats.html#numDocs


i.e., IndexWriter.DocStats.numDocs

Best regards
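Yes; a minimal sketch of the replacement (the in-memory directory is just for illustration, assuming lucene-core 8.4.1):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class NumDocsMigration {
  public static void main(String[] args) throws Exception {
    try (IndexWriter writer = new IndexWriter(
        new ByteBuffersDirectory(), new IndexWriterConfig(new StandardAnalyzer()))) {
      // 7.7.2: int n = writer.numDocs();
      // 8.4.1: DocStats snapshots numDocs and maxDoc in one consistent view.
      IndexWriter.DocStats stats = writer.getDocStats();
      System.out.println(stats.numDocs + " docs, " + stats.maxDoc + " max");
    }
  }
}
```

One advantage of getDocStats() over the old numDocs()/maxDoc() pair is that both counts come from a single snapshot, so they cannot disagree under concurrent updates.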







Re: Lucene 7.7.2 Indexwriter.numDocs() replacement in Lucene 8.4.1

2020-02-24 Thread baris . kazar

A typo corrected below.

Best regards


On 2/24/20 5:54 PM, baris.ka...@oracle.com wrote:

Hi,-

 I hope everyone is doing great.


I think the Lucene 7.7.2  Indexwriter.numDocs()

https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#numDocs-- 



can be replaced by the following in Lucene 8.4.1, right?

https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/index/IndexWriter.html#getDocStats-- 

--->>> 
https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/index/IndexWriter.DocStats.html#numDocs


i.e., IndexWriter.getDocStats().numDocs

Best regards





How to tell Lucene index search to stop when it takes too long

2020-02-24 Thread baris . kazar

Hi,-

I hope everyone is doing great.


i am trying to find an api to tell Lucene Index Searcher to stop after 
0.5 seconds (when it takes longer than this).


Is there such an api or plan to implement one?


Best regards






Re: Lucene download page

2020-02-24 Thread baris . kazar

Thanks Erick and the Forum.

Best regards


On 2/23/20 8:32 AM, Erick Erickson wrote:

No, 7.7.2 was a patch fix that _was_ released after 8.1.1.


On Feb 22, 2020, at 2:49 PM, baris.ka...@oracle.com wrote:

Hi,-

  i hope everyone is doing great.

Lucene 7.7.2 is listed as released after Lucene 8.1.1 on this page:
https://lucene.apache.org/core/corenews.html#apache-lucenetm-841-available

I think the order may need to change there.

Best regards












Lucene download page

2020-02-22 Thread baris . kazar

Hi,-

 i hope everyone is doing great.

Lucene 7.7.2 is listed as released after Lucene 8.1.1 on
this page 
https://lucene.apache.org/core/corenews.html#apache-lucenetm-841-available


I think the order may need to change there.

Best regards






Re: SingleTerm vs MultiTerm in PhraseWildCardQuery class in the sandbox Lucene

2020-02-21 Thread baris . kazar

Hi,-

 Looks like the only way to use and test the new PhraseWildCardQuery 
class in Lucene 8.4.0 sandbox is to switch to Lucene 8.4.0 from Lucene 
7.7.2.


I thought i could adapt it to Lucene 7.7.2 but so far i saw i needed to 
change heavily 20+ classes and it will be way more than this.


So, if anybody wants to use this new amazing class, you need to be on Lucene
8.4.0.


http://lucene.apache.org/core/8_4_0/sandbox/index.html

Best regards


On 2/19/20 5:41 PM, baris.ka...@oracle.com wrote:

Hi,-

 is there a JAR file for the classes in the 
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search 
and index and analysis directories?


https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/search 
does not have PhraseWildcardQuery class, though.


As Michael mentioned, i pulled it from

https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java 




It seems that many classes in these directories are incompatible with 
Lucene Version 7.7.2. Probably these are from Lucene 8.x series.


It will be very nice to have a JAR file to be able to use all these 
classes together with Lucene 7.x versions.



Best regards


On 2/19/20 3:42 PM, baris.ka...@oracle.com wrote:

Hi,-

Thanks again Michael, David and Bruno and the Forum for letting me 
know this repository.


The version of PhraseWildcardQuery on 
https://github.com/apache/lucene-solr/blob/master/lucene/sandbox/src/java/org/apache/lucene/search/PhraseWildcardQuery.java 
uses some classes not available in Lucene version 7.7.2.


There are a bunch of new and modified classes used by the 
PhraseWildcardQuery class, such as QueryVisitor, ScoreMode, etc.


I will try to add these classes from 
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene


and I hope it will work with Lucene 7.7.2.

Best regards



On 2/18/20 8:49 PM, baris.ka...@oracle.com wrote:

Michael and Forum,-
This is amazing, thanks.

I will try both cases.

I can also have "term1 term2Char1term2Char2*"
and so on with term2's next characters.

I hope the latest version on github for this
class works with Lucene Version 7.7.2.

Best regards


On Feb 18, 2020, at 8:33 PM, Michael Froh  wrote:


In your example, it looks like you wanted the second term to match 
based on the first character, or prefix, of the term.


While you could use a WildcardQuery with a term value of 
"term2FirstChar*", PrefixQuery seemed like the simpler approach. 
WildcardQuery can handle more general cases, like if you want to 
match on something like "a*b*c".


Technically, the PrefixQuery compiles down to a slightly simpler 
automaton, but I only figured that out by writing a simple unit test:


    // Requires Lucene 8.x imports: org.apache.lucene.index.Term,
    // org.apache.lucene.search.PrefixQuery, org.apache.lucene.search.WildcardQuery,
    // org.apache.lucene.util.BytesRef, org.apache.lucene.util.automaton.Automaton
    public void testAutomata() {
        Automaton prefixAutomaton = PrefixQuery.toAutomaton(new BytesRef("a"));
        Automaton wildcardAutomaton = WildcardQuery.toAutomaton(new Term("foo", "a*"));


        System.out.println("PrefixQuery(\"a\")");
        System.out.println(prefixAutomaton.toDot());
        System.out.println("WildcardQuery(\"a*\")");
        System.out.println(wildcardAutomaton.toDot());
    }

That produces the following output:

PrefixQuery("a")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 1 [label="\\U-\\U00ff"]
}
WildcardQuery("a*")
digraph Automaton {
  rankdir = LR
  node [width=0.2, height=0.2, fontsize=8]
  initial [shape=plaintext,label=""]
  initial -> 0
  0 [shape=circle,label="0"]
  0 -> 1 [label="a"]
  1 [shape=doublecircle,label="1"]
  1 -> 2 [label="\\U-\\U0010"]
  2 [shape=doublecircle,label="2"]
  2 -> 2 [label="\\U-\\U0010"]
}



On Tue, 18 Feb 2020 at 13:52,  wrote:


    Michael and Forum,-
    Thanks for the great explanations.

    one question please:

    why is PrefixQuery used instead of WildCardQuery in the below
    snippet?

    Best regards

    > On Feb 17, 2020, at 3:01 PM, Michael Froh  wrote:
    >
    > Hi Baris,
    >
    > The idea with PhraseWildcardQuery is that you can mix literal
    "exact" terms
    > with "MultiTerms" (i.e. any subclass of MultiTermQuery). Using
    addTerm is
    > for exact terms, while addMultiTerm is for things that may
    match a number
    > of possible terms in the given position.
    >
    > If you want to search for term1 followed by any term that
    starts with a
    > given character, I would suggest using:
    >
    > int maxMultiTermExpansions = ...; // Discussed below
    > PhraseWi
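For anyone curious what the truncated example above was building toward, here is a self-contained sketch, not Lucene code, of the matching semantics PhraseWildcardQuery offers: a phrase where each position is either an exact term or a multi-term matcher such as a prefix. Every name in this sketch (PhraseSketch, phraseMatches, exact, prefix) is an illustrative stand-in, not a Lucene API.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of phrase-with-wildcard matching over an analyzed token stream.
// Each phrase position is a predicate over a single token: exact(...) plays
// the role of addTerm, prefix(...) the role of addMultiTerm with a prefix.
public class PhraseSketch {

    // Returns true if the phrase matches at any consecutive token window.
    static boolean phraseMatches(List<String> tokens, List<Predicate<String>> phrase) {
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size(); i++) {
                if (!phrase.get(i).test(tokens.get(start + i))) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    static Predicate<String> exact(String term) {
        return term::equals;
    }

    static Predicate<String> prefix(String p) {
        return t -> t.startsWith(p);
    }

    public static void main(String[] args) {
        List<String> doc = List.of("fast", "search", "engine");
        // Analogous to: exact term "search" followed by prefix "en*"
        System.out.println(phraseMatches(doc, List.of(exact("search"), prefix("en")))); // true
        System.out.println(phraseMatches(doc, List.of(exact("search"), prefix("fa")))); // false
    }
}
```

The real sandbox class does this against the index's term dictionary with an expansion budget (maxMultiTermExpansions), rather than scanning tokens linearly as this toy does.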

StandardFilter and StandardFilterFactory removed in Lucene 8.x

2020-02-21 Thread baris . kazar

Hi,-

I hope everyone is doing great.


What replaces these classes in Lucene 8.x?

https://issues.apache.org/jira/browse/LUCENE-8356 says they presumably 
do nothing. Is that certain please?



On the other hand: I see that (for example) the Query class has 
changed quite a lot for anyone switching from Lucene 7.x to 
Lucene 8.x.


For example: the boolean parameter needsScores becomes a ScoreMode 
parameter in some public Java APIs, such as the createWeight method of 
the Query class:


https://lucene.apache.org/core/7_0_0/core/org/apache/lucene/search/Query.html

https://lucene.apache.org/core/8_1_1/core/org/apache/lucene/search/Query.html
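To make that signature change concrete: in 7.x createWeight takes a boolean needsScores, while in 8.x it takes a ScoreMode. The sketch below is illustrative, not Lucene source; the local ScoreMode enum only mirrors a few of the real 8.x constant names, and fromNeedsScores is a hypothetical helper showing the most common translation when porting custom Query subclasses.

```java
// Stand-in for org.apache.lucene.search.ScoreMode; these three constant
// names mirror real Lucene 8.x values (the real enum has more).
enum ScoreMode { COMPLETE, COMPLETE_NO_SCORES, TOP_SCORES }

public class CreateWeightMigration {

    // Hypothetical porting helper: maps the 7.x boolean to the 8.x enum.
    // 7.x: createWeight(IndexSearcher, boolean needsScores, float boost)
    // 8.x: createWeight(IndexSearcher, ScoreMode scoreMode, float boost)
    static ScoreMode fromNeedsScores(boolean needsScores) {
        return needsScores ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES;
    }

    public static void main(String[] args) {
        System.out.println(fromNeedsScores(true));  // COMPLETE
        System.out.println(fromNeedsScores(false)); // COMPLETE_NO_SCORES
    }
}
```

In other words, a 7.x caller passing needsScores=true usually becomes ScoreMode.COMPLETE in 8.x, and needsScores=false becomes ScoreMode.COMPLETE_NO_SCORES.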


Best regards



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


