Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread baris . kazar
Great answer, thanks Michael.

Yes, the difference was large: more than 1GB.
Best regards

> On Nov 13, 2020, at 1:49 PM, Michael Sokolov  wrote:
> 
> You can't directly compare disk usage across two indexes, even with
> the same data. Try re-indexing one of your datasets, and you will see
> that the disk size is not the same. Mostly this is because the way
> segments are merged varies with some randomness from one run to
> another. Although the size of the difference you report is pretty
> large, it is not out of the question that it could occur, especially if
> you have a large number of deletions or updates to existing documents.
> If you want to get a more accurate idea of the amount of space taken
> up by your index, you could try calling IndexWriter.forceMerge(1);
> this will merge your index to a single segment, eliminating waste. It
> is not generally recommended to do this for indexes you use for
> querying, but it can be a useful tool for analysis.

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread Michael Sokolov
You can't directly compare disk usage across two indexes, even with
the same data. Try re-indexing one of your datasets, and you will see
that the disk size is not the same. Mostly this is because the way
segments are merged varies with some randomness from one run to
another. Although the size of the difference you report is pretty
large, it is not out of the question that it could occur, especially if
you have a large number of deletions or updates to existing documents.
If you want to get a more accurate idea of the amount of space taken
up by your index, you could try calling IndexWriter.forceMerge(1);
this will merge your index to a single segment, eliminating waste. It
is not generally recommended to do this for indexes you use for
querying, but it can be a useful tool for analysis.
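
For reference, a minimal sketch of the forceMerge(1) suggestion above. It assumes an existing index on local disk and that you run it against a copy rather than a live index; the class name, the helper names and the path argument are made up for illustration, and StandardAnalyzer is only a placeholder since no documents are added.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeForAnalysis {

    /** Sums the length of every file currently present in the directory. */
    static long sizeInBytes(Directory dir) throws IOException {
        long total = 0;
        for (String name : dir.listAll()) {
            total += dir.fileLength(name);
        }
        return total;
    }

    /** Merges the existing index at indexPath to one segment, then measures it. */
    static long mergeAndMeasure(Path indexPath) throws IOException {
        try (Directory dir = FSDirectory.open(indexPath)) {
            // The analyzer does not matter here because no documents are added.
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            // APPEND fails if the path holds no index, instead of creating an empty one.
            cfg.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                writer.forceMerge(1); // waits until the merge has finished
                writer.commit();
            }
            return sizeInBytes(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: hypothetical path to a COPY of the index being analyzed.
        long bytes = mergeAndMeasure(Paths.get(args[0]));
        System.out.printf("size after forceMerge(1): %.2f GB%n", bytes / 1e9);
    }
}
```

Running this on a copy of each of the two indexes should give sizes that are more comparable than the raw directory sizes, because deleted documents and leftover small segments are merged away first.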

On Fri, Nov 13, 2020 at 1:01 PM  wrote:
>
> Nothing changed between the two index generations except that the data
> changed a bit, as I described.
>
> The size I am reporting is the size of the directory where all index
> files are stored, measured once Lucene is done generating the index.
>
> I don't know about deleted docs. How do you trace that? Yes, the queries
> run exactly the same way (same number of results); most of the time just
> the order changes, which is fine, or a few different entries show up, and
> I don't know why, since the lowercase filter should normalize the text
> even if the casing of the original data changes.
>
> Yes, I am absolutely sure nothing else changed. I kept all those things
> the same across the two runs.
>
> Actually, does the Lucene repository have this kind of experiment across
> versions (major or minor versions)?
>
> If I were Lucene, I would run these experiments to see the impact on the
> resulting index. This would help find potential unidentified bugs.
>
> Methodology:
>
> Have a large dataset, e.g. 15 million docs.
>
> Run indexing with very common settings each time a new version comes out.
>
>
> I am not using Solr, just pure Lucene 7.7.2. This information was in my
> other email here; let me copy and paste it:
>
>
>
> = previous email
>
> On a related issue, with version 7.7.2 I experienced this:
>
> data is all lower case (same number of docs as in the next case)
>
> vs
>
> data is camel case, except the last word, which is always in capital letters
>
>
> I used the lowercase filter in the indexer in both cases, so indexing is
> done in all lower case. The index size in the first case is about 9.5GB,
> but the index size for the same amount of data in the second case was 11GB.
>
>
> What causes such a difference and increase in index size? The number of
> docs is the same in both cases.
>
>
> Best regards
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread baris . kazar
Nothing changed between the two index generations except that the data
changed a bit, as I described.


The size I am reporting is the size of the directory where all index
files are stored, measured once Lucene is done generating the index.
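
For clarity, the measurement described here amounts to something like the sketch below, which uses only the JDK. The class name and the two command-line paths are hypothetical. Note that it counts every file on disk, including space still held by deleted documents and by segments that have not been merged yet, so two runs over similar data can legitimately report different totals.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IndexDirSize {

    /** Sums the sizes of all regular files under dir, recursively. */
    static long totalBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                    .mapToLong(p -> {
                        try {
                            return Files.size(p);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    })
                    .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] and args[1]: hypothetical paths to the two index directories.
        long first = totalBytes(Paths.get(args[0]));
        long second = totalBytes(Paths.get(args[1]));
        System.out.printf("first:  %.2f GB%n", first / 1e9);
        System.out.printf("second: %.2f GB%n", second / 1e9);
        System.out.printf("difference: %.2f GB%n", (second - first) / 1e9);
    }
}
```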


I don't know about deleted docs. How do you trace that? Yes, the queries
run exactly the same way (same number of results); most of the time just
the order changes, which is fine, or a few different entries show up, and
I don't know why, since the lowercase filter should normalize the text
even if the casing of the original data changes.


Yes, I am absolutely sure nothing else changed. I kept all those things
the same across the two runs.


Actually, does the Lucene repository have this kind of experiment across
versions (major or minor versions)?


If I were Lucene, I would run these experiments to see the impact on the
resulting index. This would help find potential unidentified bugs.


Methodology:

Have a large dataset, e.g. 15 million docs.

Run indexing with very common settings each time a new version comes out.


I am not using Solr, just pure Lucene 7.7.2. This information was in my
other email here; let me copy and paste it:




= previous email 

On a related issue, with version 7.7.2 I experienced this:

data is all lower case (same number of docs as in the next case)

vs

data is camel case, except the last word, which is always in capital letters


I used the lowercase filter in the indexer in both cases, so indexing is
done in all lower case. The index size in the first case is about 9.5GB,
but the index size for the same amount of data in the second case was 11GB.


What causes such a difference and increase in index size? The number of
docs is the same in both cases.



Best regards



On 11/13/20 7:39 AM, Erick Erickson wrote:

> What does “final finished sizes” mean? After optimize, or just after
> finishing all indexing? The former is what counts here.
>
> And you provided no information on the number of deleted docs in the two
> cases. Is the number of deletedDocs the same (or close)? And does the
> q=*:* query return the same numFound?
>
> Finally, are you absolutely and totally sure that no other options
> changed? For instance, did you specify docValues=true for some field in
> one but not the other? Or stored=true, etc.? And are you using the same
> schema?
>
> And you also haven’t provided information on what versions of Solr you’re
> talking about. You mention 7.7.2, but not the _other_ version of Solr. If
> you’re going from one major version to another, sometimes defaults
> change, especially for docValues on primitive fields. I’d consider firing
> up Luke and examining the field definitions in detail.
>
> Best,
> Erick





Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread Erick Erickson
What does “final finished sizes” mean? After optimize, or just after
finishing all indexing? The former is what counts here.

And you provided no information on the number of deleted docs in the two
cases. Is the number of deletedDocs the same (or close)? And does the
q=*:* query return the same numFound?

Finally, are you absolutely and totally sure that no other options
changed? For instance, did you specify docValues=true for some field in
one but not the other? Or stored=true, etc.? And are you using the same
schema?

And you also haven’t provided information on what versions of Solr you’re
talking about. You mention 7.7.2, but not the _other_ version of Solr. If
you’re going from one major version to another, sometimes defaults
change, especially for docValues on primitive fields. I’d consider firing
up Luke and examining the field definitions in detail.
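
For a plain Lucene index, a rough programmatic counterpart to inspecting the field definitions in Luke could look like the sketch below. The class and method names are hypothetical. It reads the per-segment FieldInfos and reports index options, docValues type, term vectors and norms for each field. As far as I know, whether a field is stored is not recorded in FieldInfo, so that still has to be checked by loading a few documents.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FieldOptionsDump {

    /** Maps each field name to a one-line summary of its index-time options. */
    static Map<String, String> describeFields(Directory dir) throws IOException {
        Map<String, String> out = new TreeMap<>();
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            // Walk every segment; a field's options are normally the same in each.
            for (LeafReaderContext leaf : reader.leaves()) {
                for (FieldInfo fi : leaf.reader().getFieldInfos()) {
                    out.put(fi.name,
                            "indexOptions=" + fi.getIndexOptions()
                            + " docValues=" + fi.getDocValuesType()
                            + " termVectors=" + fi.hasVectors()
                            + " omitNorms=" + fi.omitsNorms());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a hypothetical path to one of the indexes to compare.
        for (String path : args) {
            try (Directory dir = FSDirectory.open(Paths.get(path))) {
                System.out.println(path);
                describeFields(dir).forEach((name, opts) ->
                        System.out.println("  " + name + ": " + opts));
            }
        }
    }
}
```

Diffing the output for the two indexes is a quick way to spot a field that has docValues or term vectors in one index but not in the other.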

Best,
Erick

> On Nov 13, 2020, at 12:16 AM, baris.ka...@oracle.com wrote:
> 
> Hi,-
> Thanks.
> These are final finished sizes in both cases.
> Best regards



Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread baris . kazar
Hi,-
Thanks.
These are final finished sizes in both cases.
Best regards


> On Nov 12, 2020, at 11:12 PM, Erick Erickson  wrote:
> 
> Yes, that issue is fixed. The “Resolution” tag is the key: it’s marked
> “fixed” and the version is 8.0.
>
> As for your other question, index size is a very imprecise number. How
> many deleted documents are there in each case? Deleted documents take up
> disk space until the segments containing them are merged away.
> 
> Best,
> Erick



Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread Erick Erickson
Yes, that issue is fixed. The “Resolution” tag is the key: it’s marked
“fixed” and the version is 8.0.

As for your other question, index size is a very imprecise number. How
many deleted documents are there in each case? Deleted documents take up
disk space until the segments containing them are merged away.
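
One way to count deleted documents in a plain Lucene index (no Solr) is to open a reader and compare maxDoc() with numDocs(). The sketch below assumes an index on local disk; the class and method names are made up for illustration. numDocs() counts live documents only, which is also what a match-all query would report, while maxDoc() still includes documents that are marked deleted but not yet merged away. Updating an existing document counts as a delete plus an add.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DeletedDocsReport {

    /** Returns {maxDoc, numDocs, numDeletedDocs} for the latest commit in dir. */
    static int[] docCounts(Directory dir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return new int[] {
                reader.maxDoc(),         // live + deleted documents
                reader.numDocs(),        // live documents only
                reader.numDeletedDocs()  // maxDoc - numDocs
            };
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a hypothetical path to one of the indexes to compare.
        for (String path : args) {
            try (Directory dir = FSDirectory.open(Paths.get(path))) {
                int[] c = docCounts(dir);
                double pct = c[0] == 0 ? 0.0 : 100.0 * c[2] / c[0];
                System.out.printf("%s: maxDoc=%d numDocs=%d deleted=%d (%.1f%%)%n",
                        path, c[0], c[1], c[2], pct);
            }
        }
    }
}
```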

Best,
Erick

> On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:
> 
> https://issues.apache.org/jira/browse/LUCENE-8448
> 
> 
> Hi,-
> 
> Is this issue fixed, please? Could you please help me figure it out?
> 
> Best regards
> 
> 



Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread baris . kazar

On a related issue, with version 7.7.2 I experienced this:

data is all lower case (same number of docs as in the next case)

vs

data is camel case, except the last word, which is always in capital letters


I used the lowercase filter in the indexer in both cases, so indexing is
done in all lower case. The index size in the first case is about 9.5GB,
but the index size for the same amount of data in the second case was 11GB.


What causes such a difference and increase in index size? The number of
docs is the same in both cases.
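
To check the assumption that both casings end up as the same indexed terms, one can run sample values through the analyzer and compare the tokens. The sketch below uses StandardAnalyzer, which lowercases, purely as a stand-in, since the actual analyzer chain is not given in this thread; the class name and the sample strings are made up. If the tokens match, the inverted index should hold the same terms in both cases. Stored field values, if any, keep the original casing, because stored content is not analyzed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LowercaseTokensDemo {

    /** Returns the terms the analyzer would index for the given text. */
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            // Made-up sample values shaped like the two datasets described above.
            System.out.println(tokens(analyzer, "main street boston"));
            System.out.println(tokens(analyzer, "Main Street BOSTON"));
        }
    }
}
```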



Best regards


On 11/12/20 5:35 PM, baris.ka...@oracle.com wrote:
> https://issues.apache.org/jira/browse/LUCENE-8448
>
> Hi,-
>
> Is this issue fixed, please? Could you please help me figure it out?
>
> Best regards






https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread baris . kazar

https://issues.apache.org/jira/browse/LUCENE-8448


Hi,-

Is this issue fixed, please? Could you please help me figure it out?

Best regards


