Re: Indexed Data Size

2019-08-13 Thread Greg Harris
Brett, it’s probably because you hit the default 5 GB maximum segment size in
Solr, and a segment that large is only merged once a large share of the docs
in it are marked as deleted. So even if many of the docs within a segment are
deleted, the segment is still there, happily taking up space. That could
theoretically be a reason for an optimize, but you’d want to specify
maxSegments so that you do not merge the entire index down to a single
segment. Ideally you should just keep as many of the logs as you actually use
(which is hopefully fewer than you are keeping now). Since the segments will
be somewhat time-based, they should eventually disappear or merge over time,
hopefully removing any reason to consider an optimize.
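
For illustration only, a minimal sketch of an optimize request that caps the segment count instead of merging to one segment. It relies on the update handler's `optimize` and `maxSegments` parameters; the host, the collection name `system_logs`, and the segment count of 20 are placeholders, not values anyone in this thread recommended.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_optimize_url(base_url: str, core: str, max_segments: int) -> str:
    """Build an update-handler URL asking for an optimize down to at most
    max_segments segments rather than a single segment."""
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    params = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{base_url.rstrip('/')}/{core}/update?{params}"


def run_optimize(base_url: str, core: str, max_segments: int) -> int:
    """Send the request and return the HTTP status (needs a reachable Solr)."""
    with urlopen(build_optimize_url(base_url, core, max_segments)) as resp:
        return resp.status


if __name__ == "__main__":
    # Placeholder host, core, and segment count; adjust before using.
    print(build_optimize_url("http://localhost:8983/solr", "system_logs", 20))
```

Only `build_optimize_url` runs without a server; `run_optimize` is left uncalled because an optimize rewrites segments and can need substantial free disk space while it runs.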

Greg

On Tue, Aug 13, 2019 at 3:31 PM Moyer, Brett  wrote:

> Turns out this is due to a job that indexes logs. We were able to clear
> some with another job. We are working through the value of these indexed
> logs. Thanks for all your help!
>
> Brett Moyer
> Manager, Sr. Technical Lead | TFS Technology
>   Public Production Support
>   Digital Search & Discovery
>
> 8625 Andrew Carnegie Blvd | 4th floor
> Charlotte, NC 28263
> Tel: 704.988.4508
> Fax: 704.988.4907
> bmo...@tiaa.org
>
> -Original Message-
> From: Shawn Heisey 
> Sent: Friday, August 9, 2019 2:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexed Data Size
>
> On 8/9/2019 12:17 PM, Moyer, Brett wrote:
> > The biggest is /data/solr/system_logs_shard1_replica_n1/data/index,
> > files with the extensions I stated previously. Each is 5 GB and there are
> > a few hundred. They date back over the last 3 months. I don’t understand
> > why there are so many files with such small indexes. Not sure how to
> > clean them up.
>
> Can you get a screenshot of the core overview for that particular core?
> Solr should correctly calculate the size on the overview based on what
> files are actually in the index directory.
>
> Thanks,
> Shawn
> *
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender
> immediately and then delete it.
>
> TIAA
> *
>


RE: Indexed Data Size

2019-08-13 Thread Moyer, Brett
Turns out this is due to a job that indexes logs. We were able to clear some 
with another job. We are working through the value of these indexed logs. 
Thanks for all your help!
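
The thread does not say how the cleanup job works. One common approach is a delete-by-query on a date field; the sketch below only builds the JSON body, and it assumes a hypothetical date field named `timestamp` and a 90-day retention window, neither of which comes from this thread.

```python
import json


def build_expiry_delete(date_field: str, keep_days: int) -> str:
    """Return a JSON delete-by-query body for docs older than keep_days."""
    if keep_days < 1:
        raise ValueError("keep_days must be at least 1")
    # Solr date math: everything up to keep_days before the current time.
    query = f"{date_field}:[* TO NOW-{keep_days}DAYS]"
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    # "timestamp" is a hypothetical field name; use the schema's real date field.
    print(build_expiry_delete("timestamp", 90))
```

The body would be posted to the collection's `/update` handler as `application/json`, followed by a commit. Deleting documents only marks them as deleted; the disk space comes back when the affected segments are merged.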

Brett Moyer

-Original Message-
From: Shawn Heisey  
Sent: Friday, August 9, 2019 2:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On 8/9/2019 12:17 PM, Moyer, Brett wrote:
> The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, files
> with the extensions I stated previously. Each is 5 GB and there are a few
> hundred. They date back over the last 3 months. I don’t understand why there
> are so many files with such small indexes. Not sure how to clean them up.

Can you get a screenshot of the core overview for that particular core? 
Solr should correctly calculate the size on the overview based on what files 
are actually in the index directory.

Thanks,
Shawn


Re: Indexed Data Size

2019-08-09 Thread Shawn Heisey

On 8/9/2019 12:17 PM, Moyer, Brett wrote:

The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, files with
the extensions I stated previously. Each is 5 GB and there are a few hundred.
They date back over the last 3 months. I don’t understand why there are so many
files with such small indexes. Not sure how to clean them up.


Can you get a screenshot of the core overview for that particular core? 
Solr should correctly calculate the size on the overview based on what 
files are actually in the index directory.
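
As a cross-check against the size shown on the overview page, the files in the index directory can be summed directly. This is a minimal sketch, not anything Solr ships; the path is whatever the overview lists as the "Index" directory.

```python
import os


def index_dir_size(index_dir: str) -> int:
    """Return the total size in bytes of regular files directly in index_dir."""
    total = 0
    with os.scandir(index_dir) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; count only real files.
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

If this number is far larger than what the overview reports, the extra space is probably coming from somewhere other than the live index directory.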


Thanks,
Shawn


RE: Indexed Data Size

2019-08-09 Thread Moyer, Brett
Correct, our indexes are small document-wise, but for some reason we have a 
year's worth of files in the data/solr folders. There are no index. 
files.

The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, files with 
the extensions I stated previously. Each is 5 GB and there are a few hundred. 
They date back over the last 3 months. I don’t understand why there are so many 
files with such small indexes. Not sure how to clean them up. 

-Original Message-
From: Shawn Heisey  
Sent: Friday, August 9, 2019 9:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On 8/9/2019 6:12 AM, Moyer, Brett wrote:
> Thanks! We update each index nightly, we don’t clear, but bring in New and 
> Deltas, delete expired/404. All our data are basically webpages, so none are 
> very large. Some PDFs but again not too large. We are running Solr 7.5, 
> hopefully you can access the links.

Solr is saying that the entire size of the index directory is 95 MB for one of 
those indexes and the other is 30 MB.  Those sound to me like very small 
indexes, not very large like you indicated.  You were saying that the large 
files were in data/index, and did not mention anything about index. 
directories.

If you do have a bunch of index. directories in the "Data" 
directory mentioned on the Core overview page, you can safely delete all of the 
index and/or index.* directories under that directory EXCEPT the one that is 
indicated as the "Index" directory.  If you delete that one, you're deleting 
the actual live index ... and since you're not on Windows, the OS will let you 
delete it without complaining.

The directory locations are cut off on both screenshots, so I can't confirm 
anything there.

The larger core has about 2000 deleted docs and the smaller one has 40. 
Doing an optimize will not save much disk space or take very long.

Thanks,
Shawn


Re: Indexed Data Size

2019-08-09 Thread Shawn Heisey

On 8/9/2019 6:12 AM, Moyer, Brett wrote:

Thanks! We update each index nightly, we don’t clear, but bring in New and 
Deltas, delete expired/404. All our data are basically webpages, so none are 
very large. Some PDFs but again not too large. We are running Solr 7.5, 
hopefully you can access the links.


Solr is saying that the entire size of the index directory is 95 MB for 
one of those indexes and the other is 30 MB.  Those sound to me like 
very small indexes, not very large like you indicated.  You were saying 
that the large files were in data/index, and did not mention anything 
about index. directories.


If you do have a bunch of index. directories in the "Data" 
directory mentioned on the Core overview page, you can safely delete all 
of the index and/or index.* directories under that directory EXCEPT the 
one that is indicated as the "Index" directory.  If you delete that one, 
you're deleting the actual live index ... and since you're not on 
Windows, the OS will let you delete it without complaining.
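
A cautious way to apply the rule above is to list the candidate directories first and delete nothing automatically. In this sketch both paths are supplied by the caller, taken from the "Data" and "Index" entries on the Core overview page; the function only reports, it does not remove anything.

```python
from pathlib import Path


def stale_index_dirs(data_dir: str, live_index_dir: str) -> list[Path]:
    """List index / index.* directories under data_dir other than the live one."""
    live = Path(live_index_dir).resolve()
    candidates = []
    for child in sorted(Path(data_dir).iterdir()):
        is_index_dir = child.name == "index" or child.name.startswith("index.")
        # Plain files such as index.properties are skipped by the is_dir check.
        if child.is_dir() and is_index_dir and child.resolve() != live:
            candidates.append(child)
    return candidates
```

Reviewing the returned list by hand before removing anything keeps the live index out of harm's way.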


The directory locations are cut off on both screenshots, so I can't 
confirm anything there.


The larger core has about 2000 deleted docs and the smaller one has 40. 
Doing an optimize will not save much disk space or take very long.


Thanks,
Shawn


RE: Indexed Data Size

2019-08-09 Thread Moyer, Brett
Thanks! We update each index nightly, we don’t clear, but bring in New and 
Deltas, delete expired/404. All our data are basically webpages, so none are 
very large. Some PDFs but again not too large. We are running Solr 7.5, 
hopefully you can access the links.

https://www.dropbox.com/s/lzd6hkoikhagujs/CoreOne.png?dl=0
https://www.dropbox.com/s/ae6rayb38q39u9c/CoreTwo.png?dl=0

Brett
-Original Message-
From: Erick Erickson  
Sent: Thursday, August 8, 2019 5:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On the surface, this makes no sense at all, so there’s something I don’t 
understand here ;). 

How often do you update your index? Having files from a long time ago is 
perfectly reasonable if you’re not updating regularly.

But your statement that some of these are huge for just a 50K document index is 
odd unless they’re _huge_ documents.

I wouldn’t optimize unless you’re on Solr 7.5+, as before that version it’ll 
create a single segment, see: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

The extensions you mentioned are perfectly reasonable. Each segment is made up 
of multiple files. .fdt for instance contains stored data. See: 
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene62/package-summary.html

Can you give us a long listing of one of your index directories?

Best,
Erick

> On Aug 8, 2019, at 5:17 PM, Moyer, Brett  wrote:
> 
> In our data/solr//data/index on the filesystem, we have files 
> that go back 1 year. I don’t understand why and I doubt they are in use. 
> Files with extensions like fdx,cfe,doc,pos,tip,dvm etc. Some of these are 
> very large and running us out of server space. Our search indexes themselves 
> are not large, in total we might have 50k documents.  How can I reduce this 
> /data/solr space? Is this what the Solr Optimize command is for? Thanks!
> 
> Brett
> 


Re: Indexed Data Size

2019-08-08 Thread Shawn Heisey

On 8/8/2019 3:17 PM, Moyer, Brett wrote:

In our data/solr//data/index on the filesystem, we have files 
that go back 1 year. I don’t understand why and I doubt they are in use. Files with 
extensions like fdx,cfe,doc,pos,tip,dvm etc. Some of these are very large and running 
us out of server space. Our search indexes themselves are not large, in total we 
might have 50k documents.  How can I reduce this /data/solr space? Is this what the 
Solr Optimize command is for? Thanks!


+1 to everything Erick said.

Another piece of information that could be helpful is a screenshot of 
the core overview in the admin UI.  It would look something like this:


https://www.dropbox.com/s/mbh6ll1v8ghloko/solr-core-overview.png?dl=0

To get that, just go to the admin UI and choose one of the big cores 
from the core dropdown.  That should put you on the overview tab for the 
core.  Then grab a screenshot and use a file sharing site to share it.


Thanks,
Shawn


Re: Indexed Data Size

2019-08-08 Thread Erick Erickson
On the surface, this makes no sense at all, so there’s something I don’t 
understand here ;). 

How often do you update your index? Having files from a long time ago is 
perfectly reasonable if you’re not updating regularly.

But your statement that some of these are huge for just a 50K document index is 
odd unless they’re _huge_ documents.

I wouldn’t optimize unless you’re on Solr 7.5+, as before that version it’ll 
create a single segment, see: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

The extensions you mentioned are perfectly reasonable. Each segment is made up 
of multiple files. .fdt for instance contains stored data. See: 
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene62/package-summary.html

Can you give us a long listing of one of your index directories?
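
Alongside a plain long listing, grouping the files by segment can show which segments hold the space. This sketch assumes the usual Lucene naming, where a segment's files share a prefix such as `_0` or `_a3` followed by `.` or `_`; anything else (for example `segments_N` or `write.lock`) is counted under "other".

```python
import os
import re
from collections import defaultdict

# Segment files look like _0.fdt or _0_Lucene50_0.doc; capture the "_0" part.
SEGMENT_PREFIX = re.compile(r"^(_[0-9a-z]+)[._]")


def size_by_segment(index_dir: str) -> dict[str, int]:
    """Sum file sizes per segment; unprefixed files are grouped as 'other'."""
    totals: dict[str, int] = defaultdict(int)
    with os.scandir(index_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            match = SEGMENT_PREFIX.match(entry.name)
            key = match.group(1) if match else "other"
            totals[key] += entry.stat().st_size
    return dict(totals)
```

Sorting the result by size makes a few hundred multi-gigabyte segments easy to spot.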

Best,
Erick

> On Aug 8, 2019, at 5:17 PM, Moyer, Brett  wrote:
> 
> In our data/solr//data/index on the filesystem, we have files 
> that go back 1 year. I don’t understand why and I doubt they are in use. 
> Files with extensions like fdx,cfe,doc,pos,tip,dvm etc. Some of these are 
> very large and running us out of server space. Our search indexes themselves 
> are not large, in total we might have 50k documents.  How can I reduce this 
> /data/solr space? Is this what the Solr Optimize command is for? Thanks!
> 
> Brett
> 