RE: Doubt about index size
An optimize takes lots of CPU and I/O since it has to rewrite your indexes, so only do it when necessary. You can just use curl to send an optimize message to Solr when you are ready.

See: http://wiki.apache.org/solr/UpdateXmlMessages#Passing_commit_parameters_as_part_of_the_URL

Tom

-----Original Message-----
From: Claudio Devecchi [mailto:cdevec...@gmail.com]
Sent: Friday, November 12, 2010 12:13 PM
To: solr-user@lucene.apache.org
Subject: Re: Doubt about index size

Hi Tom, thanks for your explanation,

Do you recommend the index continues this way? Or can I configure it to optimize automatically?

tks

--
Claudio Devecchi
flickr.com/cdevecchi
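For readers who would rather script this than call curl by hand, here is a minimal sketch in Python of the optimize message Tom describes. It assumes the update handler lives at `http://localhost:8983/solr/update` (the host and port come from the stats URL in this thread, the `/update` path is an assumption based on the stock Solr example setup), and the function names `build_optimize_request` and `send_optimize` are made up for illustration.

```python
import urllib.request

# Assumed update endpoint; adjust to match your own Solr installation.
DEFAULT_UPDATE_URL = "http://localhost:8983/solr/update"


def build_optimize_request(update_url=DEFAULT_UPDATE_URL):
    """Build (but do not send) a POST request carrying an <optimize/> message."""
    return urllib.request.Request(
        update_url,
        data=b"<optimize/>",  # the XML update message body
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def send_optimize(update_url=DEFAULT_UPDATE_URL, timeout=3600):
    """Send the optimize message and return the HTTP status code.

    The long timeout is a guess: an optimize rewrites the index, so it
    can run for a long time on a multi-gigabyte index.
    """
    request = build_optimize_request(update_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only inspect the request here; call send_optimize() against a live Solr.
    req = build_optimize_request()
    print(req.get_method(), req.full_url, req.data.decode())
```

Because the request is only built and printed in the `__main__` block, the script is safe to run without a Solr instance. Calling `send_optimize()` is the step that actually triggers the index rewrite, so it is best scheduled for a quiet period.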
Re: Doubt about index size
It's probably a good idea to optimize. How are you re-indexing anyway? DIH? Custom code? post.jar?

Manual optimizing is just issuing the appropriate curl command, see: http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22

Best
Erick

On Fri, Nov 12, 2010 at 12:13 PM, Claudio Devecchi wrote:

> Hi Tom, thanks for your explanation,
>
> Do you recommend the index continues this way? Or can I configure it to
> optimize automatically?
>
> tks
>
> --
> Claudio Devecchi
> flickr.com/cdevecchi
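One rough way to decide whether an optimize is worth its cost is to look at how much of the index consists of deleted documents, using the numDocs and maxDoc figures reported in this thread. The sketch below is only an illustration: the helper names and the 20% threshold are assumptions, not Solr settings or recommendations from the posters.

```python
def deleted_fraction(num_docs, max_doc):
    """Fraction of the index's documents that are marked as deleted."""
    if max_doc == 0:
        return 0.0  # empty index: nothing to reclaim
    return (max_doc - num_docs) / max_doc


def worth_optimizing(num_docs, max_doc, threshold=0.2):
    """Return True when deleted documents reach the (hypothetical) threshold."""
    return deleted_fraction(num_docs, max_doc) >= threshold


if __name__ == "__main__":
    # Figures from the thread after the full re-index.
    print(deleted_fraction(1120171, 2240342))
    print(worth_optimizing(1120171, 2240342))
```

With the thread's numbers, half of the documents in the index are deleted copies, which matches the index growing from roughly 3 GB to roughly 6 GB.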
Re: Doubt about index size
Hi Tom, thanks for your explanation,

Do you recommend the index continues this way? Or can I configure it to optimize automatically?

tks

On Fri, Nov 12, 2010 at 2:39 PM, Burton-West, Tom wrote:

> Hi Claudio,
>
> What's happening when you re-index the documents is that Solr/Lucene
> implements an update as a delete of the old document plus an add of the
> new one. Because of the nature of inverted indexes, deleting documents
> requires a rewrite of the entire index. In order to avoid rewriting the
> entire index each time one document is deleted, deletes are implemented
> as a list of deleted internal Lucene ids. Documents aren't actually
> removed from the indexes until the index segment is merged or an
> optimize occurs.
>
> maxDoc is the total number of documents indexed, without taking into
> consideration that some of them are marked as deleted.
> numDocs is the actual number of undeleted documents.
>
> If you run an optimize, the index will be rewritten, the index size will
> go down, and numDocs will equal maxDoc.
>
> Tom Burton-West

--
Claudio Devecchi
flickr.com/cdevecchi
RE: Doubt about index size
Hi Claudio,

What's happening when you re-index the documents is that Solr/Lucene implements an update as a delete of the old document plus an add of the new one. Because of the nature of inverted indexes, deleting documents requires a rewrite of the entire index. In order to avoid rewriting the entire index each time one document is deleted, deletes are implemented as a list of deleted internal Lucene ids. Documents aren't actually removed from the indexes until the index segment is merged or an optimize occurs.

maxDoc is the total number of documents indexed, without taking into consideration that some of them are marked as deleted.
numDocs is the actual number of undeleted documents.

If you run an optimize, the index will be rewritten, the index size will go down, and numDocs will equal maxDoc.

Tom Burton-West

-----Original Message-----
From: Claudio Devecchi [mailto:cdevec...@gmail.com]
Sent: Friday, November 12, 2010 10:50 AM
To: Lista Solr
Subject: Doubt about index size

Hi everybody,

I'm doing some indexing tests on Solr 1.4.1 and there is one thing I don't understand. Let me try to explain.

I have 1.2 million XML files and I'm indexing them. When I do it for the first time, my index size is around 3 GB, and in my statistics on http://localhost:8983/solr/admin/stats.jsp I have these two entries:

numDocs : 1120171
maxDoc : 1120171

Up to here everything is all right, but if I do an index update, re-indexing all the same 1120171 documents, I get the stats below:

numDocs : 1120171
maxDoc : 2240342

... and my index size goes to around 6 GB.

Why does this happen? Why does the index size change if I have the same number of searchable docs?

Does anybody know?

Tks
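Tom's explanation can be sketched with a toy model in Python. This is not how Lucene stores segments; `ToyIndex` and its methods are hypothetical names, and the model only mimics the bookkeeping described above: an update marks the old internal id as deleted and appends a new copy, and an optimize rewrites the index without the deleted entries.

```python
class ToyIndex:
    """Toy model of delete-plus-add updates; not Lucene's real data structures."""

    def __init__(self):
        self.slots = []       # one (key, doc) entry per internal doc id
        self.deleted = set()  # internal ids marked as deleted
        self.live = {}        # unique key -> internal id of the live copy

    def update(self, key, doc):
        """Index a document; an existing copy is only marked as deleted."""
        old_id = self.live.get(key)
        if old_id is not None:
            self.deleted.add(old_id)
        self.slots.append((key, doc))
        self.live[key] = len(self.slots) - 1

    @property
    def max_doc(self):
        """All entries ever written, including those marked as deleted."""
        return len(self.slots)

    @property
    def num_docs(self):
        """Entries that are still live (searchable)."""
        return len(self.slots) - len(self.deleted)

    def optimize(self):
        """Rewrite the index, physically dropping the deleted entries."""
        kept = [slot for i, slot in enumerate(self.slots) if i not in self.deleted]
        self.slots = kept
        self.deleted = set()
        self.live = {key: i for i, (key, _) in enumerate(kept)}


if __name__ == "__main__":
    index = ToyIndex()
    for round_number in range(2):      # index the same five documents twice
        for key in range(5):
            index.update(key, f"doc {key} v{round_number}")
    print(index.num_docs, index.max_doc)  # live count vs. total entries
    index.optimize()
    print(index.num_docs, index.max_doc)  # equal again after the rewrite
```

Indexing the same set twice doubles `max_doc` while `num_docs` stays put, the same pattern as 1120171 versus 2240342 in the question, and the two counts only converge again after `optimize()`.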