When I need to investigate my index I use Luke; it has saved my bacon lots of times:
http://code.google.com/p/luke/

Many Thanks
Rippo

-----Original Message-----
From: o...@diffdoof.com [mailto:o...@diffdoof.com] On Behalf Of Omri Suissa
Sent: 18 December 2012 15:21
To: Simon Svensson
Cc: user@lucenenet.apache.org
Subject: Re: Why is my index so large?

Hi,
I'm terribly sorry for wasting your time. I found the problem in my file
crawler: I was reading the same document several times, so a 6MB document
became 400MB of text.
Thanks again,
Omri
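(On the fix Omri describes: below is a minimal sketch of one way to guard a
crawler against extracting the same file more than once, assuming the
duplication came from overlapping or repeated crawl roots. Every name in it,
including ExtractText, IndexDocument, and the paths, is an illustrative
placeholder rather than the actual crawler from this thread.)

-------------------------------------------
using System;
using System.Collections.Generic;
using System.IO;

static class CrawlerSketch
{
    // Placeholders standing in for the real IFilter extraction and the
    // real Lucene indexing code.
    static string ExtractText(string path) { return File.ReadAllText(path); }
    static void IndexDocument(string path, string text) { /* ... */ }

    static void Crawl(IEnumerable<string> rootFolders)
    {
        // Remember every path already extracted so each file is read only
        // once, even when the crawl roots overlap or repeat.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (var root in rootFolders)
        foreach (var path in Directory.EnumerateFiles(
                     root, "*.*", SearchOption.AllDirectories))
        {
            if (!seen.Add(path))
                continue; // Add() returns false for an already-seen path

            IndexDocument(path, ExtractText(path));
        }
    }
}
-------------------------------------------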
On Tue, Dec 18, 2012 at 12:16 PM, Simon Svensson <si...@devhost.se> wrote:

> Hi,
>
> Are you able to share those documents with us? Perhaps a giant zip
> archive with both documents and code?
>
> A common problem when checking index sizes is an old, still-open reader
> that locks the old files so they can't be deleted. Do you have any open
> readers? Are you using any specific deletion or merge policies? Can you
> show us the code which creates your IndexWriter instance?
>
> // Simon
>
>
> On 2012-12-18 10:54, Omri Suissa wrote:
>
>> Hi,
>> Sorry for my late response, I'm still struggling with this problem...
>>
>> My code looks like this (item is the document to add to the index,
>> EntityId (int) is the document id):
>> -------------------------------------------
>> Document doc = new Document();
>>
>> doc.Add(new Field("entityId", item.EntityId.ToString(),
>>     Lucene.Net.Documents.Field.Store.YES,
>>     Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
>>
>> doc.Add(new Field("contentMain", item.Content, Field.Store.NO,
>>     Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>>
>> indexWriter.UpdateDocument(
>>     new Term(IndexConfigConsts.FieldName_Main_EntityId,
>>              item.EntityId.ToString()),
>>     doc, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
>> -------------------------------------------
>>
>> No SynonymAnalyzer, very simple... My files are ~150MB in total; my
>> index is ~280MB. Why?
>>
>> Omri Suissa, VP R&D
>> DiffDoof Ltd.
>> Tel: +972 9 7724228 / Cell: +972 54 5395206 / Fax: +972 9 9512577
>> 11 Galgaley Haplada Street, P.O. Box 2150
>> Herzlia Pituach 46120, Israel
>> www.DiffDoof.com
>>
>>
>> On Wed, Dec 12, 2012 at 10:53 AM, Alberto León <leontis...@gmail.com>
>> wrote:
>>
>>> Perhaps you have a SynonymAnalyzer that is adding the synonym tokens
>>> to the index?
>>>
>>>
>>> 2012/12/12 Simon Svensson <si...@devhost.se>
>>>
>>>> Hi,
>>>>
>>>> That 20-30% size figure sounds like a general estimate, and you may
>>>> have specific data that does not conform to it. But it sounds really
>>>> odd to get an index that is 187% of the size of the original data.
>>>>
>>>> Could you show us the code which generates the large index?
>>>>
>>>> // Simon
>>>>
>>>>
>>>> On 2012-12-10 09:27, Omri Suissa wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm trying to index some files on a file server. I built a crawler
>>>>> that runs over the folders and extracts the text (using IFilters)
>>>>> from Office/PDF files.
>>>>>
>>>>> The total size of the files is ~150MB.
>>>>>
>>>>> I do not store the content.
>>>>>
>>>>> I store some additional fields per file.
>>>>>
>>>>> I'm using SnowballAnalyzer (English).
>>>>>
>>>>> As far as I know, a Lucene index should be around 20-30% of the size
>>>>> of the text.
>>>>>
>>>>> When I index the files without indexing the content (only the
>>>>> additional fields), the index size (after optimization) is ~10MB
>>>>> (this is my overhead).
>>>>>
>>>>> When I index the files including the content (but not stored), the
>>>>> index size (after optimization) is ~280MB instead of ~55MB
>>>>> (150*0.3 + 10).
>>>>>
>>>>> Why? :)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Omri
>>>
>>>
>>> --
>>> http://stackoverflow.com/users/690958/alberto-leon
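(On Simon's question above about open readers and the IndexWriter: below is a
minimal sketch, assuming Lucene.Net 3.0.3, of creating a writer and disposing
readers so that obsolete segment files can actually be deleted from disk. The
index path and overall structure are illustrative assumptions, not code from
this thread.)

-------------------------------------------
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

static class IndexLifecycleSketch
{
    static void Main()
    {
        // Illustrative index location.
        var dir = FSDirectory.Open(new DirectoryInfo(@"C:\my-index"));
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // create: false appends to an existing index.
        using (var writer = new IndexWriter(dir, analyzer, false,
                   IndexWriter.MaxFieldLength.UNLIMITED))
        {
            // ... AddDocument/UpdateDocument calls go here ...
            writer.Optimize();
            writer.Commit();
        } // Dispose() releases the write lock.

        // A reader pins the segment files that existed when it was opened.
        // Old segments cannot be deleted from disk, and the index looks
        // bigger than it is, until every reader referencing them is disposed.
        using (var reader = IndexReader.Open(dir, true /* readOnly */))
        {
            // ... searching happens here ...
        } // Disposing the reader lets Lucene delete obsolete segment files.
    }
}
-------------------------------------------

(On older Lucene.Net builds the equivalent cleanup call is Close() rather
than Dispose().)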