Phil,

Note that adding the same document multiple times and looking at the index size is not a very good way to estimate it. You are adding the same fixed set of distinct terms over and over. In a real-life scenario you will have a much wider distribution of terms, and that will affect index size.
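To illustrate the point about term distribution, here is a rough sketch using a toy in-memory inverted index. It is not Lucene and says nothing about actual bytes on disk; the vocabulary size (5000), terms per document (200), and the helper `index_stats` are made-up assumptions for illustration only. It just counts distinct terms and postings for 2000 copies of one document versus 2000 documents with varied terms.

```python
import random

def index_stats(docs):
    """Build a toy inverted index; return (distinct_terms, total_postings)."""
    postings = {}
    for doc_id, text in enumerate(docs):
        # One posting per (term, document) pair, ignoring term frequency.
        for term in set(text.split()):
            postings.setdefault(term, []).append(doc_id)
    distinct_terms = len(postings)
    total_postings = sum(len(ids) for ids in postings.values())
    return distinct_terms, total_postings

# Hypothetical numbers, chosen only to make the contrast visible.
rng = random.Random(42)
vocab = [f"term{i}" for i in range(5000)]

# Case 1: the same 200-term document added 2000 times.
same_doc = " ".join(rng.sample(vocab, 200))
repeated = [same_doc] * 2000

# Case 2: 2000 documents, each with its own 200 terms drawn from the vocabulary.
varied = [" ".join(rng.sample(vocab, 200)) for _ in range(2000)]

repeated_terms, repeated_postings = index_stats(repeated)
varied_terms, varied_postings = index_stats(varied)

print("repeated:", repeated_terms, "distinct terms,", repeated_postings, "postings")
print("varied:  ", varied_terms, "distinct terms,", varied_postings, "postings")
```

Both cases produce the same number of postings, but the repeated case never grows its term dictionary beyond the 200 terms of the one document, while the varied case does. A per-document size measured from the repeated case therefore leaves out term dictionary growth and may underestimate a real index.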
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: philmccarthy <philmccar...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, January 14, 2009 7:36:38 PM
> Subject: Re: Indexing the same data in many records
>
> Thanks Otis. I tweaked the Solr example app a little and then uploaded a
> ~55KB document to it a couple of thousand times (changing the ID each time).
> The solr/data directory was 72MB on disc after adding the document 2000
> times, so it seems that the index is growing by approximately 36KB for each
> document. That seems reasonable.
>
> I guess I need to do some research into expected data volumes now, and
> limits on Lucene index size.
>
> Cheers,
> Phil
>
>
> Otis Gospodnetic wrote:
> >
> > Phil,
> >
> > From what you described so far, I don't see any red flags. I would pay
> > attention to reading those timestamps (covered on the Wiki and ML
> > archives), that's all.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> >> From: philmccarthy
> >> To: solr-user@lucene.apache.org
> >> Sent: Tuesday, January 13, 2009 8:49:33 PM
> >> Subject: Indexing the same data in many records
> >>
> >> Hi,
> >>
> >> I'd like to use Solr to index some webserver logs, in order to allow easy
> >> ad-hoc querying and analysis. Each Solr Document will represent a single
> >> request to the webserver, with fields for time, request URL, referring URL
> >> etc.
> >>
> >> I'm also planning to fetch the page source of each referring URL, and add
> >> that as an indexed field in the Solr document. The aim is to allow queries
> >> like "find hits to /xyz.html where the referring page contains the word
> >> 'foobar'".
> >>
> >> Since hundreds or even thousands of hits may all come from the same
> >> referring page, would this approach be horribly inefficient? (Note the page
> >> source won't be stored in each Document, just indexed). Am I going to
> >> dramatically increase the index size if I do this?
> >>
> >> If so, is there a more elegant way to do what I want?
> >>
> >> Many thanks,
> >> Phil
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21448465.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
> --
> View this message in context:
> http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21468706.html
> Sent from the Solr - User mailing list archive at Nabble.com.