Re: Indexing the same data in many records

Otis Gospodnetic Wed, 14 Jan 2009 19:35:39 -0800

Phil,

Note that adding the same document multiple times and looking at the index size 
is not a very good approach.  You are adding a fixed number of distinct terms 
over and over.  In a real-life scenario you will have a much wider distribution 
of terms, and that will affect index size.
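
To illustrate, here is a minimal sketch using a toy in-memory inverted index in 
Python. It is not Lucene's actual index format, and the function names and 
sample texts are made up for illustration. It only shows that repeating one 
document keeps the number of distinct terms fixed, while varied documents make 
the term dictionary grow.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index: term -> list of document ids."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Record each distinct term once per document.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return index


def index_stats(docs):
    """Return (distinct terms, total postings) for a list of documents."""
    index = build_index(docs)
    postings = sum(len(ids) for ids in index.values())
    return len(index), postings


# Hypothetical sample data: the same text repeated vs. varied texts.
same = ["the quick brown fox"] * 4
varied = [
    "the quick brown fox",
    "a lazy dog sleeps",
    "solr indexes webserver logs",
    "referring page contains foobar",
]

print(index_stats(same))    # term dictionary stays at 4 distinct terms
print(index_stats(varied))  # term dictionary grows with each new document
```

Both inputs produce the same number of postings, but only the varied set adds 
new terms, so a size measured on repeated documents will likely underestimate 
what real data needs.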


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: philmccarthy <philmccar...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, January 14, 2009 7:36:38 PM
> Subject: Re: Indexing the same data in many records
> 
> 
> Thanks Otis. I tweaked the Solr example app a little and then uploaded a
> ~55KB document to it a couple of thousand times (changing the ID each time).
> The solr/data directory was 72MB on disc after adding the document 2000
> times, so it seems that the index is growing by approximately 36KB for each
> document. That seems reasonable.
> 
> I guess I need to do some research into expected data volumes now, and
> limits on Lucene index size.
> 
> Cheers,
> Phil
> 
> 
> Otis Gospodnetic wrote:
> > 
> > Phil,
> > 
> > From what you described so far, I don't see any red flags.  I would pay
> > attention to reading those timestamps (covered on the Wiki and ML
> > archives), that's all.
> > 
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > 
> > ----- Original Message ----
> >> From: philmccarthy 
> >> To: solr-user@lucene.apache.org
> >> Sent: Tuesday, January 13, 2009 8:49:33 PM
> >> Subject: Indexing the same data in many records
> >> 
> >> 
> >> Hi,
> >> 
> >> I'd like to use Solr to index some webserver logs, in order to allow easy
> >> ad-hoc querying and analysis. Each Solr Document will represent a single
> >> request to the webserver, with fields for time, request URL, referring
> >> URL, etc.
> >> 
> >> I'm also planning to fetch the page source of each referring URL, and add
> >> that as an indexed field in the Solr document. The aim is to allow queries
> >> like "find hits to /xyz.html where the referring page contains the word
> >> 'foobar'".
> >> 
> >> Since hundreds or even thousands of hits may all come from the same
> >> referring page, would this approach be horribly inefficient? (Note the page
> >> source won't be stored in each Document, just indexed). Am I going to
> >> dramatically increase the index size if I do this?
> >> 
> >> If so, is there a more elegant way to do what I want?
> >> 
> >> Many thanks,
> >> Phil
> >> 
> >> 
> >> 
> >> -- 
> >> View this message in context: 
> >> http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21448465.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> > 
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21468706.html
> Sent from the Solr - User mailing list archive at Nabble.com.
