You mean doc A and doc B will become one doc after adding index 2 to
index 1? I don't think this is currently supported either at Lucene
level or at Solr level. If index 1 has m docs and index 2 has n docs,
index 1 will have m+n docs after adding index 2 to index 1. Documents
themselves are not modified by index merge.
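[Editor's note: a minimal sketch of what this union merge looks like at the Lucene level. Directory paths are made up, and the code assumes the Lucene 2.x API that Solr used at the time.]

```java
// Sketch: merging index2 into index1 concatenates documents;
// it does not join them by any key.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeSketch {
    public static void main(String[] args) throws Exception {
        Directory index1 = FSDirectory.getDirectory("/path/to/index1");
        Directory index2 = FSDirectory.getDirectory("/path/to/index2");

        // Open index1 for appending (create=false).
        IndexWriter writer = new IndexWriter(index1, new StandardAnalyzer(), false);

        // Union merge: if index1 had m docs and index2 has n docs,
        // index1 ends up with m + n docs. No documents are combined.
        writer.addIndexesNoOptimize(new Directory[] { index2 });
        writer.close();
    }
}
```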

Cheers,
Ning


On Sat, Apr 25, 2009 at 4:03 PM, Marcus Herou
<marcus.he...@tailsweep.com> wrote:
> Hmm, looking at the code for the index merger in Solr
> (org.apache.solr.update.DirectUpdateHandler2),
>
> I see that IndexWriter.addIndexesNoOptimize(dirs) is used (a union of
> indexes) ?
>
> And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase
> suggests:
> add doc A to index1 with id=AAA,name=core1
> add doc B to index2 with id=BBB,name=core2
> merge the two indexes into one index which then contains both docs.
> The resulting index will have 2 docs.
>
> Great, but in my case I think it should work more like this:
>
> add doc A to index1 with id=X,title=blog entry title,description=blog entry
> description
> add doc B to index2 with id=X,score=1.2
> somehow add index2 to index1 so that id=X has score=1.2 when searching in
> index1.
> The resulting index should have 1 doc.
>
> So this is not really what I want, right?
>
> Sorry for being a smart-ass...
>
> Kindly
>
> //Marcus
>
>
>
>
>
> On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou 
> <marcus.he...@tailsweep.com> wrote:
>
>> Guys!
>>
>> Thanks for these insights. I think we will head for a Lucene-level merging
>> strategy (two or more indexes).
>> When merging, I guess the second index needs to have the same doc IDs
>> somehow. These are internal IDs in Lucene, not that easy to get hold of,
>> right?
>>
>> So you are saying that the Solr ExternalFileField + FunctionQuery stuff
>> would not work very well performance-wise, or what do you mean?
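[Editor's note: for context on the ExternalFileField route mentioned here, it reads per-document values from a flat file in the index data directory and exposes them only through function queries, so no index merge is needed to update scores. A sketch of what the setup might look like, with illustrative field and type names, following the Solr 1.3-era schema conventions; the `pfloat` value type is assumed to be defined elsewhere in the schema:]

```xml
<!-- schema.xml: values come from a file, keyed on the uniqueKey field -->
<fieldType name="externalScore" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="score" type="externalScore"/>
```

The values would live in a file named external_score in the index data directory, one `id=value` line per document (e.g. `X=1.2`); such a field can be used in a FunctionQuery but not searched or faceted on directly.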
>>
>> I sure like the bleeding edge :)
>>
>> Cheers dudes
>>
>> //Marcus
>>
>>
>>
>>
>>
>> On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic <
>> otis_gospodne...@yahoo.com> wrote:
>>
>>>
>>> I should emphasize that the PR trick I mentioned is something you'd do at
>>> the Lucene level, outside Solr, and then you'd just slip the modified index
>>> back into Solr.
>>> Or, if you like the bleeding edge, perhaps you can make use of Ning Li's
>>> Solr index merging functionality (patch in JIRA).
>>>
>>>
>>> Otis --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
>>>
>>> ----- Original Message ----
>>> > From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
>>> > To: solr-user@lucene.apache.org
>>> > Sent: Saturday, April 25, 2009 9:41:45 AM
>>> > Subject: Re: Date faceting - howto improve performance
>>> >
>>> >
>>> > Yes, you could simply round the date; no need for a non-date type field.
>>> > Yes, you can add a field after the fact by making use of ParallelReader
>>> > and merging (I don't recall the details; search the ML for ParallelReader
>>> > and Andrzej, who I remember once provided a working recipe).
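[Editor's note: a rough sketch of the ParallelReader recipe being alluded to, under the Lucene 2.x API. The hard precondition, which the sketch cannot check for you, is that the second index must contain exactly the same documents in exactly the same internal order as the first; paths are made up.]

```java
// Sketch: stitch a new-field-only index onto an existing index,
// then write the combined view out as a single new index.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ParallelMergeSketch {
    public static void main(String[] args) throws Exception {
        // index2 holds only the new field, doc-for-doc parallel to index1.
        ParallelReader pr = new ParallelReader();
        pr.add(IndexReader.open(FSDirectory.getDirectory("/path/to/index1")));
        pr.add(IndexReader.open(FSDirectory.getDirectory("/path/to/index2")));

        // Writing the parallel view produces one index whose documents
        // carry the fields of both inputs.
        Directory merged = FSDirectory.getDirectory("/path/to/merged");
        IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);
        writer.addIndexes(new IndexReader[] { pr });
        writer.close();
        pr.close();
    }
}
```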
>>> >
>>> >
>>> > Otis --
>>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >
>>> >
>>> >
>>> > ----- Original Message ----
>>> > > From: Marcus Herou
>>> > > To: solr-user@lucene.apache.org
>>> > > Sent: Saturday, April 25, 2009 6:54:02 AM
>>> > > Subject: Date faceting - howto improve performance
>>> > >
>>> > > Hi.
>>> > >
>>> > > One of our faceting use-cases:
>>> > > We are creating trend graphs of how many blog posts contain a certain
>>> > > term, grouped by day/week/year etc. with the nice DateMathParser
>>> > > functions.
>>> > >
>>> > > The performance degrades really fast and consumes a lot of memory,
>>> > > which forces an OOM from time to time.
>>> > > We think it is due to the fact that the cardinality of the field
>>> > > publishedDate in our index is huge, almost equal to the number of
>>> > > documents in the index.
>>> > >
>>> > > We need to address that...
>>> > >
>>> > > Some questions:
>>> > >
>>> > > 1. Can a date field have date formats other than the default
>>> > > yyyy-MM-dd HH:mm:ssZ ?
>>> > >
>>> > > 2. We are thinking of adding a field to the index which has the format
>>> > > yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it
>>> > > could perhaps be a string, but the question then is whether faceting
>>> > > can still be used ?
>>> > >
>>> > > 3. Since we already have such a huge index, is there a way to add a
>>> > > field afterwards and apply it to all documents without actually
>>> > > reindexing the whole shebang ?
>>> > >
>>> > > 4. If the field cannot be a string, can we just leave out the
>>> > > hour/minute/second information to reduce the cardinality and improve
>>> > > performance ? Example: 2009-01-01 00:00:00Z
>>> > >
>>> > > 5. I am afraid that we need to reindex everything to get this to work
>>> > > (negating Q3). We currently have 8 shards; what would be the most
>>> > > efficient way to reindex the whole shebang ? Dump the entire database
>>> > > to disk (sigh), create many XML file splits, and use curl in a
>>> > > random/hash(numServers) manner on them ?
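[Editor's note: regarding questions 2 and 4 above, one way to cut the cardinality is to round each timestamp down to midnight on the client before indexing it into a second field. A minimal stdlib sketch; the method name is hypothetical:]

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateRounding {
    // Round a full timestamp down to day granularity (00:00:00Z),
    // so the field holds at most one distinct value per day.
    public static String roundToDay(Date published) {
        SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));
        return day.format(published) + "T00:00:00Z";
    }

    public static void main(String[] args) {
        System.out.println(roundToDay(new Date(0L))); // → 1970-01-01T00:00:00Z
    }
}
```

With every document in a day sharing one value, date faceting over this field touches far fewer distinct terms than faceting over full timestamps.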
>>> > >
>>> > >
>>> > > Kindly
>>> > >
>>> > > //Marcus
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Marcus Herou CTO and co-founder Tailsweep AB
>>> > > +46702561312
>>> > > marcus.he...@tailsweep.com
>>> > > http://www.tailsweep.com/
>>> > > http://blogg.tailsweep.com/
>>>
>>>
>>
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> marcus.he...@tailsweep.com
>> http://www.tailsweep.com/
>> http://blogg.tailsweep.com/
>>
>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> marcus.he...@tailsweep.com
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/
>
