Re: Get the new terms of fields since last update

Sujit Pal Fri, 05 Dec 2014 07:59:14 -0800

Hi Ludovic,

A bit late to the party, sorry, but here is a riff on Erik's
idea. Why not store the previous terms in a Bloom filter and, once you get
the terms from this week, check which of them are not in the set? Once you
find the new terms, add them to the Bloom filter. Bloom filters are space
efficient, and by accepting a higher false positive rate you can make one
consume less space (more keys hash to the same bits). A Bloom filter never
returns a false negative, so a "not in the set" answer is always correct;
the cost of a false positive is that a genuinely new term can occasionally
look already seen.
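
To make the idea concrete, here is a rough sketch in Python. It is not
Solr code and not tuned for a billion terms; the class and function names
(BloomFilter, new_terms) are made up for illustration, and the SHA-256
double-hashing scheme is just one reasonable choice.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sketch (hypothetical helper, not a Solr/Lucene API)."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, term: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, term: str) -> None:
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, term: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(term))


def new_terms(seen: BloomFilter, this_week):
    """Return the terms the filter has not seen yet, recording each one as we go."""
    fresh = []
    for term in this_week:
        if term not in seen:
            fresh.append(term)
            seen.add(term)
    return fresh
```

By the usual sizing formula, a billion terms at a 1% false positive rate
needs about 9.6 bits per term (roughly 1.2 GB), and at 10% about 4.8 bits
per term (roughly 0.6 GB). Those figures come from the formula only, not
from a measurement.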


-sujit

On Fri, Dec 5, 2014 at 7:21 AM, lboutros <boutr...@gmail.com> wrote:

> The Apache Solr community is sooo great !
>
> Interesting problem with 3 interesting answers in less than 2 hours !
>
> Thank you all, really.
>
> Erik,
>
> I'm already saving the billion terms each week. It's hard to diff 1
> billion terms.
> I'm already rebuilding the whole dictionaries each week in a custom
> distributed terms query handler.
>
> I'm saving the result in MongoDB in order to scroll through it quickly with
> term position in the dictionary.
>
> It takes 3-4 hours each week. Now I would like to update the result in
> order to do it faster.
>
> Alex, I will check, this seems to be a good idea.
> Is it possible to filter terms with payloads in index readers ? I did not
> see anything like that in my first investigation.
> I suppose it would take some additional disk space.
>
> Michael,
>
> this is the easiest way to do it. You are right. But I'm not sure that
> indexing twice and updating the dictionaries would be faster than the current
> process. But it is worth doing some math ;)
>
> Ludovic.
>
> -----
> Jouve
> France.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Get-the-new-terms-of-fields-since-last-update-tp4172755p4172785.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
