What if we do not care about the version of a document at index-time?

When it comes to distributed search, we currently aggregate documents
based on their uniqueKey. But what if we aggregated on uniqueKey plus
indexingDate instead, so that only the most recently indexed version of
a document ends up in the response?

The concept could look like this:
When Solr aggregates the documents for a response, it could record which
shard returned an older version of document x.
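
A minimal sketch of that merge step, not actual Solr code. It assumes each
hit arrives as a (shard, uniqueKey, indexingDate) tuple and that
indexingDate is an ISO-8601 UTC string, so plain string comparison orders
versions correctly. The names merge_hits and MergeResult are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MergeResult:
    # uniqueKey -> (shard, indexing_date) of the newest version seen so far
    newest: dict = field(default_factory=dict)
    # (shard, uniqueKey, stale_date) for every older version that was seen
    odv_notes: list = field(default_factory=list)


def merge_hits(hits):
    """Keep the newest version per uniqueKey and note who returned older ones."""
    result = MergeResult()
    for shard, key, date in hits:
        current = result.newest.get(key)
        if current is None:
            result.newest[key] = (shard, date)
        elif date > current[1]:
            # The version we held so far turns out to be the old one.
            result.odv_notes.append((current[0], key, current[1]))
            result.newest[key] = (shard, date)
        elif date < current[1]:
            result.odv_notes.append((shard, key, date))
        # Equal dates on two shards are left alone; this sketch cannot
        # tell which copy should win in that case.
    return result
```

Only result.newest would go into the response; result.odv_notes is what the
aggregating node keeps for later cleanup.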

Now a crawler can walk through our SolrCloud and ask each shard whether
it noticed a case like "shard y returned an older version of doc x".
The crawler aggregates this information. Once it has finished crawling, it
sends delete-by-query requests to those shards that hold older versions of
documents than they should.
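
A rough sketch of the crawler's aggregation step. It assumes every note is
a (stale_shard, uniqueKey, stale_date) tuple with ISO-8601 UTC dates, and
that the schema has fields called id and indexingDate; both field names
are assumptions. The function build_delete_queries is hypothetical and
only builds query strings, it does not talk to Solr.

```python
from collections import defaultdict


def build_delete_queries(notes):
    """Turn collected ODV notes into delete-by-query strings per shard."""
    # For each (shard, key) remember the most recent date known to be stale.
    stalest = {}
    for shard, key, stale_date in notes:
        pair = (shard, key)
        if pair not in stalest or stale_date > stalest[pair]:
            stalest[pair] = stale_date

    queries = defaultdict(list)
    for (shard, key), date in sorted(stalest.items()):
        # Delete this key only up to the stale date, so a newer version
        # that reached the shard in the meantime is not touched.
        queries[shard].append(f'id:"{key}" AND indexingDate:[* TO {date}]')
    return dict(queries)
```

Bounding the delete by date rather than deleting by id alone is a design
choice of this sketch: it keeps the cleanup safe if the shard received a
fresh copy of the document between the search and the crawl.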

I will call document versions that are older than the newest version
ODVs (Old Document Versions) to keep this short.

So, what can happen:
Before the crawler can visit shard A - which noticed that shard y stores an
ODV of doc x - shard A can go down. That's okay, because either another
shard noticed the same thing, or shard A will be available again later on.
If the information is stored on disk, it will still be available. If it
was stored in RAM, the information is lost... however, you could replicate
it across more than one shard, right? :-)

Another case:
Shard y can go down - so someone has to keep the recorded ODV information,
so that the document can be deleted when shard y comes back.

Pros:
- You can do something like consistent hashing in combination with a
concept where each node has to look after its neighbour nodes. This works
because only the neighbour nodes can store ODVs.

- Using the described concept, you can run nightly batches that look for
ODVs on the neighbour nodes.

- ODVs are detected at request time, so we can avoid returning ODVs
instead of newer versions.
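
To illustrate the neighbour argument from the first point, here is a small
ring sketch. It assumes one position per node (no virtual nodes) and uses
MD5 purely as an example hash; Ring, owner, successor and moved_keys are
hypothetical names, not anything from SolrCloud.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a node name or document key to a position on the ring."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # One (position, node) point per node, sorted clockwise.
        self._points = sorted((ring_position(n), n) for n in nodes)

    def owner(self, key):
        """First node at or after the key's position, wrapping around."""
        positions = [p for p, _ in self._points]
        i = bisect.bisect_left(positions, ring_position(key))
        return self._points[i % len(self._points)][1]

    def successor(self, node):
        """Clockwise neighbour of a node."""
        names = [n for _, n in self._points]
        return names[(names.index(node) + 1) % len(names)]


def moved_keys(old_ring, new_ring, keys):
    """Return (key, old_owner, new_owner) for keys whose owner changed."""
    moved = []
    for key in keys:
        before, after = old_ring.owner(key), new_ring.owner(key)
        if before != after:
            moved.append((key, before, after))
    return moved
```

With these assumptions, every key that moves when a node joins was
previously owned by the new node's clockwise neighbour, so that neighbour
is the only place where an ODV of such a key can be left behind.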

Cons:
- We are wasting disk space.

- This works only for smaller clusters, not for large ones where the number
of machines changes very frequently.

... this is just another idea - and it is very very lazy.

I must emphasize that I assume neighbour machines do not go down very
frequently. Of course, the question is not whether a machine crashes, but
when it crashes - but I assume that the same server does not crash every
hour. :-)

Thoughts?

Kind regards


Andrzej Bialecki wrote:
> 
> On 2010-09-06 16:41, Yonik Seeley wrote:
>> On Mon, Sep 6, 2010 at 10:18 AM, MitchK<mitc...@web.de>  wrote:
>> [...consistent hashing...]
>>> But it doesn't solve the problem at all, correct me if I am wrong, but:
>>> If you add a new server, let's call him IP3-1, and IP3-1 is nearer to
>>> the current resource X, then doc x will be indexed at IP3-1 - even if
>>> IP2-1 holds the older version.
>>> Am I right?
>>
>> Right.  You still need code to handle migration.
>>
>> Consistent hashing is a way for everyone to be able to agree on the
>> mapping, and for the mapping to change incrementally.  i.e. you add a
>> node and it only changes the docid->node mapping of a limited percent
>> of the mappings, rather than changing the mappings of potentially
>> everything, as a simple MOD would do.
> 
> Another strategy to avoid excessive reindexing is to keep splitting the 
> largest shards, and then your mapping becomes a regular MOD plus a list 
> of these additional splits. Really, there's an infinite number of ways 
> you could implement this...
> 
>>
>> For SolrCloud, I don't think we'll end up using consistent hashing -
>> we don't need it (although some of the concepts may still be useful).
> 
> I imagine there could be situations where a simple MOD won't do ;) so I 
> think it would be good to hide this strategy behind an 
> interface/abstract class. It costs nothing, and gives you flexibility in 
> how you implement this mapping.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434329.html
Sent from the Solr - User mailing list archive at Nabble.com.
