What if we do not care about the version of a document at index time? For distributed search, we currently decide which documents to aggregate based on their uniqueKey. But what if we decided based on uniqueKey plus indexingDate, so that we only aggregate the most recently indexed version of a document?
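A minimal sketch of that merge step, assuming each shard returns its hits as plain dicts. The field names "id" (standing in for the uniqueKey) and "indexingDate", the shard names, and the function merge_latest are all made up for illustration; this is not how Solr merges shard responses.

```python
from datetime import datetime


def merge_latest(shard_responses):
    """Keep only the most recently indexed version of each document.

    shard_responses maps a shard name to the list of docs it returned.
    Returns a dict: uniqueKey -> (shard that returned the newest version, doc).
    """
    newest = {}
    for shard, docs in shard_responses.items():
        for doc in docs:
            current = newest.get(doc["id"])
            # Replace the stored version only if this one was indexed later.
            if current is None or doc["indexingDate"] > current[1]["indexingDate"]:
                newest[doc["id"]] = (shard, doc)
    return newest


# Hypothetical responses: shardA still holds an older version of doc "x".
responses = {
    "shardA": [{"id": "x", "indexingDate": datetime(2010, 9, 1), "title": "old"}],
    "shardB": [
        {"id": "x", "indexingDate": datetime(2010, 9, 6), "title": "new"},
        {"id": "z", "indexingDate": datetime(2010, 9, 2), "title": "only"},
    ],
}
merged = merge_latest(responses)
```

Here the version of "x" from shardB wins because its indexingDate is later, and "z" passes through untouched since only one shard returned it.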
The concept could look like this: When Solr aggregates the documents for a response, it could record which shard returned an older version of document x. A crawler can then crawl through our SolrCloud, asking each shard whether it noticed a case like "shard y has an older version of doc x". The crawler aggregates this information. After it has finished crawling, it sends delete-by-query requests to those shards that hold older versions of documents than they should. For brevity, I will call such "document versions that are older than the newest version" ODVs (Old Document Versions).

So, what can happen:
Before the crawler can visit shard A - which noticed that shard y stores an ODV of doc x - shard A can go down. That's okay, because either another shard noticed the same thing, or shard A will be available again later. If the information is stored on disk, it will still be available then. If it was stored in RAM, the information is lost... however, you could replicate the information across more than one shard, right? :-)

Another case: shard y can go down - so someone has to take care of storing the noticed ODV information, so that the document can be deleted when shard y comes back.

Pros:
- You can do something like consistent hashing in connection with a concept where each node has to take care of its neighbour nodes. This works because only the neighbour nodes can store ODVs.
- Using the described concept, you can run nightly batches that look for ODVs in the neighbour nodes.
- ODVs are found at request time, so we can avoid returning ODVs instead of newer versions.

Cons:
- We are wasting disk space.
- This works only for smaller clusters, not for large ones where the number of machines changes very frequently.

... this is just another idea - and it is a very, very lazy one.
I must emphasize that I assume that neighbour machines do not go down very frequently.
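The crawler's aggregation step could be sketched roughly like this. Everything here is an assumption for illustration: the report format (tuples of shard, doc id, newest version seen), integer versions standing in for indexingDate, the function build_delete_plan, and the query strings, which only indicate the shape of a delete-by-query and are not sent anywhere.

```python
from collections import defaultdict


def build_delete_plan(reports_by_observer):
    """Aggregate ODV reports from all reachable shards into delete queries.

    reports_by_observer maps an observing shard to a list of
    (shard_holding_odv, doc_id, newest_version_seen) tuples, or to None
    if the observer could not be reached during this crawl.
    Returns a dict: shard holding ODVs -> list of delete query strings.
    """
    plan = defaultdict(dict)  # shard holding ODVs -> {doc_id: newest version}
    for observer, reports in reports_by_observer.items():
        if reports is None:
            # Observer is down: another shard may have noticed the same ODV,
            # or a later crawl will pick up its report.
            continue
        for shard, doc_id, newest in reports:
            known = plan[shard].get(doc_id, newest)
            plan[shard][doc_id] = max(known, newest)
    # One query per document: delete every version older than the newest one.
    return {
        shard: [
            f"id:{doc_id} AND indexingDate:{{* TO {newest}}}"
            for doc_id, newest in sorted(docs.items())
        ]
        for shard, docs in plan.items()
    }


# Hypothetical crawl: shardA is down, shardB and shardC both noticed doc "x".
reports = {
    "shardA": None,
    "shardB": [("shardY", "x", 5)],
    "shardC": [("shardY", "x", 5), ("shardY", "q", 3)],
}
plan = build_delete_plan(reports)
```

Duplicate reports from several observers collapse into a single delete query per document, which is why losing one observer such as shardA does no harm as long as another shard noticed the same ODV.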
Of course, it is not a question whether a machine crashes, but when it crashes - but I assume that the same server does not crash every hour. :-)

Thoughts?

Kind regards


Andrzej Bialecki wrote:
>
> On 2010-09-06 16:41, Yonik Seeley wrote:
>> On Mon, Sep 6, 2010 at 10:18 AM, MitchK<mitc...@web.de> wrote:
>> [...consistent hashing...]
>>> But it doesn't solve the problem at all, correct me if I am wrong, but:
>>> If
>>> you add a new server, let's call him IP3-1, and IP3-1 is nearer to the
>>> current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1
>>> holds the older version.
>>> Am I right?
>>
>> Right. You still need code to handle migration.
>>
>> Consistent hashing is a way for everyone to be able to agree on the
>> mapping, and for the mapping to change incrementally. i.e. you add a
>> node and it only changes the docid->node mapping of a limited percent
>> of the mappings, rather than changing the mappings of potentially
>> everything, as a simple MOD would do.
>
> Another strategy to avoid excessive reindexing is to keep splitting the
> largest shards, and then your mapping becomes a regular MOD plus a list
> of these additional splits. Really, there's an infinite number of ways
> you could implement this...
>
>>
>> For SolrCloud, I don't think we'll end up using consistent hashing -
>> we don't need it (although some of the concepts may still be useful).
>
> I imagine there could be situations where a simple MOD won't do ;) so I
> think it would be good to hide this strategy behind an
> interface/abstract class. It costs nothing, and gives you flexibility in
> how you implement this mapping.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

--
View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p1434329.html
Sent from the Solr - User mailing list archive at Nabble.com.