On Tue, 2015-01-20 at 15:41 +0100, "Jürgen Wagner (DVT)" wrote:

[Snip: Valid concerns]

> 3. Cardinality: there may be rather large collections and some smaller
> collections in the federation. If you use SolrCloud to obtain results,
> the ones from smaller collections will get more significance in the
> result mixing than the ones from the larger collections, as relevance
> will be relative to each federated source.

The math might be solvable, or at least fuzzy solvable: SOLR-1632 takes
care of unifying term stats, and site-specific boosts, defined in the
merger, can compensate somewhat for overall score adjustments from the
different sites.

> 4. Uniqueness: different systems may index the same documents. The
> idea of having a globally unique identifier should take this into
> account, i.e., it won't suffice to simply prefix each (locally unique)
> document id with a source identifier. The federated sources must be
> aware of being federated and possibly having overlaps. Otherwise, you
> will get multiple occurrences of very popular documents.

Different sources might have different meta-data on the same entity.
Some sort of near-duplicate document merge might be preferable.

>
> 6. Orchestration: there will be some issues with the orchestration of
> these services. Zookeeper won't scale to the multiple datacenter
> topology, effectively leaving node discovery to some other mechanism
> yet to be defined.

If the nodes are locally run proxies, each exposed as a Solr shard, the
connection details will be de-coupled from ZooKeeper. That would also
allow for mapping of field names & values and similar site-specific
adjustments of requests & queries.

> In my experience, there is a clear distinction between "technical"
> federated search (possibly something like the tribe nodes) and
> "semantic" federated search (requiring special processing of results
> obtained from different sources, ready to be consolidated).

We have spent a fair amount of time getting semantic federated search
(we call it "integrated search") to work across our sources.
The raw requesting & merging is not too hard: most of the development
time has been spent mapping values and adjusting how the merger should
order the documents.

- Toke Eskildsen, State and University Library, Denmark
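
To make the merger idea above a bit more concrete, here is a minimal
sketch in Python. It is not our actual implementation and not a Solr
API: the Hit class, the merge_key() heuristic (a normalised title) and
the site_boosts mapping are all hypothetical names chosen for the
illustration. It only shows the two steps discussed: scaling scores
with a per-site boost and folding likely duplicates into one entry
while keeping meta-data from every source.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """One result from one federated site (hypothetical structure)."""
    site: str
    doc_id: str
    score: float
    fields: dict = field(default_factory=dict)


def merge_key(hit):
    # Assumption: a whitespace- and case-normalised title is a good
    # enough duplicate key. Real sources will need something smarter.
    title = " ".join(hit.fields.get("title", "").lower().split())
    return title or f"{hit.site}:{hit.doc_id}"


def merge(hits, site_boosts):
    """Apply per-site boosts, fold duplicates, order by adjusted score."""
    merged = {}
    for hit in hits:
        # Sites without an explicit boost keep their raw score.
        adjusted = hit.score * site_boosts.get(hit.site, 1.0)
        entry = merged.get(merge_key(hit))
        if entry is None:
            merged[merge_key(hit)] = {
                "score": adjusted,
                "sources": [(hit.site, hit.doc_id)],
                "fields": dict(hit.fields),
            }
            continue
        # Duplicate: keep the best score, remember every source, and
        # only fill in meta-data fields that are still missing.
        entry["score"] = max(entry["score"], adjusted)
        entry["sources"].append((hit.site, hit.doc_id))
        for name, value in hit.fields.items():
            entry["fields"].setdefault(name, value)
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)
```

Taking the maximum score for duplicates is just one possible choice;
summing, or preferring a trusted site, would be equally easy to plug in
at that point.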
