On Tue, 2015-01-20 at 15:41 +0100, "Jürgen Wagner (DVT)" wrote:

[Snip: Valid concerns]

> 3. Cardinality: there may be rather large collections and some smaller
> collections in the federation. If you use SolrCloud to obtain results,
> the ones from smaller collections will get more significance in the
> result mixing than the ones from the larger collections, as relevance
> will be relative to each federated source.

The math might be solvable, or at least approximately solvable:
SOLR-1632 takes care of unifying term stats, and site-specific boosts,
defined in the merger, can compensate somewhat for overall score
differences between the sites.
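
As a rough illustration of the site-specific boost idea, here is a
minimal sketch of a merger step. It is not our actual merger: the
function name, the boost table and its values are all assumptions made
up for the example, and it presumes the per-site scores are already
roughly comparable (e.g. because term stats have been unified).

```python
from typing import Dict, List, Tuple

# Hypothetical per-site boost factors; the values are illustrative only.
SITE_BOOSTS: Dict[str, float] = {"site_a": 1.0, "site_b": 0.6}


def merge_results(
    results_by_site: Dict[str, List[Tuple[str, float]]],
    boosts: Dict[str, float],
    rows: int = 10,
) -> List[Tuple[str, str, float]]:
    """Merge per-site (doc_id, score) lists into one ranked list.

    Each score is multiplied by the boost of the site it came from;
    sites without an entry in `boosts` get a neutral boost of 1.0.
    """
    merged = []
    for site, docs in results_by_site.items():
        boost = boosts.get(site, 1.0)
        for doc_id, score in docs:
            merged.append((site, doc_id, score * boost))
    # Highest adjusted score first, then cut to the requested page size.
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:rows]
```

With a boost below 1.0 for a small collection, its top hit no longer
automatically outranks the top hits from the larger collections.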

> 4. Uniqueness: different systems may index the same documents. The
> idea of having a globally unique identifier should take this into
> account, i.e., it won't suffice to simply prefix each (locally unique)
> document id with a source identifier. The federated sources must be
> aware of being federated and possibly having overlaps. Otherwise, you
> will get multiple occurrences of very popular documents.

Different sources might have different meta-data on the same entity.
Some sort of nearly-duplicate-document-merge might be preferable.
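
A minimal sketch of what such a merge could look like. The matching
rule (normalized title) is purely an assumption for the example; real
near-duplicate detection would need something more robust. Field names
such as "title", "source" and "score" are hypothetical too.

```python
from typing import Any, Dict, List


def normalize_key(doc: Dict[str, Any]) -> str:
    # Assumption: a lower-cased, whitespace-collapsed title is good
    # enough to identify the same entity across sources.
    return " ".join(str(doc.get("title", "")).lower().split())


def merge_near_duplicates(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse documents with the same normalized title into one entry.

    The merged entry keeps the highest score, lists every contributing
    source, and takes each metadata field from the first document that
    provides it.
    """
    merged: Dict[Any, Dict[str, Any]] = {}
    for index, doc in enumerate(docs):
        # Documents without a title are never merged with each other.
        key = normalize_key(doc) or ("untitled", index)
        entry = merged.setdefault(key, {"sources": [], "score": 0.0})
        entry["sources"].append(doc["source"])
        entry["score"] = max(entry["score"], doc.get("score", 0.0))
        for field, value in doc.items():
            if field not in ("source", "score"):
                entry.setdefault(field, value)
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)
```

The point is that the merged entry carries the union of the meta-data
instead of the user seeing the same document twice.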
> 
> 6. Orchestration: there will be some issues with the orchestration of
> these services. Zookeeper won't scale to the multiple datacenter
> topology, effectively leaving node discovery to some other mechanism
> yet to be defined.

If the nodes are locally run proxies exposed as a Solr shard, the
connection details will be de-coupled from ZooKeeper. That would also
allow for mapping of field names & values and similar site-specific
adjustments of requests & queries.
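
To make the mapping part concrete, a small sketch of what such a proxy
could do on the way in and on the way out. The field names and value
tables are invented for the example, and only the field list of a
request is rewritten here; rewriting the query itself is left out.

```python
from typing import Any, Dict

# Hypothetical mapping from the common schema to one site's schema.
FIELD_MAP: Dict[str, str] = {"title": "dc_title", "author": "creator"}
# Hypothetical per-field value translations, keyed by common field name.
VALUE_MAP: Dict[str, Dict[str, str]] = {"lang": {"da": "Danish"}}


def map_field_list(fl: str, field_map: Dict[str, str]) -> str:
    """Translate a comma-separated field list to the site's field names."""
    names = [name.strip() for name in fl.split(",") if name.strip()]
    return ",".join(field_map.get(name, name) for name in names)


def map_document(
    doc: Dict[str, Any],
    field_map: Dict[str, str],
    value_map: Dict[str, Dict[str, str]],
) -> Dict[str, Any]:
    """Translate one site document back to the common schema."""
    reverse = {site: common for common, site in field_map.items()}
    out: Dict[str, Any] = {}
    for field, value in doc.items():
        common = reverse.get(field, field)
        # Only plain string values are translated; others pass through.
        if isinstance(value, str):
            value = value_map.get(common, {}).get(value, value)
        out[common] = value
    return out
```

Since the proxy owns this table, the merger only ever sees the common
field names and values.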

> In my experience, there is a clear distinction between "technical" 
> federated search (possibly something like the tribe nodes) and 
> "semantic" federated search (requiring special processing of results 
> obtained from different sources, ready to be consolidated).

We have spent a fair amount of time getting semantic federated search
(we call it "integrated search") to work across our sources. The raw
requesting & merging is not too hard: most of the development time has
been spent mapping values and adjusting how the merger should order the
documents.

- Toke Eskildsen, State and University Library, Denmark


