Hello Charlie,
  theoretically, things may work as you describe them. A few big
HOWEVERs exist as far as I can see:

1. Attributes: as different organisations may use different schemata
(document attributes), consolidating results from multiple sources may
be a problem. It is less likely to arise with common attributes, for
which there may be some standardization (e.g., the Dublin Core metadata
standard), but it will arise for the very specific attributes that
pertain to the different focal work areas of the institutions running
the individual systems you want to federate.

2. Values: different organisations will work on different topics. There
may be large similarities, but as the staff involved is different, there
will be an inherent difference in the actual semantic domain dealt with.
Consequently, it is very likely that you won't have a homogeneous
ontology for all pieces of information across all federated sources.
This makes it hard to consolidate results in a semantically correct way.

3. Cardinality: there may be some rather large collections and some
smaller collections in the federation. If you use SolrCloud to obtain
results, hits from smaller collections will carry more weight in the
merged result list than hits from larger collections, as relevance is
computed relative to each federated source.

4. Uniqueness: different systems may index the same documents. The idea
of having a globally unique identifier should take this into account,
i.e., it won't suffice to simply prefix each (locally unique) document
id with a source identifier. The federated sources must be aware of
being federated and possibly having overlaps. Otherwise, you will get
multiple occurrences of very popular documents.

5. Security: security in SolrCloud is implemented through filtering. If
you simply use the SolrCloud distributed query mechanism, each source
would have to trust each federation instance to properly enforce
security filters for the respective entitlement groups. If one such
federation system does not comply and simply issues arbitrary,
unfiltered queries, there won't be any security.

6. Orchestration: there will be some issues with the orchestration of
these services. ZooKeeper won't scale to a multi-datacenter topology,
effectively leaving node discovery to some other mechanism yet to be
defined.
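
To make point 4 a bit more concrete, here is a minimal Python sketch of
deriving the global identifier from the document content instead of from
a source prefix, so that the same document indexed by two sources
collapses into one hit. The function names, the normalization and the
hit structure are assumptions made up for illustration; a real
federation would need a more robust notion of "same document".

```python
import hashlib


def global_id(text: str) -> str:
    """Derive an id from whitespace-normalized, lowercased content."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_hits(hits):
    """Keep one entry per content-derived id and record all sources."""
    merged = {}
    for hit in hits:
        gid = global_id(hit["text"])
        entry = merged.setdefault(gid, {"text": hit["text"], "sources": []})
        entry["sources"].append(hit["source"])
    return list(merged.values())


# Hypothetical hits: a prefixed id ("org-a:17" vs. "org-b:993") would
# treat the first two as different documents.
hits = [
    {"source": "org-a", "id": "17", "text": "Annual  Report 2014"},
    {"source": "org-b", "id": "993", "text": "annual report 2014"},
    {"source": "org-b", "id": "994", "text": "Budget Plan"},
]
merged = merge_hits(hits)
```

With this, the first two hits end up as a single entry listing both
sources, which is what the federating layer would need in order to
show a popular document only once.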

These are the issues that quickly come to my mind. There may be more.

Also have a look at tribe nodes in Elasticsearch, although these don't
fully address all issues I listed above.

In my experience, there is a clear distinction between "technical"
federated search (possibly something like the tribe nodes) and
"semantic" federated search (requiring special processing of results
obtained from different sources, ready to be consolidated). FAST Unity
used to have elaborate (but still limited) mechanisms to handle this,
but they disappeared in the course of the Microsoft takeover.
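
As a rough illustration of the "semantic" side, the Python sketch below
maps source-specific attributes onto a shared schema before results are
merged. The source names, the field names and the Dublin Core style
target names (title, creator) are invented for the example. Attributes
without a mapping are kept under a per-source prefix rather than
dropped, which is only one of several possible choices.

```python
# Hypothetical per-source mappings from local field names to shared names.
FIELD_MAPS = {
    "org-a": {"doc_title": "title", "author_name": "creator"},
    "org-b": {"headline": "title", "written_by": "creator"},
}


def to_common_schema(source, doc):
    """Rename mapped fields; prefix unmapped ones with the source name."""
    mapping = FIELD_MAPS.get(source, {})
    out = {"source": source}
    for field, value in doc.items():
        target = mapping.get(field)
        if target is None:
            # No shared equivalent: keep the value, but mark its origin.
            target = f"{source}.{field}"
        out[target] = value
    return out


a = to_common_schema("org-a", {"doc_title": "Annual Report", "reactor_type": "PWR"})
b = to_common_schema("org-b", {"headline": "Budget Plan", "written_by": "J. Doe"})
```

This only covers the attribute names (point 1). Aligning the values
themselves (point 2) would need an ontology mapping on top of it.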

Best regards,
--Jürgen


On 20.01.2015 15:13, Charlie Hull wrote:
> Hi all,
>
> We've been discussing a way of implementing a federated search by
> leveraging the distributed query parts of SolrCloud. I've written this up
> at
> http://www.flax.co.uk/blog/2015/01/20/solr-superclusters-for-improved-federated-search/
> and would welcome any comments or feedback. So far, two committers have
> failed to see any major flaw in our plan, which makes me slightly nervous :)
>
> cheers
>
> Charlie
>


-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: [email protected], URL: www.devoteam.de

------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071

