Hello Charlie,

theoretically, things may work as you describe them. A few big HOWEVERs exist as far as I can see:

1. Attributes: as different organisations may use different schemata (document attributes), the consolidation of results from multiple sources may present a problem. This may not arise with common attributes (for which there may be a standardization of some sort, e.g., the Dublin Core metadata standard), but it will for very specific attributes that pertain to the different focal work areas of the institutions running the individual systems you want to federate.

2. Values: different organisations will work on different topics. There may be large similarities, but as the staff involved are different, there will be an inherent difference in the actual semantic domain dealt with. Consequently, it is very likely that you won't have a homogeneous ontology for all pieces of information across all federated sources. This makes it hard to consolidate results in a semantically correct way.

3. Cardinality: there may be some rather large collections and some smaller collections in the federation. If you use SolrCloud to obtain results, the ones from smaller collections will get more significance in the result mixing than the ones from the larger collections, as relevance will be relative to each federated source.

4. Uniqueness: different systems may index the same documents. The idea of having a globally unique identifier should take this into account, i.e., it won't suffice to simply prefix each (locally unique) document id with a source identifier. The federated sources must be aware of being federated and of possibly having overlaps. Otherwise, you will get multiple occurrences of very popular documents.

5. Security: security in SolrCloud is through filtering. If you simply use the SolrCloud distributed query mechanism, each source would have to trust each federation instance to properly enforce security filters through the respective entitlement groups. If one such federation system does not comply and simply issues arbitrary queries, there won't be any security.

6. Orchestration: there will be some issues with the orchestration of these services. ZooKeeper won't scale to a multiple-datacenter topology, effectively leaving node discovery to some other mechanism yet to be defined.

These are the issues that quickly come to my mind. There may be more.

Also have a look at tribe nodes in Elasticsearch, although these don't fully address all the issues I listed above. In my experience, there is a clear distinction between "technical" federated search (possibly something like the tribe nodes) and "semantic" federated search (requiring special processing of the results obtained from different sources so that they are ready to be consolidated). FAST Unity used to have elaborate (but still limited) mechanisms to handle this, but they disappeared in the course of the Microsoft takeover.

Best regards,
--Jürgen

On 20.01.2015 15:13, Charlie Hull wrote:
> Hi all,
>
> We've been discussing a way of implementing a federated search by
> leveraging the distributed query parts of SolrCloud. I've written this up
> at
> http://www.flax.co.uk/blog/2015/01/20/solr-superclusters-for-improved-federated-search/
> and would welcome any comments or feedback. So far, two committers have
> failed to see any major flaw in our plan, which makes me slightly nervous :)
>
> cheers
>
> Charlie
>

--
Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence" & Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: [email protected], URL: www.devoteam.de

------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
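
To make point 4 (Uniqueness) above a bit more concrete, here is a minimal sketch of why a source-prefixed id does not catch overlaps, while an id derived from the document content does. Everything in it is an assumption for illustration only: the function names (`prefixed_id`, `content_fingerprint`, `merge_results`), the hit layout (`source`, `id`, `text`, `score`) and the choice of a SHA-256 hash over whitespace- and case-normalized text are hypothetical and not part of SolrCloud or of the plan under discussion. It also assumes scores from different sources are comparable, which point 3 suggests they may not be.

```python
import hashlib
from typing import Iterable


def prefixed_id(source: str, local_id: str) -> str:
    """Naive global id: unique per source, but blind to overlaps."""
    return f"{source}:{local_id}"


def content_fingerprint(text: str) -> str:
    """Hypothetical global id derived from normalized document content."""
    # Assumption: lower-casing and collapsing whitespace is "enough"
    # normalization; real documents would likely need more.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_results(hits: Iterable[dict]) -> list[dict]:
    """Collapse hits sharing a fingerprint, keeping the best-scoring one."""
    best: dict[str, dict] = {}
    for hit in hits:
        key = content_fingerprint(hit["text"])
        # Assumption: scores are comparable across sources (see point 3).
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# Two organisations index the same document under different local ids.
hits = [
    {"source": "orgA", "id": "17", "text": "Solr  in Action", "score": 0.9},
    {"source": "orgB", "id": "a3", "text": "solr in action", "score": 0.7},
    {"source": "orgB", "id": "b9", "text": "Lucene internals", "score": 0.8},
]
merged = merge_results(hits)
```

With prefixed ids, `orgA:17` and `orgB:a3` stay distinct and the same document shows up twice; with the content fingerprint they collapse into one hit. An exact hash only catches identical copies after normalization, so near-duplicates would still need something fuzzier.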
