I'm setting up SolrCloud with around 600 million documents. The basic structure of each document is:
    stories_id: integer, media_id: integer, sentence: text_en

We have a number of stories from different media, and we treat each sentence as a separate document because we need to run sentence-level analytics.

We also have a concept of groups, or sets, of media sources. We've imported this media-source-to-media-set mapping into Solr using the following structure:

    media_id_inner: integer, media_sets_id: integer

For the single-node case, we're able to filter our sources by media_sets_id using a join query like the following:

    http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1

However, this does not work correctly with SolrCloud. The join is performed separately on each shard, and no shard has the complete media-set-to-source mapping, so SolrCloud returns incomplete results.

Since the complete mapping is comparatively small (~50,000 rows), I would like to replicate it on every shard, so that the results of the individual join queries on the separate shards would be equivalent to running the same query on a single-shard system.

However, I can't figure out how to replicate documents across shards. The compositeId router can colocate documents based on a prefix in the document ID, but this isn't what I need. What I would like is some way either to have the mapping data replicated on every shard or to explicitly upload it to the individual shards. (For the rest of the data I like the compositeId autorouting.)

Any suggestions?

--
Thanks,
David
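
P.S. In case the join URL gets mangled by a mail client, here is a minimal Python sketch of how the single-node query can be built with proper URL encoding. The helper name build_join_url is hypothetical, used only for illustration; the base URL and field names are the ones from the message, and nothing here addresses the sharding problem itself.

```python
from urllib.parse import urlencode


def build_join_url(base_url, media_sets_id):
    """Build the single-node join query URL (hypothetical helper, for illustration)."""
    # Local-params join: find mapping documents with the given media_sets_id,
    # then match their media_id_inner against media_id on the sentence documents.
    query = "{!join from=media_id_inner to=media_id}media_sets_id:%d" % media_sets_id
    # urlencode percent-encodes the braces and other reserved characters
    # and turns spaces into '+'.
    return base_url + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_join_url("http://localhost:8983/solr/select", 1))
```

Decoding the q parameter of the resulting URL gives back the same {!join ...} expression as in the query shown in the message.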