Denormalize. Add media_set_id to each sentence document. Done. wunder
On Jul 29, 2013, at 7:58 AM, David Larochelle wrote: > I'm setting up SolrCloud with around 600 million documents. The basic > structure of each document is: > > stories_id: integer, media_id: integer, sentence: text_en > > We have a number of stories from different media and we treat each sentence > as a separate document because we need to run sentence level analytics. > > We also have a concept of groups or sets of sources. We've imported this > media source to media sets mapping into Solr using the following structure: > > media_id_inner: integer, media_sets_id: integer > > For the single node case, we're able to filter our sources by media_set_id > using a join query like the following: > > http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1<http://localhost:8983/solr/select?q=%7B!join+from=media_id_inner+to=media_id%7Dmedia_sets_id:1> > > However, this does not work correctly with SolrCloud. The problem is that > the join query is performed separately on each of the shards and no shard > has the complete media set to source mapping data. So SolrCloud returns > incomplete results. > > Since the complete media set to source mapping data is comparatively small > (~50,000 rows), I would like to replicate it on every shard. So that the > results of the individual join queries on separate shards would be > equivalent to performing the same query on a single shard system. > > However, I'm can't figure out how to replicate documents on separate > shards. The compositeID router has the ability to colocate documents based > on a prefix in the document ID but this isn't what I need. What I would > like is some way to either have the media set to source data replicated on > every shard or to be able to explicitly upload this data to the individual > shards. (For the rest of the data I like the compositeID autorouting.) > > Any suggestions? > > -- > > Thanks, > > > David -- Walter Underwood wun...@wunderwood.org