We'd like to be able to update the media-set-to-source mapping easily. My concern is that if we store media_sets_id in the sentence documents, it will be very difficult to add further media-set-to-source mappings. I imagine that adding a new media set would require either reimporting all 600 million documents or writing complicated application logic to work out which sentences to update. Hence joins seem like a cleaner solution.
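
To make the trade-off concrete, here is a rough Python sketch of what adding a media set looks like with the join approach. The function names and the example media IDs are made up for illustration; the field names and base URL are the ones from my original message. Nothing here talks to Solr, it only builds the mapping documents and the query URL, and a real schema would presumably also need a unique key field, which I've left out.

```python
from urllib.parse import urlencode

# Hypothetical helper names; the field names (media_id_inner, media_sets_id,
# media_id) are the ones from the mapping structure in the original message.

def mapping_docs_for_media_set(media_sets_id, media_ids):
    """Build one small mapping document per source in the media set."""
    return [
        {"media_id_inner": int(media_id), "media_sets_id": int(media_sets_id)}
        for media_id in media_ids
    ]

def join_query_url(media_sets_id, base_url="http://localhost:8983/solr/select"):
    """Build a URL that filters sentence documents by media set via a join."""
    query = "{!join from=media_id_inner to=media_id}media_sets_id:%d" % int(media_sets_id)
    # urlencode percent-encodes the local-params braces and turns spaces into '+'.
    return base_url + "?" + urlencode({"q": query})

# Adding a new media set (id 2, with made-up source ids) only means indexing
# these three small documents; the sentence documents are not touched.
new_docs = mapping_docs_for_media_set(2, [10, 11, 12])
print(new_docs)
print(join_query_url(2))
```

With media_sets_id denormalized into the sentence documents instead, the same change would mean updating every sentence whose media_id is 10, 11, or 12.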

--
David

On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood <wun...@wunderwood.org> wrote:

> Denormalize. Add media_set_id to each sentence document. Done.
>
> wunder
>
> On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:
>
> > I'm setting up SolrCloud with around 600 million documents. The basic
> > structure of each document is:
> >
> > stories_id: integer, media_id: integer, sentence: text_en
> >
> > We have a number of stories from different media and we treat each sentence
> > as a separate document because we need to run sentence level analytics.
> >
> > We also have a concept of groups or sets of sources. We've imported this
> > media source to media sets mapping into Solr using the following structure:
> >
> > media_id_inner: integer, media_sets_id: integer
> >
> > For the single node case, we're able to filter our sources by media_set_id
> > using a join query like the following:
> >
> > http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
> >
> > However, this does not work correctly with SolrCloud. The problem is that
> > the join query is performed separately on each of the shards and no shard
> > has the complete media set to source mapping data. So SolrCloud returns
> > incomplete results.
> >
> > Since the complete media set to source mapping data is comparatively small
> > (~50,000 rows), I would like to replicate it on every shard. So that the
> > results of the individual join queries on separate shards would be
> > equivalent to performing the same query on a single shard system.
> >
> > However, I can't figure out how to replicate documents on separate
> > shards. The compositeID router has the ability to colocate documents based
> > on a prefix in the document ID but this isn't what I need. What I would
> > like is some way to either have the media set to source data replicated on
> > every shard or to be able to explicitly upload this data to the individual
> > shards. (For the rest of the data I like the compositeID autorouting.)
> >
> > Any suggestions?
> >
> > --
> >
> > Thanks,
> >
> > David
>
> --
> Walter Underwood
> wun...@wunderwood.org