Denormalize. Add media_set_id to each sentence document. Done.
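
A minimal sketch of what that denormalization could look like at index time, assuming the mapping rows are available as dictionaries before the sentence documents are sent to Solr. The helper names (`build_media_sets_index`, `denormalize`) and the sample values are hypothetical. The multi-valued field name `media_sets_id` is borrowed from the mapping structure in the question, and it would need to be declared as a multi-valued integer field in the schema.

```python
from collections import defaultdict


def build_media_sets_index(mapping_rows):
    """Map each media_id to the sorted list of media sets it belongs to."""
    index = defaultdict(set)
    for row in mapping_rows:
        index[row["media_id_inner"]].add(row["media_sets_id"])
    return {media_id: sorted(set_ids) for media_id, set_ids in index.items()}


def denormalize(sentence_doc, media_sets_index):
    """Return a copy of the sentence document with its media set IDs attached."""
    doc = dict(sentence_doc)
    # Sources that belong to no set get an empty list.
    doc["media_sets_id"] = media_sets_index.get(doc["media_id"], [])
    return doc


# Hypothetical sample data standing in for the ~50,000 mapping rows.
mapping_rows = [
    {"media_id_inner": 10, "media_sets_id": 1},
    {"media_id_inner": 10, "media_sets_id": 2},
    {"media_id_inner": 11, "media_sets_id": 1},
]
index = build_media_sets_index(mapping_rows)

sentence = {"stories_id": 500, "media_id": 10, "sentence": "An example sentence."}
doc = denormalize(sentence, index)
```

With the set IDs stored on every sentence document, the filter becomes a plain field query such as q=media_sets_id:1, which each shard can answer from its own documents without a join. The trade-off is that changing a source's set membership means reindexing that source's sentences.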


On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:

> I'm setting up SolrCloud with around 600 million documents. The basic
> structure of each document is:
> stories_id: integer, media_id: integer, sentence: text_en
> We have a number of stories from different media and we treat each sentence
> as a separate document because we need to run sentence level analytics.
> We also have a concept of groups or sets of sources. We've imported this
> media source to media sets mapping into Solr using the following structure:
> media_id_inner: integer, media_sets_id: integer
> For the single node case, we're able to filter our sources by media_sets_id
> using a join query like the following:
> http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
> However, this does not work correctly with SolrCloud. The problem is that
> the join query is performed separately on each of the shards and no shard
> has the complete media set to source mapping data. So SolrCloud returns
> incomplete results.
> Since the complete media set to source mapping data is comparatively small
> (~50,000 rows), I would like to replicate it on every shard, so that the
> results of the individual join queries on separate shards would be
> equivalent to performing the same query on a single shard system.
> However, I can't figure out how to replicate documents on separate
> shards. The compositeID router has the ability to colocate documents based
> on a prefix in the document ID but this isn't what I need. What I would
> like is some way to either have the media set to source data replicated on
> every shard or to be able to explicitly upload this data to the individual
> shards. (For the rest of the data I like the compositeID autorouting.)
> Any suggestions?
> --
> Thanks,
> David

Walter Underwood
