I'm setting up SolrCloud with around 600 million documents. The basic
structure of each document is:

stories_id: integer, media_id: integer, sentence: text_en

We have a number of stories from different media, and we treat each sentence
as a separate document because we need to run sentence-level analytics.

We also have a concept of groups or sets of sources. We've imported this
media source to media sets mapping into Solr using the following structure:

media_id_inner: integer, media_sets_id: integer

For the single-node case, we're able to filter our sources by media_sets_id
using a join query like the following:

http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
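
For anyone who wants to reproduce this from a script, here is a minimal Python sketch that builds the same request URL with proper percent-encoding. The helper name build_join_url is my own and not part of any Solr client; the base URL is just the single-node address from above.

```python
from urllib.parse import urlencode


def build_join_url(base_url, media_sets_id):
    """Build a select URL for the media-set join (hypothetical helper)."""
    # Local-params join: find mapping docs with the given media_sets_id,
    # take their media_id_inner values, and return docs whose media_id matches.
    query = "{!join from=media_id_inner to=media_id}media_sets_id:%d" % media_sets_id
    # urlencode percent-encodes the braces and turns spaces into '+'.
    return base_url + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_join_url("http://localhost:8983/solr/select", 1))
```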

However, this does not work correctly with SolrCloud. The problem is that
the join query is performed separately on each of the shards and no shard
has the complete media set to source mapping data. So SolrCloud returns
incomplete results.
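
To make the failure mode concrete, here is a toy Python model of what I believe is happening. It is not Solr code: the shard contents are made up, and join_on_shard only imitates a join that can see the documents of one shard.

```python
def join_on_shard(shard_docs, set_id):
    """Imitate the join using only the documents present on one shard."""
    # "from" side: media_id_inner values of mapping docs in the requested set.
    media_ids = {d["media_id_inner"] for d in shard_docs
                 if d.get("media_sets_id") == set_id}
    # "to" side: sentence docs whose media_id is in that set of values.
    return [d for d in shard_docs if d.get("media_id") in media_ids]


# Made-up data: each shard holds only part of the mapping.
shard1 = [
    {"media_id_inner": 10, "media_sets_id": 1},
    {"stories_id": 100, "media_id": 10, "sentence": "a"},
    {"stories_id": 101, "media_id": 20, "sentence": "b"},
]
shard2 = [
    {"media_id_inner": 20, "media_sets_id": 1},
    {"stories_id": 102, "media_id": 10, "sentence": "c"},
    {"stories_id": 103, "media_id": 20, "sentence": "d"},
]

distributed = join_on_shard(shard1, 1) + join_on_shard(shard2, 1)
single_node = join_on_shard(shard1 + shard2, 1)

if __name__ == "__main__":
    # The per-shard joins miss stories 101 and 102.
    print(len(distributed), len(single_node))  # prints: 2 4
```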

Since the complete media set to source mapping data is comparatively small
(~50,000 rows), I would like to replicate it on every shard, so that the
results of the individual join queries on separate shards would be
equivalent to performing the same query on a single-shard system.
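
Here is a small Python sketch of the outcome I'm after, again as a toy model rather than Solr code. The data and the replicate_mapping helper are invented for illustration; the point is only that once every shard carries the full mapping, the union of the per-shard joins matches the single-node join.

```python
def join_on_shard(shard_docs, set_id):
    """Imitate the join using only the documents present on one shard."""
    media_ids = {d["media_id_inner"] for d in shard_docs
                 if d.get("media_sets_id") == set_id}
    return [d for d in shard_docs if d.get("media_id") in media_ids]


def replicate_mapping(sentence_shards, mapping_docs):
    """Give every shard a full copy of the mapping plus its own sentences."""
    return [mapping_docs + sentences for sentences in sentence_shards]


# Made-up data: a tiny mapping and two shards of sentence documents.
mapping = [
    {"media_id_inner": 10, "media_sets_id": 1},
    {"media_id_inner": 20, "media_sets_id": 1},
    {"media_id_inner": 30, "media_sets_id": 2},
]
sentence_shards = [
    [{"stories_id": 100, "media_id": 10}, {"stories_id": 101, "media_id": 30}],
    [{"stories_id": 102, "media_id": 20}, {"stories_id": 103, "media_id": 10}],
]

shards = replicate_mapping(sentence_shards, mapping)
per_shard = [d for shard in shards for d in join_on_shard(shard, 1)]
all_sentences = [d for shard in sentence_shards for d in shard]
single_node = join_on_shard(mapping + all_sentences, 1)

if __name__ == "__main__":
    same = (sorted(d["stories_id"] for d in per_shard)
            == sorted(d["stories_id"] for d in single_node))
    print(same)  # prints: True
```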

However, I can't figure out how to replicate documents on separate
shards. The compositeId router can colocate documents based on a prefix
in the document ID, but this isn't what I need. What I would like is some
way either to have the media set to source data replicated on every shard
or to explicitly upload this data to the individual shards. (For the rest
of the data I like the compositeId autorouting.)
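
To illustrate why prefix routing doesn't help here, below is a toy Python sketch of prefix-based colocation. It is only a stand-in: Solr's router uses its own hash function, and the route helper, the CRC32 hash, and the example IDs are all my assumptions. The idea it shows is that IDs sharing the prefix before "!" go to the same shard, which means any given mapping document still ends up on exactly one shard rather than on all of them.

```python
import zlib


def route(doc_id, num_shards):
    """Toy prefix router: pick a shard from the part of the ID before '!'."""
    # If there is no '!', the whole ID is hashed.
    prefix = doc_id.split("!", 1)[0]
    # CRC32 is a deterministic stand-in hash, not what Solr uses.
    return zlib.crc32(prefix.encode("utf-8")) % num_shards


if __name__ == "__main__":
    # Hypothetical IDs: sentences of one media source share a prefix.
    for doc_id in ["media10!100", "media10!101", "media20!102", "mapping-1"]:
        print(doc_id, "-> shard", route(doc_id, 4))
```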

Any suggestions?

--

Thanks,


David
