We'd like to be able to easily update the media set to source mapping. I'm
concerned that if we store the media_sets_id in the sentence documents, it
will be very difficult to add additional media set to source mapping. I
imagine that adding a new media set would either require reimporting all
600 million documents or writing complicated application logic to find out
which sentences to update. Hence joins seem like a cleaner solution.

--

David


On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood <wun...@wunderwood.org>wrote:

> Denormalize. Add media_set_id to each sentence document. Done.
>
> wunder
>
> On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:
>
> > I'm setting up SolrCloud with around 600 million documents. The basic
> > structure of each document is:
> >
> > stories_id: integer, media_id: integer, sentence: text_en
> >
> > We have a number of stories from different media and we treat each
> sentence
> > as a separate document because we need to run sentence level analytics.
> >
> > We also have a concept of groups or sets of sources. We've imported this
> > media source to media sets mapping into Solr using the following
> structure:
> >
> > media_id_inner: integer, media_sets_id: integer
> >
> > For the single node case, we're able to filter our sources by
> media_set_id
> > using a join query like the following:
> >
> >
> http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
> <
> http://localhost:8983/solr/select?q=%7B!join+from=media_id_inner+to=media_id%7Dmedia_sets_id:1
> >
> >
> > However, this does not work correctly with SolrCloud. The problem is that
> > the join query is performed separately on each of the shards and no shard
> > has the complete media set to source mapping data. So SolrCloud returns
> > incomplete results.
> >
> > Since the complete media set to source mapping data is comparatively
> small
> > (~50,000 rows), I would like to replicate it on every shard. So that the
> > results of the individual join queries on separate shards would  be
> > equivalent to performing the same query on a single shard system.
> >
> > However, I'm can't figure out how to replicate documents on separate
> > shards. The compositeID router has the ability to colocate documents
> based
> > on a prefix in the document ID but this isn't what I need. What I would
> > like is some way to either have the media set to source data replicated
> on
> > every shard or to be able to explicitly upload this data to the
> individual
> > shards. (For the rest of the data I like the compositeID autorouting.)
> >
> > Any suggestions?
> >
> > --
> >
> > Thanks,
> >
> >
> > David
>
> --
> Walter Underwood
> wun...@wunderwood.org
>
>
>
>

Reply via email to