A join may seem clean, but it will be slow and (currently) doesn't work in a 
cluster.

You find all the sentences in a media set by searching for that set's id and
requesting only the sentence_id (yes, you need a unique id per sentence
document). Then you reindex those documents with the updated mapping. With
small documents like this, it should be fairly fast.
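
For what it's worth, here is a very rough sketch of that find-and-reindex step
in Python, talking plain HTTP to Solr. The host/path, the sentence_id unique
key, and the use of atomic "set" updates (which need the relevant fields
stored) are all assumptions on my part, and I query by the media_id of the
affected sources rather than the set id; adjust to your schema:

    import json
    import requests

    SOLR = "http://localhost:8983/solr"  # assumed; point at your collection

    def sentence_ids_for_media(media_ids):
        """Collect the unique keys of every sentence doc for the given sources."""
        ids = []
        for media_id in media_ids:
            start, rows = 0, 10000
            while True:
                resp = requests.get(SOLR + "/select", params={
                    "q": "media_id:%d" % media_id,
                    "fl": "sentence_id",  # only the unique key, as suggested above
                    "wt": "json",
                    "start": start,
                    "rows": rows,
                }).json()
                docs = resp["response"]["docs"]
                ids.extend(d["sentence_id"] for d in docs)
                start += rows
                if start >= resp["response"]["numFound"]:
                    break
        return ids

    def set_media_sets_id(ids, media_sets_id):
        """Re-tag those docs with the new set id via atomic updates."""
        updates = [{"sentence_id": i, "media_sets_id": {"set": media_sets_id}}
                   for i in ids]
        requests.post(SOLR + "/update?commit=true",
                      data=json.dumps(updates),
                      headers={"Content-Type": "application/json"})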

If you can't estimate how often the media sets will change or the size of the 
changes, then you aren't ready to choose a design.

wunder

On Jul 29, 2013, at 8:41 AM, David Larochelle wrote:

> We'd like to be able to easily update the media set to source mapping. I'm
> concerned that if we store the media_sets_id in the sentence documents, it
> will be very difficult to add additional media set to source mappings. I
> imagine that adding a new media set would either require reimporting all
> 600 million documents or writing complicated application logic to find out
> which sentences to update. Hence joins seem like a cleaner solution.
> 
> --
> 
> David
> 
> 
> On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood
> <wun...@wunderwood.org> wrote:
> 
>> Denormalize. Add media_set_id to each sentence document. Done.
>> 
>> wunder
>> 
>> On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:
>> 
>>> I'm setting up SolrCloud with around 600 million documents. The basic
>>> structure of each document is:
>>> 
>>> stories_id: integer, media_id: integer, sentence: text_en
>>> 
>>> We have a number of stories from different media and we treat each
>>> sentence as a separate document because we need to run sentence level
>>> analytics.
>>> 
>>> We also have a concept of groups or sets of sources. We've imported this
>>> media source to media sets mapping into Solr using the following
>>> structure:
>>> 
>>> media_id_inner: integer, media_sets_id: integer
>>> 
>>> For the single node case, we're able to filter our sources by
>>> media_sets_id using a join query like the following:
>>> 
>>>
>>> http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
>>> 
>>> However, this does not work correctly with SolrCloud. The problem is that
>>> the join query is performed separately on each of the shards and no shard
>>> has the complete media set to source mapping data. So SolrCloud returns
>>> incomplete results.
>>> 
>>> Since the complete media set to source mapping data is comparatively
>>> small (~50,000 rows), I would like to replicate it on every shard, so
>>> that the results of the individual join queries on separate shards
>>> would be equivalent to performing the same query on a single-shard
>>> system.
>>> 
>>> However, I can't figure out how to replicate documents on separate
>>> shards. The compositeId router has the ability to colocate documents
>>> based on a prefix in the document ID, but this isn't what I need. What
>>> I would like is some way to either have the media set to source data
>>> replicated on every shard, or to be able to explicitly upload this data
>>> to the individual shards. (For the rest of the data I like the
>>> compositeId autorouting.)
>>> 
>>> Any suggestions?
>>> 
>>> --
>>> 
>>> Thanks,
>>> 
>>> 
>>> David
>> 
>> --
>> Walter Underwood
>> wun...@wunderwood.org
>> 
>> 
>> 
>> 

--
Walter Underwood
wun...@wunderwood.org


