Re: Distributed Search

asdf qwerty Wed, 04 Mar 2009 23:28:50 -0800

> : > > Ok, so it wouldn't be possible to have a smaller, faster authoritative
> : > > shard for near-real-time updates while keeping the entire dataset in a
> : > > second shard which is updated less frequently?
>
> I believe Otis's point is that many people use distributed search across
> shards where some are large and mostly static and one is small and
> frequently updated with new docs in order to get some performance
> advantages out of the long cache lives on the larger shard(s) ... but this
> typically works best when you only "add" new docs, and don't modify old
> ones (or only modify docs added very recently so they're always in the
> small shard) while the bigger shards are treated as "archives" that don't
> change.
>
> To be deterministic you can't have the same uniqueKey in multiple shards.


Hmm, partitioning by document has a lot of merit, but having this be
(configurably) deterministic would seem to enable some interesting
features, such as simple 'tagging' by partitioning by document fields.

For example, you could have a large essentially read-only index of
documents and a separate small index for tags.  To tag a document, you
would create (or update) a document in the tag index containing the
uniqueKey from the main index as well as a multivalued tag field, and
whenever you search, you fire off a distributed search across the two
shards, but pulling the fields from the main index (e.g.
/solr/select?fq=tag1&shards=main_index/path,tag_index/path&q=*:*).
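
To make the request above concrete, here is a minimal Python sketch that
assembles the same kind of URL.  The helper name build_tag_query is made up
for illustration, and the shard paths are the placeholders from the example,
not real locations:

```python
from urllib.parse import urlencode


def build_tag_query(base_url, shards, tag_filter, query="*:*"):
    """Build a distributed select URL spanning the given shards.

    Hypothetical helper: the name and signature are illustrative only.
    """
    params = [
        ("fq", tag_filter),            # filter, intended to match the tag index
        ("shards", ",".join(shards)),  # comma-separated shard list
        ("q", query),                  # main query
    ]
    # Leave '*', ':', '/' and ',' unescaped so the URL stays readable.
    return base_url + "?" + urlencode(params, safe="*:/,")


if __name__ == "__main__":
    print(build_tag_query(
        "/solr/select",
        ["main_index/path", "tag_index/path"],
        "tag1",
    ))
```

Listing the main index first only matters if the shard order is given a
meaning, which is part of what I'm asking about.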

My specific use case is a bit more involved, but if there were either
some way to deterministically pick the shard source *or* to
dynamically (additively) merge the multiple docs sharing the same
uniqueKey from separate shards, it would be quite helpful.  The latter
would provide the general case functionality to have partial document
updates, except even more powerful.  However, I could get by with just
the former - using the main index for all scoring but being able to
augment documents for filtering.
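
To illustrate what I mean by an additive merge, here is a rough client-side
sketch in Python.  It is not how Solr merges shard responses; it assumes each
shard's results are already available as a list of dicts, that the uniqueKey
field is called "id", and that the first shard to supply a field wins:

```python
def merge_by_unique_key(shard_results, key="id"):
    """Additively merge docs that share the same unique key.

    shard_results: list of per-shard doc lists, in priority order.
    A field already set by an earlier shard is never overwritten.
    """
    merged = {}
    for docs in shard_results:
        for doc in docs:
            target = merged.setdefault(doc[key], {})
            for field, value in doc.items():
                # Only add fields the merged doc does not have yet.
                target.setdefault(field, value)
    # dicts keep insertion order, so docs come back in first-seen order.
    return list(merged.values())


if __name__ == "__main__":
    main_docs = [{"id": "1", "title": "a"}]
    tag_docs = [{"id": "1", "tag": ["x", "y"]}, {"id": "2", "tag": ["z"]}]
    print(merge_by_unique_key([main_docs, tag_docs]))
```

With the main index listed first, its fields take precedence and the tag
index only contributes fields the main document lacks.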

I'm not a Solr expert by any means, so if there is another recommended
way to achieve that functionality, I'd love some guidance.  Or, if
this is just a rare case, I guess it'd be time for me to roll up my
sleeves and hack up some Solr code.  Making QueryComponent
configurably deterministic would suffice (e.g. a
"shard.primary=main_index/path" parameter, perhaps?  or even just
treating the shards parameter as an ordered list with the primary
first?).  Adding field merging would likely be... more involved
though.
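
For the ordered-list idea, the behaviour I have in mind is roughly the
following Python sketch.  It only models the selection rule outside of Solr
(QueryComponent does not work this way today as far as I know); the function
name and the "id" key are assumptions for illustration:

```python
def pick_primary(shard_results, key="id"):
    """Keep, for each unique key, the doc from the earliest shard.

    shard_results: list of per-shard doc lists, primary shard first.
    Docs from later shards are dropped when the key was already seen.
    """
    chosen = {}
    for docs in shard_results:
        for doc in docs:
            # setdefault keeps the first doc stored for this key.
            chosen.setdefault(doc[key], doc)
    return list(chosen.values())


if __name__ == "__main__":
    main_docs = [{"id": "1", "title": "from main"}]
    tag_docs = [{"id": "1", "tag": ["x"]}, {"id": "2", "tag": ["y"]}]
    print(pick_primary([main_docs, tag_docs]))
```

The result no longer depends on which shard happens to respond first, only
on the order the shards were listed in.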

Thanks in advance for any advice!
-pete
