Thanks Hoss, externalizing this part is exactly the path we are exploring now, and not only for this reason.
We already started testing Hadoop SequenceFile as a write-ahead log for updates/deletes. SequenceFile supports append now (simply great!). It was a pain to have to add Hadoop into the mix for "mortal" collection sizes of 200 million, but on the other hand, having Hadoop around offers huge flexibility. The write-ahead log catches update commands (all Solr slaves fronting clients accept updates, but only to forward them to the WAL). The Solr master tries to catch up with the update stream, indexing in async fashion, and finally the Solr slaves chase the master index with standard Solr replication. Overnight we run simple map-reduce jobs to consolidate, normalize, and sort the update stream, and reindex at the end.

Deduplication and collection sorting is for us only an optimization if done reasonably often, like once per day/week, but if we do not do it, it doubles HW resources. Imo, native WAL support in Solr would definitely be one nice "nice to have" (for HA, update scalability...). The charming thing about a WAL is that updates never wait/disappear; if there is too much traffic, we only get slightly higher update latency, but updates definitely get processed. Some basic primitives on the WAL (consolidation, replaying the update stream on Solr, etc.) should be supported in this case, a sort of "smallish Hadoop feature subset for Solr clusters", but nothing oversized.

Cheers, eks

On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> : Is it possible in solr to have multivalued "id"? Or I need to make my
> : own "mv_ID" for this? Any ideas how to achieve this efficiently?
>
> This isn't something the SignatureUpdateProcessor is going to be able to
> help you with -- it does the deduplication by changing the low-level
> "update" (implemented as a delete then add) so that the key used to delete
> the older documents is based on the signature field instead of the id
> field.
>
> In order to do what you are describing, you would need to query the index
> for matching signatures, then add the resulting ids to your document
> before doing that "update".
>
> You could possibly do this in a custom UpdateProcessor, but you'd have to
> do something tricky to ensure you didn't overlook docs that had been added
> but not yet committed when checking for dups.
>
> I don't have a good suggestion for how to do this internally in Solr -- it
> seems like the type of bulk processing logic that would be better suited
> for an external process before you ever start indexing (much like link
> analysis for back references).
>
> -Hoss
>
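[Editor's note: the write-ahead-log pattern eks describes above, where updates are only appended and an indexer catches up asynchronously, can be sketched as follows. This is a minimal stand-in in Python, not the actual Hadoop SequenceFile API; the class name, record format (one JSON command per line), and `replay` offset mechanism are invented for the example.]

```python
import json
import os


class WriteAheadLog:
    """Minimal append-only WAL: one JSON update command per line."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, cmd):
        # Updates are only ever appended, never applied inline, so a
        # traffic spike raises latency but never drops an update.
        self.f.write(json.dumps(cmd) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())

    def replay(self, offset=0):
        # The indexer (e.g. the Solr master in the setup above) catches
        # up asynchronously from the last offset it has processed.
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i >= offset:
                    yield i, json.loads(line)
```

Usage would look like `wal.append({"op": "add", "id": "doc1"})` on the receiving side, with the indexer periodically calling `wal.replay(offset=last_seen + 1)`; the overnight consolidation job eks mentions would rewrite the log with duplicates collapsed.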