Thanks Hoss, externalizing this part is exactly the path we are exploring now, and not only for this reason.
We already started testing Hadoop SequenceFile as a write-ahead log for updates/deletes. SequenceFile supports append now (simply great!). It was a pain to have to add Hadoop into the mix for "mortal" collection sizes of 200 million, but on the other hand, having Hadoop around offers huge flexibility. The write-ahead log catches update commands (all Solr slaves fronting clients accept updates, but only to forward them to the WAL). The Solr master tries to catch up with the update stream, indexing in async fashion, and finally the Solr slaves chase the master index with standard Solr replication. Overnight we run simple map-reduce jobs to consolidate, normalize, and sort the update stream, and reindex at the end.

Deduplication and collection sorting is for us only an optimization if done reasonably often, like once per day/week, but if we do not do it, it doubles HW resources. Imo, native WAL support in Solr would definitely be one nice "nice to have" (for HA, update scalability...). The charming thing about a WAL is that updates never wait/disappear; if there is too much traffic, we only get slightly higher update latency, but updates definitely get processed. Some basic primitives on the WAL (consolidation, replaying the update stream on Solr, etc.) should be supported in this case, a sort of "smallish Hadoop feature subset for Solr clusters", but nothing oversized.

Cheers, eks

On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> : Is it possible in solr to have multivalued "id"? Or I need to make my
> : own "mv_ID" for this? Any ideas how to achieve this efficiently?
>
> This isn't something the SignatureUpdateProcessor is going to be able to
> help you with -- it does the deduplication by changing the low-level
> "update" (implemented as a delete then add) so that the key used to delete
> the older documents is based on the signature field instead of the id
> field.
>
> In order to do what you are describing, you would need to query the index
> for matching signatures, then add the resulting ids to your document
> before doing that "update".
>
> You could possibly do this in a custom UpdateProcessor, but you'd have to
> do something tricky to ensure you didn't overlook docs that had been added
> but not yet committed when checking for dups.
>
> I don't have a good suggestion for how to do this internally in Solr -- it
> seems like the type of bulk processing logic that would be better suited
> for an external process before you ever start indexing (much like link
> analysis for back references).
>
> -Hoss
>
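[Editor's note: the write-ahead-log pattern eks describes above, where updates are only appended and an indexer catches up asynchronously, can be sketched as follows. This is a minimal stand-in in Python, not the actual Hadoop SequenceFile API; the class name, record format (one JSON command per line), and `replay` offset mechanism are invented for the example.]

```python
import json
import os


class WriteAheadLog:
    """Minimal append-only WAL: one JSON update command per line."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, cmd):
        # Updates are only ever appended, never applied inline, so a
        # traffic spike raises latency but never drops an update.
        self.f.write(json.dumps(cmd) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())

    def replay(self, offset=0):
        # The indexer (e.g. the Solr master in the setup above) catches
        # up asynchronously from the last offset it has processed.
        with open(self.path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i >= offset:
                    yield i, json.loads(line)
```

Usage would look like `wal.append({"op": "add", "id": "doc1"})` on the receiving side, with the indexer periodically calling `wal.replay(offset=last_seen + 1)`; the overnight consolidation job eks mentions would rewrite the log with duplicates collapsed.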