I'd guess it would be much faster, assuming that the search savings aren't swamped by the additional transmission time over the wire and by parsing the request (although SolrJ uses a binary format, so parsing the request probably isn't all that expensive).
You could even do a hybrid approach. Pack up all of the IDs you are about to update, send them to your special *request* handler, and have your request handler respond with the documents that were already in the index... Hmmm, scratch all that. Start with just stringing together a long set of <uniqueKey> values and searching for them. Something like:

q=id:(1 2 47 09873............)&fl=id

The response should be a minimal set of data (just the IDs). Then you can remove each returned document ID from your next update. No custom Solr components required. Solr defaults maxBooleanClauses to 1024, so your packets should have fewer IDs than this, or you should bump that config setting. This should pretty much do what I was thinking of doing with custom code, without having to write anything.

Best
Erick

On Thu, Dec 29, 2011 at 8:15 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:

> I have never developed for Solr yet and don't know much about its internals, but today I tried one approach with the searcher.
>
> In my update processor I get a searcher and search for the ID. It works, but I need to load test it. Will index traversal be faster (less resource consuming) than search?
>
> Best Regards
> Alexander Aristov
>
> On 29 December 2011 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Hmmm, we're not communicating <G>...
>>
>> The update processor wouldn't search in the classic sense. It would just use lower-level index traversal to determine whether the doc (identified by your unique key) was already in the index, and skip indexing that document if it was. No real *searching* involved (see TermDocs.seek for one approach).
>>
>> The price would be that you are transmitting the document over to the Solr instance and then throwing it away.
>>
>> Best
>> Erick
>>
>> On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev
>> <mkhlud...@griddynamics.com> wrote:
>>
>>> Alexander,
>>>
>>> I have two ideas for implementing fast dedupe externally, assuming your PKs don't fit in a java.util.*Map:
>>>
>>> - your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
>>> - if your crawler is stateless - it doesn't track which PKs have already been crawled - you can retrieve them from Solr via http://wiki.apache.org/solr/TermsComponent . That's blazingly fast, but there might be a problem with removed documents (I'm not sure). It can also lead to an OOMException (if you have too many PKs). Let me know if you need a workaround for either of these problems.
>>>
>>> If you choose internal dedupe (UpdateProcessor), please let me know if querying one by one turns out to be too slow for you and you need to do it page by page. I have done some of this paging, and will do something similar soon, so I'm interested in it.
>>>
>>> Regards
>>>
>>> On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov
>>> <alexander.aris...@gmail.com> wrote:
>>>
>>>> Unfortunately I have a lot of duplicates, and since searching might suffer I will try implementing the update processor.
>>>>
>>>> But your idea is interesting and I will consider it, thanks.
>>>>
>>>> Best Regards
>>>> Alexander Aristov
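A minimal, untested SolrJ sketch of the ID-filtering approach Erick describes at the top of this message: query the batch's IDs with fl=id, then drop every document whose ID comes back. The field name "id", the surrounding class, and term-safe key values (no query-syntax characters needing escaping) are all assumptions.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ExistingIdFilter {

    // Returns only the documents whose IDs are NOT already indexed.
    // Keep each batch below maxBooleanClauses (1024 by default).
    public static List<SolrInputDocument> removeAlreadyIndexed(
            SolrServer server, List<SolrInputDocument> batch) throws Exception {

        StringBuilder q = new StringBuilder("id:(");
        for (SolrInputDocument doc : batch) {
            q.append(doc.getFieldValue("id")).append(' ');
        }
        q.append(')');

        SolrQuery query = new SolrQuery(q.toString());
        query.setFields("id");        // fl=id - return only the key
        query.setRows(batch.size());  // make sure every match comes back

        Set<String> existing = new HashSet<String>();
        for (SolrDocument found : server.query(query).getResults()) {
            existing.add(String.valueOf(found.getFieldValue("id")));
        }

        List<SolrInputDocument> fresh = new ArrayList<SolrInputDocument>();
        for (SolrInputDocument doc : batch) {
            if (!existing.contains(String.valueOf(doc.getFieldValue("id")))) {
                fresh.add(doc);
            }
        }
        return fresh;
    }
}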
>>>> On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:
>>>>
>>>>> Hello Alexander,
>>>>>
>>>>> I don't know much about your requirements in terms of size and performance, but I've had a similar use case and found a pretty simple workaround.
>>>>>
>>>>> If your duplicate rate is not too high, you can have the SignatureUpdateProcessor generate fingerprints of your documents (you already did that).
>>>>>
>>>>> Simply turn off overwriting of duplicates; you can then rely on Solr's grouping / field collapsing to group your search results by fingerprint. You'll then have one document group per "real" document. You can use group.sort to sort each group by indexing date ascending, and group.limit=1 to keep only the oldest one. You can even use group.format=simple to serve results as if no collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) to get the real number of deduplicated documents.
>>>>>
>>>>> Of course the index will be larger; as I said, I made no assumptions about your operating requirements. And search can be a bit slower, depending on the average rate of duplicated documents. But your issue gets addressed by configuration tuning only... Depending on your project's sizing, that could save time.
>>>>>
>>>>> The advantage is that you keep the precious information of what content is duplicated from where :-)
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> --
>>>>> Tanguy
>>>>>
>>>>> On 28/12/2011 15:45, Alexander Aristov wrote:
>>>>>
>>>>>> Thanks Erick,
>>>>>>
>>>>>> that gives me a direction. I will write a new plugin and get back to the dev forum with results, and then we will decide on next steps.
>>>>>>
>>>>>> Best Regards
>>>>>> Alexander Aristov
>>>>>>
>>>>>> On 28 December 2011 18:08, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>>> Well, the short answer is that nobody else has
>>>>>>> 1> had a similar requirement
>>>>>>> AND
>>>>>>> 2> not found a suitable workaround
>>>>>>> AND
>>>>>>> 3> implemented the change and contributed it back.
>>>>>>>
>>>>>>> So, if you'd like to volunteer <G>.....
>>>>>>>
>>>>>>> Seriously. If you think this would be valuable and are willing to work on it, hop on over to the dev list and discuss it, open a JIRA, and make it work. I'd start by opening a discussion on the dev list before opening a JIRA, just to get a sense of where the snags would be in changing the Solr code, but that's optional.
>>>>>>>
>>>>>>> That said, writing your own update request processor that detects this case isn't very difficult: extend UpdateRequestProcessorFactory/UpdateRequestProcessor and use it as a plugin.
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
>>>>>>> <alexander.aris...@gmail.com> wrote:
>>>>>>>
>>>>>>>> the problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old docs. I have tried it already.
>>>>>>>>
>>>>>>>> Best Regards
>>>>>>>> Alexander Aristov
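A rough sketch of the update processor Erick suggests above, using Solr 3.x-era APIs: look the unique key up in the index and bypass the rest of the chain if a match exists. It is untested, the class names are made up, and it would be registered in an updateRequestProcessorChain in solrconfig.xml ahead of RunUpdateProcessorFactory. Note that the searcher only sees committed documents, so duplicates arriving between commits would slip through.

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingDocumentsProcessorFactory
        extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new SkipExistingDocumentsProcessor(req, next);
    }

    static class SkipExistingDocumentsProcessor extends UpdateRequestProcessor {

        private final SolrQueryRequest req;

        SkipExistingDocumentsProcessor(SolrQueryRequest req,
                UpdateRequestProcessor next) {
            super(next);
            this.req = req;
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SchemaField keyField = req.getSchema().getUniqueKeyField();
            String id = cmd.solrDoc.getFieldValue(keyField.getName()).toString();
            // getFirstMatch returns an internal doc id, or -1 if no document
            // carries this term - a term lookup, not a full search.
            int found = req.getSearcher().getFirstMatch(
                    new Term(keyField.getName(),
                             keyField.getType().toInternal(id)));
            if (found != -1) {
                return; // already indexed: skip the doc, bypass the chain
            }
            super.processAdd(cmd);
        }
    }
}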
>>>>>>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The SignatureUpdateProcessor is for exactly this problem:
>>>>>>>>> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
>>>>>>>>>
>>>>>>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
>>>>>>>>> <alexander.aris...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I get docs from external sources and the only place I keep them is the Solr index. I have no database or other means to track indexed docs (my personal opinion is that it would be a huge headache).
>>>>>>>>>>
>>>>>>>>>> Some docs might change slightly in their original sources, but I don't need those changes. In fact I need the original data only.
>>>>>>>>>>
>>>>>>>>>> So I have no other way but to either check whether a document is already in the index before I put it into the SolrJ array (read: query Solr), or develop my own update chain processor, implement the ID check there, and skip such docs.
>>>>>>>>>>
>>>>>>>>>> Maybe this is the wrong place to argue, and probably it's been discussed before, but I wonder why the simple overwrite parameter doesn't work here.
>>>>>>>>>>
>>>>>>>>>> In my opinion it suits perfectly here. In combination with a unique ID it can cover all possible variants.
>>>>>>>>>>
>>>>>>>>>> Cases:
>>>>>>>>>>
>>>>>>>>>> 1. overwrite=true and uniqueID exists: the newer doc should overwrite the old one.
>>>>>>>>>>
>>>>>>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be skipped since the old one exists.
>>>>>>>>>>
>>>>>>>>>> 3. uniqueID doesn't exist: the newer doc just gets added, regardless of whether an old one exists.
>>>>>>>>>>
>>>>>>>>>> Best Regards
>>>>>>>>>> Alexander Aristov
>>>>>>>>>>
>>>>>>>>>> On 27 December 2011 22:53, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Mikhail is right as far as I know; the assumption built into Solr is that duplicate IDs (when <uniqueKey> is defined) should trigger the old document to be replaced.
>>>>>>>>>>>
>>>>>>>>>>> What is your system-of-record? By that I mean: what does your SolrJ program do to send data to Solr? Is there any way you could just *not* send documents that are already in the Solr index, based on, for instance, a timestamp associated with your system-of-record and the last time you did an incremental index?
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>> Erick
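For Tanguy's query-time variant above, a hedged SolrJ sketch of the grouping request. The "signature" field (populated by the SignatureUpdateProcessorFactory from Lance's link, configured with overwriteDupes=false) and the "indexed_at" timestamp field are assumptions about the schema, and group.ngroups may require a 3.x release newer than 3.3.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DedupedSearch {

    // Collapses duplicates at query time instead of skipping them at index time.
    public static QueryResponse query(SolrServer server, String userQuery)
            throws Exception {
        SolrQuery query = new SolrQuery(userQuery);
        query.set("group", true);
        query.set("group.field", "signature");     // one group per fingerprint
        query.set("group.sort", "indexed_at asc"); // oldest copy first in each group
        query.set("group.limit", 1);               // serve only the oldest copy
        query.set("group.format", "simple");       // flat results, as if deduplicated
        query.set("group.ngroups", true);          // true deduplicated count (can be expensive)
        return server.query(query);
    }
}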
>>>>>>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>>>>>>>>>>> <alexander.aris...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi
>>>>>>>>>>>>
>>>>>>>>>>>> I am not using a database. All the needed data is in the Solr index; that's why I want to skip excessive checks.
>>>>>>>>>>>>
>>>>>>>>>>>> I will check DIH, but I'm not sure it helps.
>>>>>>>>>>>>
>>>>>>>>>>>> I am fluent in Java and it's not a problem for me to write a class or so, but first I want to check whether there are any ways (workarounds) to make it work without coding, just by playing around with configuration and params. I don't want to move away from the default Solr implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards
>>>>>>>>>>>> Alexander Aristov
>>>>>>>>>>>>
>>>>>>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev
>>>>>>>>>>>> <mkhlud...@griddynamics.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov
>>>>>>>>>>>>> <alexander.aris...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi people,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I urgently need your help!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have Solr 3.3 configured and running. I do incremental indexing 4 times a day using bulk updates. Some documents are identical to some extent, and I wish to skip them, not index them. But here is the problem: I could not find a way to tell Solr to ignore new duplicate docs and keep the old indexed docs. I don't care that a doc is newer; just determine by ID that such a document is already in the index, and that's it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I use SolrJ for indexing. I have tried setting overwrite=false and the dedupe approach, but nothing helped me. Either a newer doc overwrites the old one, or I get a duplicate.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think this is a very simple and basic feature, and it must exist. What did I do wrong, or fail to do?
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess that is because the mainstream approach is delta-import, where you have "updated" timestamps in your DB and a "last-import" timestamp stored somewhere. You can check how it works in DIH.
>>>>>>>>>>>>>> Tried Google, but I couldn't find a solution there, although many people have encountered this problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> It can definitely be done by overriding o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest starting with implementing your own http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK and bypass the chain call if it's found. Then, if you meet performance issues querying your PKs one by one (but only after that), you can batch your searches; there are a couple of optimization techniques for huge disjunction queries like PK:(2 OR 4 OR 5 OR 6).
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am starting to think that I must query the index to check whether a doc to be added is already there, and not add it to the array if so, but I have so many docs that I am afraid it's not a good solution.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards
>>>>>>>>>>>>>> Alexander Aristov
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Sincerely yours
>>>>>>>>>>>>> Mikhail Khludnev
>>>>>>>>>>>>> Lucid Certified
>>>>>>>>>>>>> Apache Lucene/Solr Developer
>>>>>>>>>>>>> Grid Dynamics
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Lance Norskog
>>>>>>>>> goks...@gmail.com
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> Lucid Certified
>>> Apache Lucene/Solr Developer
>>> Grid Dynamics
>>>
>>> <http://www.griddynamics.com>
>>> <mkhlud...@griddynamics.com>
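Finally, a sketch of Mikhail's TermsComponent suggestion from earlier in the thread: pull every indexed unique key in one pass and dedupe on the client side. It assumes the stock /terms request handler from the example solrconfig.xml and an "id" unique key field, and it inherits the caveats Mikhail mentions: terms of deleted documents linger until segments merge, and a huge key set can exhaust client memory.

import java.util.HashSet;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.TermsResponse;

public class IndexedKeyDump {

    // Fetches every term of the unique key field in one request.
    public static Set<String> fetchIndexedIds(SolrServer server) throws Exception {
        SolrQuery query = new SolrQuery();
        query.set("qt", "/terms");    // handler wired to the TermsComponent
        query.set("terms", true);
        query.set("terms.fl", "id");  // unique key field - an assumption
        query.set("terms.limit", -1); // no cap on the number of terms returned

        TermsResponse terms = server.query(query).getTermsResponse();
        Set<String> ids = new HashSet<String>();
        for (TermsResponse.Term t : terms.getTerms("id")) {
            ids.add(t.getTerm());
        }
        return ids;
    }
}

The crawler can then drop any document whose ID is already in the returned set before sending the next batch.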