I'd guess it would be much faster, assuming that
the search savings wouldn't be swamped by the
additional transmission time over the wire and
parsing the request (although SolrJ uses a binary
format, so parsing the request probably isn't all
that expensive).

You could even do a hybrid approach. Pack up all
of the IDs you are about to update, send them to
your special *request* handler, and have it
respond with the documents that
were already in the index...

Hmmm, scratch all that. Start with just stringing
together a long set of <uniqueKey>s and searching
for them. Something like
q=id:(1 2 47 09873............)&fl=id
The response will contain a minimal set of data
(just the IDs). Then you can remove
each document ID returned from your
next update. No custom Solr components
required.

Solr defaults to a maxBooleanClauses count
of 1024, so each packet should have fewer IDs
than this, or you should bump that config setting.
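
If it helps, here's a rough, untested SolrJ-side sketch of that check;
it assumes "id" is your <uniqueKey> field (adjust names to taste):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SkipExistingClient {
  // Send only the docs whose IDs are not already in the index.
  // Keep each batch under the maxBooleanClauses limit.
  public static void indexNewOnly(SolrServer server,
                                  List<SolrInputDocument> batch) throws Exception {
    if (batch.isEmpty()) return;
    StringBuilder ids = new StringBuilder();
    for (SolrInputDocument doc : batch) {
      if (ids.length() > 0) ids.append(' ');
      // escape in case IDs contain query-syntax characters
      ids.append(ClientUtils.escapeQueryChars(
          String.valueOf(doc.getFieldValue("id"))));
    }
    SolrQuery q = new SolrQuery("id:(" + ids + ")");
    q.setFields("id");        // fl=id, minimal response
    q.setRows(batch.size());

    Set<Object> existing = new HashSet<Object>();
    for (SolrDocument d : server.query(q).getResults()) {
      existing.add(d.getFieldValue("id"));
    }

    List<SolrInputDocument> fresh = new ArrayList<SolrInputDocument>();
    for (SolrInputDocument doc : batch) {
      if (!existing.contains(doc.getFieldValue("id"))) {
        fresh.add(doc);
      }
    }
    if (!fresh.isEmpty()) server.add(fresh);
  }
}

Batch sizing then becomes a purely client-side knob.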

This should pretty much do what I was thinking
of doing with custom code, without having to write
anything.

Best
Erick

On Thu, Dec 29, 2011 at 8:15 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
> I have never developed for Solr before and don't know much about its
> internals, but today I tried one approach with the searcher.
>
> In my update processor I get the searcher and search for the ID. It works,
> but I need to load test it. Will index traversal be faster (less
> resource-consuming) than a search?
>
> Best Regards
> Alexander Aristov
>
>
> On 29 December 2011 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Hmmm, we're not communicating <G>...
>>
>> The update processor wouldn't search in the
>> classic sense. It would just use lower-level
>> index traversal to determine if the doc (identified
>> by your unique key) was already in the index
>> and skip indexing that document if it was. No real
>> *searching* involved (see TermDocs.seek for one
>> approach).
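>>
>> Something along these lines (a bare sketch only, not tested, against the
>> 3.x-era APIs; it assumes "id" is your <uniqueKey>, and that the factory
>> hands in the searcher from req.getSearcher()):
>>
>> import java.io.IOException;
>>
>> import org.apache.lucene.index.Term;
>> import org.apache.solr.search.SolrIndexSearcher;
>> import org.apache.solr.update.AddUpdateCommand;
>> import org.apache.solr.update.processor.UpdateRequestProcessor;
>>
>> public class SkipIfPresentProcessor extends UpdateRequestProcessor {
>>   private final SolrIndexSearcher searcher;
>>
>>   public SkipIfPresentProcessor(SolrIndexSearcher searcher,
>>                                 UpdateRequestProcessor next) {
>>     super(next);
>>     this.searcher = searcher;
>>   }
>>
>>   @Override
>>   public void processAdd(AddUpdateCommand cmd) throws IOException {
>>     String id = String.valueOf(cmd.solrDoc.getFieldValue("id"));
>>     // getFirstMatch is a raw term lookup (no query parsing or scoring);
>>     // -1 means the term isn't in the index
>>     if (searcher.getFirstMatch(new Term("id", id)) != -1) {
>>       return; // already indexed: drop the add, keep the old doc
>>     }
>>     super.processAdd(cmd); // genuinely new: pass it down the chain
>>   }
>> }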
>>
>> The price would be that you are transmitting the
>> document over to the Solr instance and then
>> throwing it away.
>>
>> Best
>> Erick
>>
>> On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev
>> <mkhlud...@griddynamics.com> wrote:
>> > Alexander,
>> >
>> > I have two ideas for how to implement fast dedupe externally, assuming
>> > your PKs don't fit into a java.util.*Map:
>> >
>> >   - your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
>> >   - if your crawler is stateless, i.e. it doesn't track PKs which have
>> >   already been crawled, you can retrieve them from Solr via
>> >   http://wiki.apache.org/solr/TermsComponent (sketch below). That's
>> >   blazingly fast, but there might be a problem with removed documents
>> >   (I'm not sure). It can also lead to an OutOfMemoryError (if you have
>> >   too many PKs). Let me know if you
>> >   need a workaround for either of these problems.
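>> >
>> > For the TermsComponent option, a rough sketch of the paging (untested;
>> > it assumes a /terms handler is configured and "id" is the PK field;
>> > terms.sort=index is what makes terms.lower paging work):
>> >
>> > import java.util.HashSet;
>> > import java.util.List;
>> > import java.util.Set;
>> >
>> > import org.apache.solr.client.solrj.SolrQuery;
>> > import org.apache.solr.client.solrj.SolrServer;
>> > import org.apache.solr.client.solrj.response.TermsResponse;
>> >
>> > public class IndexedIdFetcher {
>> >   // Page through every indexed PK. Beware: terms from deleted docs can
>> >   // linger until segments merge, and the Set can get huge.
>> >   public static Set<String> fetchIndexedIds(SolrServer server)
>> >       throws Exception {
>> >     SolrQuery q = new SolrQuery();
>> >     q.setQueryType("/terms");
>> >     q.set("terms", true);
>> >     q.set("terms.fl", "id");
>> >     q.set("terms.sort", "index");
>> >     q.set("terms.limit", "10000");
>> >     Set<String> ids = new HashSet<String>();
>> >     String last = "";
>> >     while (true) {
>> >       q.set("terms.lower", last);
>> >       q.set("terms.lower.incl", "false"); // resume after previous page
>> >       List<TermsResponse.Term> page =
>> >           server.query(q).getTermsResponse().getTerms("id");
>> >       if (page == null || page.isEmpty()) break;
>> >       for (TermsResponse.Term t : page) ids.add(t.getTerm());
>> >       last = page.get(page.size() - 1).getTerm();
>> >     }
>> >     return ids;
>> >   }
>> > }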
>> >
>> > If you choose internal dedupe (UpdateProcessor), please let me know if
>> > querying one-by-one is too slow for you and you need to do it
>> > page-by-page. I have done some such paging, and will do something
>> > similar soon, so I'm interested in it.
>> >
>> > Regards
>> >
>> > On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <
>> > alexander.aris...@gmail.com> wrote:
>> >
>> >> Unfortunately I have a lot of duplicates, and given that searching
>> >> might suffer, I will try implementing an update processor.
>> >>
>> >> But your idea is interesting and I will consider it, thanks.
>> >>
>> >> Best Regards
>> >> Alexander Aristov
>> >>
>> >>
>> >> On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:
>> >>
>> >> > Hello Alexander,
>> >> >
>> >> > I don't know much about your requirements in terms of size and
>> >> > performance, but I've had a similar use case and found a pretty
>> >> > simple workaround.
>> >> > If your duplicate rate is not too high, you can have the
>> >> > SignatureUpdateProcessor generate fingerprints of documents (you
>> >> > already did that).
>> >> >
>> >> > Simply turn off overwriting of duplicates; you can then rely on
>> >> > Solr's grouping / field collapsing to group your search results by
>> >> > fingerprint. You'll then have one document group per "real" document.
>> >> > You can use group.sort to sort your groups by indexing date
>> >> > ascending, and group.limit=1 to keep only the oldest one.
>> >> > You can even use group.format=simple to serve results as if no
>> >> > collapsing occurred, and use group.ngroups (/!\ could be expensive
>> >> > /!\) to get the real number of deduplicated documents.
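>> >> >
>> >> > For example, guessing at field names (say the signature lands in a
>> >> > "signature" field and the indexing date in "timestamp"), the query
>> >> > would look something like:
>> >> >
>> >> > q=*:*&group=true&group.field=signature&group.sort=timestamp asc
>> >> >     &group.limit=1&group.format=simple&group.ngroups=true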
>> >> >
>> >> > Of course the index will be larger; as I said, I made no assumptions
>> >> > regarding your operating requirements. And search can be a bit
>> >> > slower, depending on the average rate of duplicated documents.
>> >> > But you've got your issue addressed by configuration tuning only...
>> >> > Depending on your project's sizing, it could be time-saving.
>> >> >
>> >> > The advantage is that you have the precious information of what
>> >> > content is duplicated from where :-)
>> >> >
>> >> > Hope this helps,
>> >> >
>> >> > --
>> >> > Tanguy
>> >> >
>> >> > On 28/12/2011 15:45, Alexander Aristov wrote:
>> >> >
>> >> >> Thanks Erick,
>> >> >>
>> >> >> It sets my direction. I will write a new plugin, get back to the dev
>> >> >> forum with results, and then we will decide next steps.
>> >> >>
>> >> >> Best Regards
>> >> >> Alexander Aristov
>> >> >>
>> >> >>
>> >> >> On 28 December 2011 18:08, Erick Erickson
>> >> >> <erickerick...@gmail.com> wrote:
>> >> >>
>> >> >>> Well, the short answer is that nobody else has
>> >> >>> 1> had a similar requirement
>> >> >>> AND
>> >> >>> 2> not found a suitable workaround
>> >> >>> AND
>> >> >>> 3> implemented the change and contributed it back.
>> >> >>>
>> >> >>> So, if you'd like to volunteer <G>.....
>> >> >>>
>> >> >>> Seriously. If you think this would be valuable and are
>> >> >>> willing to work on it, hop on over to the dev list and
>> >> >>> discuss it, open a JIRA and make it work. I'd start
>> >> >>> by opening a discussion on the dev list before
>> >> >>> opening a JIRA, just to get a sense of where the
>> >> >>> snags would be in changing the Solr code, but that's
>> >> >>> optional.
>> >> >>>
>> >> >>> That said, writing your own update request processor
>> >> >>> that detects this case isn't very difficult:
>> >> >>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
>> >> >>> and use it as a plugin.
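>> >> >>>
>> >> >>> The factory skeleton is tiny; something like this (names made up,
>> >> >>> and package names vary a bit across Solr versions):
>> >> >>>
>> >> >>> import org.apache.solr.request.SolrQueryRequest;
>> >> >>> import org.apache.solr.response.SolrQueryResponse;
>> >> >>> import org.apache.solr.update.processor.UpdateRequestProcessor;
>> >> >>> import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
>> >> >>>
>> >> >>> public class SkipExistingFactory extends UpdateRequestProcessorFactory {
>> >> >>>   @Override
>> >> >>>   public UpdateRequestProcessor getInstance(SolrQueryRequest req,
>> >> >>>                                             SolrQueryResponse rsp,
>> >> >>>                                             UpdateRequestProcessor next) {
>> >> >>>     // hand the request's searcher to a processor that looks up the
>> >> >>>     // unique key and skips the add when the doc is already indexed
>> >> >>>     return new SkipExistingProcessor(req.getSearcher(), next);
>> >> >>>   }
>> >> >>> }
>> >> >>>
>> >> >>> where SkipExistingProcessor is your UpdateRequestProcessor subclass
>> >> >>> overriding processAdd. Then register the factory in an
>> >> >>> updateRequestProcessorChain in solrconfig.xml and point your update
>> >> >>> handler's update.chain at it.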
>> >> >>>
>> >> >>> Best
>> >> >>> Erick
>> >> >>>
>> >> >>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
>> >> >>> <alexander.aris...@gmail.com>  wrote:
>> >> >>>
>> >> >>>> the problem with dedupe (SignatureUpdateProcessor) is that it
>> >> >>>> REPLACES old docs. I have tried it already.
>> >> >>>>
>> >> >>>> Best Regards
>> >> >>>> Alexander Aristov
>> >> >>>>
>> >> >>>>
>> >> >>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
>> >> >>>>
>> >> >>>>> The SignatureUpdateProcessor is for exactly this problem:
>> >> >>>>> http://wiki.apache.org/solr/Deduplication
>> >> >>>>>
>> >> >>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
>> >> >>>>> <alexander.aris...@gmail.com> wrote:
>> >> >>>>>
>> >> >>>>>> I get docs from external sources and the only place I keep them
>> >> >>>>>> is the Solr index. I have no database or other means to track
>> >> >>>>>> indexed docs (my personal opinion is that it would be a huge
>> >> >>>>>> headache).
>> >> >>>>>>
>> >> >>>>>> Some docs might change slightly in their original sources, but I
>> >> >>>>>> don't need those changes. In fact I need the original data only.
>> >> >>>>>>
>> >> >>>>>> So I have no other way but to either check if a document is
>> >> >>>>>> already in the index before I put it into the SolrJ array (read:
>> >> >>>>>> query Solr), or develop my own update chain processor, implement
>> >> >>>>>> the ID check there, and skip such docs.
>> >> >>>>>>
>> >> >>>>>> Maybe it's the wrong place to argue, and probably it's been
>> >> >>>>>> discussed before, but I wonder why the simple overwrite parameter
>> >> >>>>>> doesn't work here. In my opinion it suits here perfectly. In
>> >> >>>>>> combination with a unique ID it can cover all possible variants.
>> >> >>>>>>
>> >> >>>>>> cases:
>> >> >>>>>>
>> >> >>>>>> 1. overwrite=true and uniqueID exists: the newer doc should
>> >> >>>>>> overwrite the old one.
>> >> >>>>>>
>> >> >>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be
>> >> >>>>>> skipped since the old one exists.
>> >> >>>>>>
>> >> >>>>>> 3. uniqueID doesn't exist: the newer doc just gets added,
>> >> >>>>>> regardless of whether an old one exists or not.
>> >> >>>>>>
>> >> >>>>>> Best Regards
>> >> >>>>>> Alexander Aristov
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> On 27 December 2011 22:53, Erick Erickson
>> >> >>>>>> <erickerick...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>>> Mikhail is right. As far as I know, the assumption built into
>> >> >>>>>>> Solr is that duplicate IDs (when <uniqueKey> is defined) should
>> >> >>>>>>> trigger the old document to be replaced.
>> >> >>>>>>>
>> >> >>>>>>> What is your system-of-record? By that I mean, what does your
>> >> >>>>>>> SolrJ program do to send data to Solr? Is there any way you
>> >> >>>>>>> could just *not* send documents that are already in the Solr
>> >> >>>>>>> index based on, for instance, any timestamp associated with your
>> >> >>>>>>> system-of-record and the last time you did an incremental index?
>> >> >>>>>>>
>> >> >>>>>>> Best
>> >> >>>>>>> Erick
>> >> >>>>>>>
>> >> >>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>> >> >>>>>>> <alexander.aris...@gmail.com>  wrote:
>> >> >>>>>>>
>> >> >>>>>>>> Hi
>> >> >>>>>>>>
>> >> >>>>>>>> I am not using a database. All the data I need is in the Solr
>> >> >>>>>>>> index; that's why I want to skip excessive checks.
>> >> >>>>>>>>
>> >> >>>>>>>> I will check DIH, but I am not sure it helps.
>> >> >>>>>>>>
>> >> >>>>>>>> I am fluent in Java and it's not a problem for me to write a
>> >> >>>>>>>> class or so, but I want to check first whether there are any
>> >> >>>>>>>> ways (workarounds) to make it work without coding, just by
>> >> >>>>>>>> playing around with configuration and params. I don't want to
>> >> >>>>>>>> go away from the default Solr implementation.
>> >> >>>>>>>>
>> >> >>>>>>>> Best Regards
>> >> >>>>>>>> Alexander Aristov
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev
>> >> >>>>>>>> <mkhlud...@griddynamics.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov
>> >> >>>>>>>>> <alexander.aris...@gmail.com> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>>> Hi people,
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> I urgently need your help!
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> I have solr 3.3 configured and running. I do incremental
>> >> >>>>>>>>>> indexing 4 times a day using bulk updates. Some documents are
>> >> >>>>>>>>>> identical to some extent and I wish to skip them, not index
>> >> >>>>>>>>>> them.
>> >> >>>>>>>>>> But here is the problem: I could not find a way to tell Solr
>> >> >>>>>>>>>> to ignore new duplicate docs and keep the old indexed docs. I
>> >> >>>>>>>>>> don't care that it's newer. Just determine by ID that such a
>> >> >>>>>>>>>> document is in the index already, and that's it.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> I use SolrJ for indexing. I have tried setting
>> >> >>>>>>>>>> overwrite=false and the dedupe approach, but nothing helped
>> >> >>>>>>>>>> me. Either a newer doc overwrites the old one or I get a
>> >> >>>>>>>>>> duplicate.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> I think it's a very simple and basic feature and it must
>> >> >>>>>>>>>> exist. What did I do wrong, or what didn't I do?
>> >> >>>>>>>>>>
>> >> >>>>>>>>> I guess it's because the mainstream approach is delta-import,
>> >> >>>>>>>>> where you have "updated" timestamps in your DB and a
>> >> >>>>>>>>> "last-import" timestamp stored somewhere. You can check how it
>> >> >>>>>>>>> works in DIH.
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>> Tried Google but I couldn't find a solution there, although
>> >> >>>>>>>>>> many people have encountered this problem.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>> It definitely can be done by overriding
>> >> >>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand),
>> >> >>>>>>>>> but I suggest starting by implementing your own
>> >> >>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search
>> >> >>>>>>>>> for the PK and bypass the chain call if it's found. Then, if
>> >> >>>>>>>>> you meet performance issues querying your PKs one by one (but
>> >> >>>>>>>>> only after that), you can batch your searches; there are a
>> >> >>>>>>>>> couple of optimization techniques for huge disjunction queries
>> >> >>>>>>>>> like PK:(2 OR 4 OR 5 OR 6).
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>> I am starting to think that I must query the index to check
>> >> >>>>>>>>>> whether a doc to be added is already there, and not add it to
>> >> >>>>>>>>>> the array, but I have so many docs that I am afraid it's not
>> >> >>>>>>>>>> a good solution.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> Best Regards
>> >> >>>>>>>>>> Alexander Aristov
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> --
>> >> >>>>>>>>> Sincerely yours
>> >> >>>>>>>>> Mikhail Khludnev
>> >> >>>>>>>>> Lucid Certified
>> >> >>>>>>>>> Apache Lucene/Solr Developer
>> >> >>>>>>>>> Grid Dynamics
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>
>> >> >>>>> --
>> >> >>>>> Lance Norskog
>> >> >>>>> goks...@gmail.com
>> >> >>>>>
>> >> >>>>>
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Sincerely yours
>> > Mikhail Khludnev
>> > Lucid Certified
>> > Apache Lucene/Solr Developer
>> > Grid Dynamics
>> >
>> > <http://www.griddynamics.com>
>> >  <mkhlud...@griddynamics.com>
>>