I'd guess it would be much faster, assuming that
the search savings wouldn't be swamped by the
additional transmission time over the wire and
parsing the request (although SolrJ uses a binary
format, so parsing the request probably isn't all
that expensive).

You could even do a hybrid approach. Pack up all
of the IDs you are about to update, send them to
your special *request* handler, and have it
respond with the documents that
were already in the index...

Hmmm, scratch all that. Start with just stringing
together a long set of <uniqueKey>s and searching
for them. Something like
q=id:(1 2 47 09873............)&fl=id
The response will contain a minimal set of data
(just the IDs). Then you can remove
each document ID returned from your
next update. No custom Solr components
required.

Solr defaults to a maxBooleanClauses count
of 1024, so each packet should have fewer IDs
than this, or you should bump that config setting.
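
If it helps, here's a rough, untested SolrJ-side sketch of that check;
it assumes "id" is your <uniqueKey> field (adjust names to taste):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SkipExistingClient {
  // Send only the docs whose IDs are not already in the index.
  // Keep each batch under the maxBooleanClauses limit.
  public static void indexNewOnly(SolrServer server,
                                  List<SolrInputDocument> batch) throws Exception {
    if (batch.isEmpty()) return;
    StringBuilder ids = new StringBuilder();
    for (SolrInputDocument doc : batch) {
      if (ids.length() > 0) ids.append(' ');
      // escape in case IDs contain query-syntax characters
      ids.append(ClientUtils.escapeQueryChars(
          String.valueOf(doc.getFieldValue("id"))));
    }
    SolrQuery q = new SolrQuery("id:(" + ids + ")");
    q.setFields("id");        // fl=id, minimal response
    q.setRows(batch.size());

    Set<Object> existing = new HashSet<Object>();
    for (SolrDocument d : server.query(q).getResults()) {
      existing.add(d.getFieldValue("id"));
    }

    List<SolrInputDocument> fresh = new ArrayList<SolrInputDocument>();
    for (SolrInputDocument doc : batch) {
      if (!existing.contains(doc.getFieldValue("id"))) {
        fresh.add(doc);
      }
    }
    if (!fresh.isEmpty()) server.add(fresh);
  }
}

Batch sizing then becomes a purely client-side knob.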

This should pretty much do what I was thinking
of doing with custom code, without having to write
anything.

Best
Erick

On Thu, Dec 29, 2011 at 8:15 AM, Alexander Aristov
<alexander.aris...@gmail.com> wrote:
> I have never developed for Solr before and don't know much about its
> internals, but today I tried one approach with the searcher.
>
> In my update processor I get the searcher and search for the ID. It works,
> but I need to load test it. Will index traversal be faster (less
> resource-consuming) than a search?
>
> Best Regards
> Alexander Aristov
>
>
> On 29 December 2011 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Hmmm, we're not communicating <G>...
>>
>> The update processor wouldn't search in the
>> classic sense. It would just use lower-level
>> index traversal to determine if the doc (identified
>> by your unique key) was already in the index
>> and skip indexing that document if it was. No real
>> *searching* involved (see TermDocs.seek for one
>> approach).
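>>
>> Something along these lines (a bare sketch only, not tested, against the
>> 3.x-era APIs; it assumes "id" is your <uniqueKey>, and that the factory
>> hands in the searcher from req.getSearcher()):
>>
>> import java.io.IOException;
>>
>> import org.apache.lucene.index.Term;
>> import org.apache.solr.search.SolrIndexSearcher;
>> import org.apache.solr.update.AddUpdateCommand;
>> import org.apache.solr.update.processor.UpdateRequestProcessor;
>>
>> public class SkipIfPresentProcessor extends UpdateRequestProcessor {
>>   private final SolrIndexSearcher searcher;
>>
>>   public SkipIfPresentProcessor(SolrIndexSearcher searcher,
>>                                 UpdateRequestProcessor next) {
>>     super(next);
>>     this.searcher = searcher;
>>   }
>>
>>   @Override
>>   public void processAdd(AddUpdateCommand cmd) throws IOException {
>>     String id = String.valueOf(cmd.solrDoc.getFieldValue("id"));
>>     // getFirstMatch is a raw term lookup (no query parsing or scoring);
>>     // -1 means the term isn't in the index
>>     if (searcher.getFirstMatch(new Term("id", id)) != -1) {
>>       return; // already indexed: drop the add, keep the old doc
>>     }
>>     super.processAdd(cmd); // genuinely new: pass it down the chain
>>   }
>> }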
>>
>> The price would be that you are transmitting the
>> document over to the Solr instance and then
>> throwing it away.
>>
>> Best
>> Erick
>>
>> On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev
>> <mkhlud...@griddynamics.com> wrote:
>> > Alexander,
>> >
>> > I have two ideas for how to implement fast dedupe externally, assuming
>> > your PKs don't fit into a java.util.*Map:
>> >
>> >   - your crawler can use an in-process RDBMS (Derby, H2) to track dupes;
>> >   - if your crawler is stateless, i.e. it doesn't track PKs which have
>> >   already been crawled, you can retrieve them from Solr via
>> >   http://wiki.apache.org/solr/TermsComponent (sketch below). That's
>> >   blazingly fast, but there might be a problem with removed documents
>> >   (I'm not sure). It can also lead to an OutOfMemoryError (if you have
>> >   too many PKs). Let me know if you
>> >   need a workaround for either of these problems.
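>> >
>> > For the TermsComponent option, a rough sketch of the paging (untested;
>> > it assumes a /terms handler is configured and "id" is the PK field;
>> > terms.sort=index is what makes terms.lower paging work):
>> >
>> > import java.util.HashSet;
>> > import java.util.List;
>> > import java.util.Set;
>> >
>> > import org.apache.solr.client.solrj.SolrQuery;
>> > import org.apache.solr.client.solrj.SolrServer;
>> > import org.apache.solr.client.solrj.response.TermsResponse;
>> >
>> > public class IndexedIdFetcher {
>> >   // Page through every indexed PK. Beware: terms from deleted docs can
>> >   // linger until segments merge, and the Set can get huge.
>> >   public static Set<String> fetchIndexedIds(SolrServer server)
>> >       throws Exception {
>> >     SolrQuery q = new SolrQuery();
>> >     q.setQueryType("/terms");
>> >     q.set("terms", true);
>> >     q.set("terms.fl", "id");
>> >     q.set("terms.sort", "index");
>> >     q.set("terms.limit", "10000");
>> >     Set<String> ids = new HashSet<String>();
>> >     String last = "";
>> >     while (true) {
>> >       q.set("terms.lower", last);
>> >       q.set("terms.lower.incl", "false"); // resume after previous page
>> >       List<TermsResponse.Term> page =
>> >           server.query(q).getTermsResponse().getTerms("id");
>> >       if (page == null || page.isEmpty()) break;
>> >       for (TermsResponse.Term t : page) ids.add(t.getTerm());
>> >       last = page.get(page.size() - 1).getTerm();
>> >     }
>> >     return ids;
>> >   }
>> > }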
>> >
>> > If you choose internal dedupe (UpdateProcessor), please let me know if
>> > querying one-by-one is too slow for you and you need to do it
>> > page-by-page. I have done some such paging, and will do something
>> > similar soon, so I'm interested in it.
>> >
>> > Regards
>> >
>> > On Thu, Dec 29, 2011 at 9:34 AM, Alexander Aristov <
>> > alexander.aris...@gmail.com> wrote:
>> >
>> >> Unfortunately I have a lot of duplicates, and given that searching
>> >> might suffer, I will try implementing an update processor.
>> >>
>> >> But your idea is interesting and I will consider it, thanks.
>> >>
>> >> Best Regards
>> >> Alexander Aristov
>> >>
>> >>
>> >> On 28 December 2011 19:12, Tanguy Moal <tanguy.m...@gmail.com> wrote:
>> >>
>> >> > Hello Alexander,
>> >> >
>> >> > I don't know much about your requirements in terms of size and
>> >> > performance, but I've had a similar use case and found a pretty
>> >> > simple workaround.
>> >> > If your duplicate rate is not too high, you can have the
>> >> > SignatureUpdateProcessor generate fingerprints of documents (you
>> >> > already did that).
>> >> >
>> >> > Simply turn off overwriting of duplicates; you can then rely on
>> >> > Solr's grouping / field collapsing to group your search results by
>> >> > fingerprint. You'll then have one document group per "real" document.
>> >> > You can use group.sort to sort your groups by indexing date
>> >> > ascending, and group.limit=1 to keep only the oldest one.
>> >> > You can even use group.format=simple to serve results as if no
>> >> > collapsing occurred, and use group.ngroups (/!\ could be expensive
>> >> > /!\) to get the real number of deduplicated documents.
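>> >> >
>> >> > For example, guessing at field names (say the signature lands in a
>> >> > "signature" field and the indexing date in "timestamp"), the query
>> >> > would look something like:
>> >> >
>> >> > q=*:*&group=true&group.field=signature&group.sort=timestamp asc
>> >> >     &group.limit=1&group.format=simple&group.ngroups=true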
>> >> >
>> >> > Of course the index will be larger; as I said, I made no assumptions
>> >> > regarding your operating requirements. And search can be a bit
>> >> > slower, depending on the average rate of duplicated documents.
>> >> > But you've got your issue addressed by configuration tuning only...
>> >> > Depending on your project's sizing, it could be time-saving.
>> >> >
>> >> > The advantage is that you have the precious information of what
>> >> > content is duplicated from where :-)
>> >> >
>> >> > Hope this helps,
>> >> >
>> >> > --
>> >> > Tanguy
>> >> >
>> >> > On 28/12/2011 15:45, Alexander Aristov wrote:
>> >> >
>> >> >> Thanks Erick,
>> >> >>
>> >> >> It sets my direction. I will write a new plugin, get back to the dev
>> >> >> forum with results, and then we will decide next steps.
>> >> >>
>> >> >> Best Regards
>> >> >> Alexander Aristov
>> >> >>
>> >> >>
>> >> >> On 28 December 2011 18:08, Erick Erickson
>> >> >> <erickerick...@gmail.com> wrote:
>> >> >>
>> >> >>> Well, the short answer is that nobody else has
>> >> >>> 1> had a similar requirement
>> >> >>> AND
>> >> >>> 2> not found a suitable workaround
>> >> >>> AND
>> >> >>> 3> implemented the change and contributed it back.
>> >> >>>
>> >> >>> So, if you'd like to volunteer <G>.....
>> >> >>>
>> >> >>> Seriously. If you think this would be valuable and are
>> >> >>> willing to work on it, hop on over to the dev list and
>> >> >>> discuss it, open a JIRA and make it work. I'd start
>> >> >>> by opening a discussion on the dev list before
>> >> >>> opening a JIRA, just to get a sense of where the
>> >> >>> snags would be in changing the Solr code, but that's
>> >> >>> optional.
>> >> >>>
>> >> >>> That said, writing your own update request processor
>> >> >>> that detects this case isn't very difficult:
>> >> >>> extend UpdateRequestProcessorFactory/UpdateRequestProcessor
>> >> >>> and use it as a plugin.
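>> >> >>>
>> >> >>> The factory skeleton is tiny; something like this (names made up,
>> >> >>> and package names vary a bit across Solr versions):
>> >> >>>
>> >> >>> import org.apache.solr.request.SolrQueryRequest;
>> >> >>> import org.apache.solr.response.SolrQueryResponse;
>> >> >>> import org.apache.solr.update.processor.UpdateRequestProcessor;
>> >> >>> import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
>> >> >>>
>> >> >>> public class SkipExistingFactory extends UpdateRequestProcessorFactory {
>> >> >>>   @Override
>> >> >>>   public UpdateRequestProcessor getInstance(SolrQueryRequest req,
>> >> >>>                                             SolrQueryResponse rsp,
>> >> >>>                                             UpdateRequestProcessor next) {
>> >> >>>     // hand the request's searcher to a processor that looks up the
>> >> >>>     // unique key and skips the add when the doc is already indexed
>> >> >>>     return new SkipExistingProcessor(req.getSearcher(), next);
>> >> >>>   }
>> >> >>> }
>> >> >>>
>> >> >>> where SkipExistingProcessor is your UpdateRequestProcessor subclass
>> >> >>> overriding processAdd. Then register the factory in an
>> >> >>> updateRequestProcessorChain in solrconfig.xml and point your update
>> >> >>> handler's update.chain at it.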
>> >> >>>
>> >> >>> Best
>> >> >>> Erick
>> >> >>>
>> >> >>> On Wed, Dec 28, 2011 at 6:46 AM, Alexander Aristov
>> >> >>> <alexander.aris...@gmail.com>  wrote:
>> >> >>>
>> >> >>>> the problem with dedupe (SignatureUpdateProcessor) is that it
>> >> >>>> REPLACES old docs. I have tried it already.
>> >> >>>>
>> >> >>>> Best Regards
>> >> >>>> Alexander Aristov
>> >> >>>>
>> >> >>>>
>> >> >>>> On 28 December 2011 13:04, Lance Norskog <goks...@gmail.com> wrote:
>> >> >>>>
>> >> >>>>> The SignatureUpdateProcessor is for exactly this problem:
>> >> >>>>> http://wiki.apache.org/solr/Deduplication
>> >> >>>>>
>> >> >>>>> On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov
>> >> >>>>> <alexander.aris...@gmail.com> wrote:
>> >> >>>>>
>> >> >>>>>> I get docs from external sources and the only place I keep them
>> >> >>>>>> is the Solr index. I have no database or other means to track
>> >> >>>>>> indexed docs (my personal opinion is that it would be a huge
>> >> >>>>>> headache).
>> >> >>>>>>
>> >> >>>>>> Some docs might change slightly in their original sources, but I
>> >> >>>>>> don't need those changes. In fact I need the original data only.
>> >> >>>>>>
>> >> >>>>>> So I have no other way but to either check if a document is
>> >> >>>>>> already in the index before I put it into the SolrJ array (read:
>> >> >>>>>> query Solr), or develop my own update chain processor, implement
>> >> >>>>>> the ID check there, and skip such docs.
>> >> >>>>>>
>> >> >>>>>> Maybe it's the wrong place to argue, and probably it's been
>> >> >>>>>> discussed before, but I wonder why the simple overwrite parameter
>> >> >>>>>> doesn't work here. In my opinion it suits here perfectly. In
>> >> >>>>>> combination with a unique ID it can cover all possible variants.
>> >> >>>>>>
>> >> >>>>>> cases:
>> >> >>>>>>
>> >> >>>>>> 1. overwrite=true and uniqueID exists: the newer doc should
>> >> >>>>>> overwrite the old one.
>> >> >>>>>>
>> >> >>>>>> 2. overwrite=false and uniqueID exists: the newer doc must be
>> >> >>>>>> skipped since the old one exists.
>> >> >>>>>>
>> >> >>>>>> 3. uniqueID doesn't exist: the newer doc just gets added,
>> >> >>>>>> regardless of whether an old one exists or not.
>> >> >>>>>>
>> >> >>>>>> Best Regards
>> >> >>>>>> Alexander Aristov
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> On 27 December 2011 22:53, Erick Erickson
>> >> >>>>>> <erickerick...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>>> Mikhail is right. As far as I know, the assumption built into
>> >> >>>>>>> Solr is that duplicate IDs (when <uniqueKey> is defined) should
>> >> >>>>>>> trigger the old document to be replaced.
>> >> >>>>>>>
>> >> >>>>>>> What is your system-of-record? By that I mean, what does your
>> >> >>>>>>> SolrJ program do to send data to Solr? Is there any way you
>> >> >>>>>>> could just *not* send documents that are already in the Solr
>> >> >>>>>>> index based on, for instance, any timestamp associated with your
>> >> >>>>>>> system-of-record and the last time you did an incremental index?
>> >> >>>>>>>
>> >> >>>>>>> Best
>> >> >>>>>>> Erick
>> >> >>>>>>>
>> >> >>>>>>> On Tue, Dec 27, 2011 at 6:38 AM, Alexander Aristov
>> >> >>>>>>> <alexander.aris...@gmail.com>  wrote:
>> >> >>>>>>>
>> >> >>>>>>>> Hi
>> >> >>>>>>>>
>> >> >>>>>>>> I am not using a database. All the data I need is in the Solr
>> >> >>>>>>>> index; that's why I want to skip excessive checks.
>> >> >>>>>>>>
>> >> >>>>>>>> I will check DIH, but I am not sure it helps.
>> >> >>>>>>>>
>> >> >>>>>>>> I am fluent in Java and it's not a problem for me to write a
>> >> >>>>>>>> class or so, but I want to check first whether there are any
>> >> >>>>>>>> ways (workarounds) to make it work without coding, just by
>> >> >>>>>>>> playing around with configuration and params. I don't want to
>> >> >>>>>>>> go away from the default Solr implementation.
>> >> >>>>>>>>
>> >> >>>>>>>> Best Regards
>> >> >>>>>>>> Alexander Aristov
>> >> >>>>>>>>
>> >> >>>>>>>>
>> >> >>>>>>>> On 27 December 2011 09:33, Mikhail Khludnev
>> >> >>>>>>>> <mkhlud...@griddynamics.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>>> On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov
>> >> >>>>>>>>> <alexander.aris...@gmail.com> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>>> Hi people,
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> I urgently need your help!
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> I have solr 3.3 configured and running. I do incremental
>> >> >>>>>>>>>> indexing 4 times a day using bulk updates. Some documents are
>> >> >>>>>>>>>> identical to some extent and I wish to skip them, not index
>> >> >>>>>>>>>> them.
>> >> >>>>>>>>>> But here is the problem: I could not find a way to tell Solr
>> >> >>>>>>>>>> to ignore new duplicate docs and keep the old indexed docs. I
>> >> >>>>>>>>>> don't care that it's newer. Just determine by ID that such a
>> >> >>>>>>>>>> document is in the index already, and that's it.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> I use SolrJ for indexing. I have tried setting
>> >> >>>>>>>>>> overwrite=false and the dedupe approach, but nothing helped
>> >> >>>>>>>>>> me. Either a newer doc overwrites the old one or I get a
>> >> >>>>>>>>>> duplicate.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> I think it's a very simple and basic feature and it must
>> >> >>>>>>>>>> exist. What did I do wrong, or what didn't I do?
>> >> >>>>>>>>>>
>> >> >>>>>>>>> I guess it's because the mainstream approach is delta-import,
>> >> >>>>>>>>> where you have "updated" timestamps in your DB and a
>> >> >>>>>>>>> "last-import" timestamp stored somewhere. You can check how it
>> >> >>>>>>>>> works in DIH.
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>> Tried Google but I couldn't find a solution there, although
>> >> >>>>>>>>>> many people have encountered this problem.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>> It definitely can be done by overriding
>> >> >>>>>>>>> o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand),
>> >> >>>>>>>>> but I suggest starting by implementing your own
>> >> >>>>>>>>> http://wiki.apache.org/solr/UpdateRequestProcessor - search
>> >> >>>>>>>>> for the PK and bypass the chain call if it's found. Then, if
>> >> >>>>>>>>> you meet performance issues querying your PKs one by one (but
>> >> >>>>>>>>> only after that), you can batch your searches; there are a
>> >> >>>>>>>>> couple of optimization techniques for huge disjunction queries
>> >> >>>>>>>>> like PK:(2 OR 4 OR 5 OR 6).
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>> I am starting to think that I must query the index to check
>> >> >>>>>>>>>> whether a doc to be added is already there, and not add it to
>> >> >>>>>>>>>> the array, but I have so many docs that I am afraid it's not
>> >> >>>>>>>>>> a good solution.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> Best Regards
>> >> >>>>>>>>>> Alexander Aristov
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> --
>> >> >>>>>>>>> Sincerely yours
>> >> >>>>>>>>> Mikhail Khludnev
>> >> >>>>>>>>> Lucid Certified
>> >> >>>>>>>>> Apache Lucene/Solr Developer
>> >> >>>>>>>>> Grid Dynamics
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>
>> >> >>>>> --
>> >> >>>>> Lance Norskog
>> >> >>>>> goks...@gmail.com
>> >> >>>>>
>> >> >>>>>
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Sincerely yours
>> > Mikhail Khludnev
>> > Lucid Certified
>> > Apache Lucene/Solr Developer
>> > Grid Dynamics
>> >
>> > <http://www.griddynamics.com>
>> >  <mkhlud...@griddynamics.com>
>>