Re: solr keep old docs

2011-12-29 Thread Alexander Aristov
I have never developed for Solr and don't know much about its internals, but today I tried one approach with a searcher. In my update processor I get the searcher and search for the ID. It works, but I need to load test it. Will index traversal be faster (less resource consuming) than a search? Best Regards

Re: solr keep old docs

2011-12-29 Thread Erick Erickson
Hmmm, we're not communicating <G>... The update processor wouldn't search in the classic sense. It would just use lower-level index traversal to determine whether the doc (identified by your unique key) was already in the index, and skip indexing that document if it was. No real *searching* involved (see

Re: solr keep old docs

2011-12-29 Thread Alexander Aristov
Well, the first results are ready. I have implemented a custom update processor following your suggestion, using a low-level index reader and TermDocs. I launched scripts which add about 10,000 docs. Indexing took about 1 minute including the commit, which is quite good for me. I don't have larger datasets so
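
For illustration, the "skip the document if its unique key is already indexed" behaviour discussed here can be modelled in a few lines. This is only a sketch in plain Python, not Solr code: the `KeepOldDocsProcessor` class, its `process_add` method, and the dictionary standing in for the index are all hypothetical names, and the dictionary membership test merely stands in for the low-level term lookup.

```python
class KeepOldDocsProcessor:
    """Toy model of an update processor that keeps the first version of a doc."""

    def __init__(self, index, unique_key="id"):
        self.index = index            # hypothetical in-memory index: key -> doc
        self.unique_key = unique_key  # field that plays the role of uniqueKey
        self.skipped = 0              # number of incoming docs that were dropped

    def process_add(self, doc):
        """Add doc unless its key is already present; return True if added."""
        key = doc[self.unique_key]
        if key in self.index:         # stands in for the term lookup on the key
            self.skipped += 1
            return False              # keep the old doc, ignore the new one
        self.index[key] = doc
        return True


index = {}
processor = KeepOldDocsProcessor(index)
processor.process_add({"id": "a", "title": "first version"})
processor.process_add({"id": "a", "title": "changed version"})  # skipped
processor.process_add({"id": "b", "title": "another doc"})
```

One difference from a real index is worth keeping in mind: the dictionary sees every added document immediately, whereas a reader opened on the committed index would presumably not see documents added earlier in the same uncommitted batch.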

Re: solr keep old docs

2011-12-29 Thread Erick Erickson
I'd guess it would be much faster, assuming that the search savings wouldn't be swamped by the additional transmission time over the wire and parsing the request (although SolrJ uses a binary format, so parsing request probably isn't all that expensive). You could even do a hybrid approach. Pack

Re: solr keep old docs

2011-12-28 Thread Lance Norskog
The SignatureUpdateProcessor is for exactly this problem: http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication On Tue, Dec 27, 2011 at 10:42 PM, Alexander Aristov alexander.aris...@gmail.com wrote: I get docs from external sources and the only place I keep

Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
The problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old docs. I have tried it already. Best Regards Alexander Aristov On 28 December 2011 13:04, Lance Norskog goks...@gmail.com wrote: The SignatureUpdateProcessor is for exactly this problem:

Re: solr keep old docs

2011-12-28 Thread Erick Erickson
Well, the short answer is that nobody else has 1) had a similar requirement AND 2) not found a suitable workaround AND 3) implemented the change and contributed it back. So, if you'd like to volunteer <G>. Seriously. If you think this would be valuable and are willing to work on it, hop on over

Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Thanks Erick, that gives me a direction. I will write a new plugin and get back to the dev forum with results, and then we will decide on next steps. Best Regards Alexander Aristov On 28 December 2011 18:08, Erick Erickson erickerick...@gmail.com wrote: Well, the short answer is that nobody

Re: solr keep old docs

2011-12-28 Thread Tanguy Moal
Hello Alexander, I don't know much about your requirements in terms of size and performance, but I've had a similar use case and found a pretty simple workaround. If your duplicate rate is not too high, you can have the SignatureProcessor generate a fingerprint of documents (you already did
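
As a rough picture of what a document fingerprint means here, the sketch below hashes a chosen set of fields so that documents agreeing on those fields get the same value. It is a plain-Python illustration under stated assumptions, not the SignatureUpdateProcessor itself: the `fingerprint` helper, the field names, and the choice of MD5 are all made up for the example.

```python
import hashlib


def fingerprint(doc, fields):
    """Return a hex digest computed over the selected fields of doc."""
    digest = hashlib.md5()
    for name in fields:
        value = str(doc.get(name, ""))
        digest.update(value.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


# Hypothetical documents: same title and body, fetched from different URLs.
doc_a = {"url": "http://example.com/1", "title": "Hello", "body": "Same text"}
doc_b = {"url": "http://example.com/2", "title": "Hello", "body": "Same text"}
doc_c = {"url": "http://example.com/3", "title": "Hello", "body": "Other text"}

sig_a = fingerprint(doc_a, ["title", "body"])
sig_b = fingerprint(doc_b, ["title", "body"])
sig_c = fingerprint(doc_c, ["title", "body"])
```

Here `sig_a` and `sig_b` are equal because the URL is left out of the signature fields, while `sig_c` differs because the body differs.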

Re: solr keep old docs

2011-12-28 Thread Chris Hostetter
: That said, writing your own update request handler : that detected this case isn't very difficult, : extend UpdateRequestProcessorFactory/UpdateRequestProcessor : and use it as a plugin. i can't find the thread at the moment, but the general issue that has caused people headaches with this

Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Unfortunately I have a lot of duplicates, and given that searching might suffer, I will try implementing an update processor. But your idea is interesting and I will consider it, thanks. Best Regards Alexander Aristov On 28 December 2011 19:12, Tanguy Moal tanguy.m...@gmail.com wrote: Hello

Re: solr keep old docs

2011-12-28 Thread Alexander Aristov
Yes, I have been warned that querying the index each time before adding a doc might be resource consuming. I will check it. As for the overwrite parameter, I think the name is not the best, then. People outside the business, like me, misuse it and assume what I wrote. Overwrite should mean what it says.

Re: solr keep old docs

2011-12-28 Thread Mikhail Khludnev
Alexander, I have two ideas for implementing fast dedupe externally, assuming your PKs don't fit into a java.util.*Map: - your crawler can use an in-process RDBMS (Derby, H2) to track dupes; - if your crawler is stateless - it doesn't track PKs which have already been crawled, you can retrieve
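
The first idea, tracking already-seen primary keys in an embedded database, might look roughly like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for Derby or H2, which is a substitution made purely to keep the example self-contained; the `seen` table, the `is_new` helper, and the sample keys are hypothetical.

```python
import sqlite3

# In-memory database for the example; a crawler would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seen (pk TEXT PRIMARY KEY)")


def is_new(pk):
    """Record pk and return True only the first time it is seen."""
    cur = conn.execute("INSERT OR IGNORE INTO seen (pk) VALUES (?)", (pk,))
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the key already existed


# Hypothetical stream of crawled primary keys, with repeats.
crawled = ["doc-1", "doc-2", "doc-1", "doc-3", "doc-2"]
to_index = [pk for pk in crawled if is_new(pk)]
```

Only the keys that pass `is_new` would be sent on for indexing, so the index never receives a second version of a document it already holds.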

Re: solr keep old docs

2011-12-27 Thread Alexander Aristov
Hi, I am not using a database. All needed data is in the Solr index, which is why I want to skip excessive checks. I will check DIH but I'm not sure it helps. I am fluent in Java and it's not a problem for me to write a class or so, but I want to check first whether there are any ways (workarounds) to

Re: solr keep old docs

2011-12-27 Thread Erick Erickson
Mikhail is right as far as I know; the assumption built into Solr is that duplicate IDs (when uniqueKey is defined) should trigger the old document to be replaced. What is your system-of-record? By that I mean what does your SolrJ program do to send data to Solr? Is there any way you could just

Re: solr keep old docs

2011-12-27 Thread Alexander Aristov
I get docs from external sources and the only place I keep them is the Solr index. I have no database or other means to track indexed docs (my personal opinion is that it might be a huge headache). Some docs might change slightly in their original sources but I don't need those changes. In fact I

solr keep old docs

2011-12-26 Thread Alexander Aristov
Hi people, I urgently need your help! I have Solr 3.3 configured and running. I do incremental indexing 4 times a day using bulk updates. Some documents are identical to some extent and I wish to skip them, not index them. But here is the problem: I could not find a way to tell Solr to ignore new

Re: solr keep old docs

2011-12-26 Thread Mikhail Khludnev
On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Hi people, I urgently need your help! I have solr 3.3 configured and running. I do incremental indexing 4 times a day using bulk updates. Some documents are identical to some extent and I wish to skip