Re: Performing DIH on predefined list of IDS

Mikhail Khludnev Fri, 20 Feb 2015 12:58:46 -0800

It's a little hard to get the overall context, e.g. why you accept OOMEs as
routine, why you pull data from one index into another, and what gets added
during this process.


Make sure that you are aware of
http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor
which queries another Solr instance, and
http://wiki.apache.org/solr/DataImportHandler#LogTransformer
which you can use to log recently imported ids, so that you can restart
indexing from that point.
You can drop me more details in your native language if you wish.

On Fri, Feb 20, 2015 at 1:32 PM, SolrUser1543 <osta...@gmail.com> wrote:

> Relatively frequently (about once a month) we need to reindex the data by
> using DIH to copy it from one index to another.
> Because we have a large index, this can take from 12 to 24 hours to
> complete. At the same time the old index is being queried by users.
> Sometimes DIH is interrupted in the middle by an unexpected exception
> caused by OutOfMemory or something else (many times it failed when more
> than 90% was completed).
> Moreover, almost every time, some items are missing from the new index.
> It is very complicated to find them.
> At this stage I can't be sure exactly which documents were missed, so I
> have to run it again and wait for many hours. At the same time the old
> index constantly receives new items.
>
> I want to suggest the following way to solve the problem:
> •       Get a list of all item ids (call the Lucene API, like CLUE does,
> for example).
> •       Start DIH, which will iterate over those ids and each time make a
> query for n items.
> 1.      Of course the original DIH class would have to be changed to
> support it.
> •       This will give the following advantages:
> 1.      I will know exactly which items failed.
> 2.      I can restart the process from any point and, in case of a DIH
> failure, restart it from the point of failure.
>
>
> So the main difference is that DIH currently runs on a *:* query, and I
> suggest running it on a list of IDs.
>
> For example, if I have 1000 docs and want this new DIH to take 100 docs
> each time, it will do 10 queries, each with 100 IDs (like
> id:(1 2 3 ... 100) then id:(101 102 ... 200) etc.).
>
> The question is: what do you think about it? Or could all of this be done
> another way, and I am trying to reinvent the wheel?
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Performing-DIH-on-predefined-list-of-IDS-tp4187589.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
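
For illustration only, the batching scheme described in the quoted message
(split a known list of ids into groups of n, build one id:(...) query per
group, remember which groups failed) could be sketched roughly as below.
This is not DIH code. The names batch_ids, id_query, run_import and the
import_batch callback are hypothetical, and the sketch assumes that ids are
simple tokens needing no query escaping and that the default query operator
is OR, so a space-separated list matches any of the ids.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batch_ids(ids: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size ids."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def id_query(batch: List[str], field: str = "id") -> str:
    """Build a query such as id:(1 2 3) for one group of ids.

    Assumes plain ids (no escaping) and a default operator of OR.
    """
    return f"{field}:({' '.join(batch)})"


def run_import(
    ids: Iterable[str],
    batch_size: int,
    import_batch: Callable[[str], None],
) -> List[List[str]]:
    """Call import_batch once per query and return the groups that failed.

    import_batch is a hypothetical callback that performs the actual copy
    for one query; the returned groups can be retried later.
    """
    failed: List[List[str]] = []
    for batch in batch_ids(ids, batch_size):
        try:
            import_batch(id_query(batch))
        except Exception:
            failed.append(batch)
    return failed
```

With 1000 ids and a batch size of 100 this produces 10 queries, and the list
returned by run_import says exactly which groups of ids need to be retried.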



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
