Hi All,


I’m trying to find a way to reduce the time spent on incremental runs of
the crawler (HTTP, file system, file share) by supplying it with a list of
changed files (created, modified, and deleted).
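For context, here is one way such a modification list could be produced, as a minimal sketch: take a snapshot of the tree on each run and diff it against the previous one, using mtimes for change detection. The function names and the mtime-based approach are illustrative assumptions, not anything the framework provides.

```python
import os

def snapshot(root):
    """Walk a directory tree and record each file's mtime (illustrative)."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            state[path] = os.path.getmtime(path)
    return state

def diff_snapshots(old, new):
    """Compare two snapshots; return (created, modified, deleted) path lists."""
    created = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return created, modified, deleted
```

Persist the previous snapshot between runs (e.g. as JSON) and feed the resulting three lists to the crawler — which is exactly where I'm stuck, as described below.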

The challenge is how to supply the crawler with such a list.

There are useful interfaces (the JSON API and the scripting language) that
could be used for this, but:



1) no deletion command is sent to the index for not-found (deleted)
entries on the modification list if the crawler has never indexed those
files before

2a) re-using a single “incremental” job: the crawler would delete
previously indexed documents if they no longer appear on the modification list

2b) re-creating the “incremental” job on every run: the crawler would delete
ALL previously indexed documents from the index when the job is deleted



So, at the moment I see no way to do incremental indexing based on a
modification list without extending the framework's functionality. Or have I
missed something, and there are features I’m not aware of?

Thanks!


--

rgds,

Konstantin
