Hi Karl,

I think the information coming from the CMS publishing logs and from the NTFS master file table is accurate :) We just need to handle it properly. What I'm missing currently:

for 1) - an option "Enable Delete for initial seeding", true/false (default "false")

for 2b) - a query parameter for the JSON DELETE request: jobs/<job_id>?purgeindex=<true|false> (default "true")

I guess it's worth doing, because it would improve incremental indexing enormously: e.g. several days (for file shares) vs. several dozen seconds.
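To make the 2b) request concrete, here is a minimal Python sketch of the proposed call. It assumes a default mcf-api-service deployment for the base URL; the purgeindex parameter is only what is being proposed in this thread, not an existing ManifoldCF API feature, and delete_job is just a hypothetical helper name.

import urllib.request

# Proposed job deletion that can keep the job's documents in the
# search index. NOTE: purgeindex is the query parameter proposed
# above, NOT an existing ManifoldCF API feature; the base URL
# assumes a default mcf-api-service deployment.
def delete_job(job_id, purge_index=True,
               base_url="http://localhost:8345/mcf-api-service/json"):
    url = "%s/jobs/%s?purgeindex=%s" % (
        base_url, job_id, "true" if purge_index else "false")
    req = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Re-create the "incremental" job each run without wiping the index:
# delete_job("1234567890", purge_index=False)

With the proposed default of purgeindex=true, the behavior would stay exactly as it is today, so existing clients would not be affected.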
Thanks!

--
rgds,
Konstantin

2016-08-03 12:29 GMT+02:00 Karl Wright <[email protected]>:

> The crawler is supposed to have an accurate idea of what's been indexed. If
> it doesn't, then any incremental decisions it makes will probably be wrong.
> It sounds like you're trying to make it work with inaccurate information,
> so yes, I don't see any good way to make that work.
>
> Effectively, you need to have the crawler be the one that fills up the index
> in the first place; after that it should all be possible to do what you want.
>
> Karl
>
> On Wed, Aug 3, 2016 at 6:15 AM, jetnet <[email protected]> wrote:
>
>> Hi All,
>>
>> I'm trying to find a way to reduce the time spent on incremental runs of
>> the crawler (HTTP, file system, file share) by creating a list of modified
>> files (created/modified and deleted).
>>
>> The challenge is how to supply the crawler with such a list.
>>
>> There are great interfaces (the JSON API and the scripting language) which
>> could be used for that, but:
>>
>> 1) no deletion command gets sent to the index for not-found (deleted
>> files) entries from the modification list, if the crawler hasn't indexed
>> these files before;
>>
>> 2a) re-using one "incremental" job: the crawler would delete the previously
>> indexed documents if they no longer appear on the modification list;
>>
>> 2b) re-creating the "incremental" job every time: the crawler would delete
>> ALL previously indexed docs from the index when the job gets deleted.
>>
>> So, currently I see no way to do incremental indexing based on a
>> modification list without extending the functionality of the framework,
>> or maybe I missed something and there are features I'm not aware of?
>>
>> Thanks!
>>
>> --
>> rgds,
>> Konstantin
