The kind of crawl ManifoldCF does depends on the model the connector uses. If the connector does not get informed about deletes, then the connector is *forced* to find them by checking each document to see if it went away. Most repositories cannot report deletions to the connector, I'm afraid.
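To make that concrete, here is a rough sketch in Java of why a full recheck is needed to find deletes, and why rechecking is much cheaper than reindexing. This is not the ManifoldCF connector API: the class RecheckSketch, the crawl method, and the idea of a "version string" per document (for example a modify date) are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; none of these names come from ManifoldCF. */
public class RecheckSketch {

    /** What one crawl pass did. */
    public static final class CrawlResult {
        public int checked = 0;                               // cheap version lookups
        public final List<String> reindexed = new ArrayList<>(); // expensive fetch + index
        public final List<String> deleted = new ArrayList<>();   // removed from the index
    }

    /**
     * One crawl pass for a repository that cannot report deletes.
     *
     * @param repository current repository state: document id -> version string
     * @param lastSeen   versions recorded by the previous crawl; updated in place
     */
    public static CrawlResult crawl(Map<String, String> repository,
                                    Map<String, String> lastSeen) {
        CrawlResult result = new CrawlResult();

        // Every previously seen document must be rechecked, because a missing
        // document is the only signal that it was deleted.
        for (String id : new ArrayList<>(lastSeen.keySet())) {
            result.checked++;
            String current = repository.get(id);
            if (current == null) {
                lastSeen.remove(id);
                result.deleted.add(id);
            } else if (!current.equals(lastSeen.get(id))) {
                // Only a changed version string triggers the expensive reindex.
                lastSeen.put(id, current);
                result.reindexed.add(id);
            }
        }

        // Documents not seen before are new, so they are indexed once.
        for (Map.Entry<String, String> entry : repository.entrySet()) {
            if (!lastSeen.containsKey(entry.getKey())) {
                result.checked++;
                lastSeen.put(entry.getKey(), entry.getValue());
                result.reindexed.add(entry.getKey());
            }
        }
        return result;
    }
}
```

In this sketch a second pass over an unchanged repository still touches every document (checked goes up) but reindexes nothing, which is the difference between rechecking and reindexing.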
There are some ways you can get around this. The best is by using a continuous crawl with expiration. ManifoldCF will then just do some of the crawl during each time window it is given and never try to clean up dead documents, other than by expiring them.

You can read more about the various crawl models in ManifoldCF in Action.

Karl

On Tue, Feb 12, 2013 at 11:55 AM, Mark Lugert <[email protected]> wrote:
> If you have 3 million documents then each time you run a crawl it will check
> each document that matches your query correct?
>
> Just want to make sure I understand. That could really take a lot of time.
>
> Wouldn't it be better to store a last crawled date and then limit the query
> based on that date so your only indexing things the repo server says have
> changed? The current method seems better suited to things like
> websites/wikis where you can't really query based on modified dates.
>
> -mark
>
> From: Karl Wright <[email protected]>
> To: [email protected]; Mark Lugert <[email protected]>
> Sent: Monday, February 11, 2013 5:10 PM
> Subject: Re: new documents
>
> Actually, it doesn't reindex everything. It only reindexes those
> documents that have "changed", using the connector's idea of what that
> means. For SharePoint, it's the modify date, for Alfresco and CMIS I
> don't know but others on this list might.
>
> Also, don't confuse rechecking with reindexing. ManifoldCF *will*
> need to scan through the documents in many cases, but it will do a
> minimal amount of work for each one.
>
> Karl
>
> On Mon, Feb 11, 2013 at 3:35 PM, Mark Lugert <[email protected]> wrote:
>> Hi Karl,
>>
>> If I use the sharepoint, alfresco, or cmis repo connectors how can I make
>> it only index new documents that match my queries?
>>
>> Right now I'm seeing it reindex everything that matches my query every
>> time the job runs.
>>
>> I have it set to scan all documents once, but still rescans everything
>> every time I start the job.
>> Is this a config issue on my part?
>>
>> thanks,
>> Mark
