Also, going back in this thread a bit, let's make sure we are on the same page:
>>>>>>
I want to schedule these jobs for daily runs. I’m experiencing that the first scheduled run takes the same time as I ran the job for the first time manually. It seems it is recrawling all documents. Next scheduled runs are fast, a few minutes. Is it expected behaviour?
<<<<<<

If the first scheduled run is a complete crawl (meaning you did not select the "Minimal" setting for the schedule record), you *can* expect the job to look at all the documents. The reason is that Documentum does not give us any information about document deletions. We have to figure that out ourselves, and the only way to do it is to look at all the individual documents. The documents do not have to actually be crawled, but the connector *does* need to at least assemble its version identifier string, which requires an interaction with Documentum.

So unless you have "Minimal" crawls selected everywhere, which won't ever detect deletions, you have to live with the time spent looking for deletions. We recommend that you do a complete crawl at least occasionally, but I would think you certainly wouldn't want to do it more than a couple of times a month.

Hope this helps.

Karl


On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <[email protected]> wrote:

> There's one slightly funky thing about the Documentum connector that tries
> to compensate for clock skew as follows:
>
> >>>>>>
>         // There seems to be some unexplained slop in the latest DCTM version.
>         // It misses documents depending on how close to the r_modify_date you happen to be.
>         // So, I've decreased the start time by a full five minutes, to insure overlap.
>         if (startTime > 300000L)
>           startTime = startTime - 300000L;
>         else
>           startTime = 0L;
>         StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>           " and r_modify_date<=" + buildDateString(seedTime) +
>           " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>
> <<<<<<
>
> The 300000 ms adjustment is five minutes, which doesn't seem like a lot
> but maybe it is affecting your testing?
>
> Karl
>
>
> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Radko,
>>
>> There's no magic here; the seedingversion from the database is passed to
>> the connector method which seeds documents. The only way this version gets
>> cleared is if you save the job and the document specification changes.
>>
>> The only other possibility I can think of is that the Documentum
>> connector is ignoring the seedingversion information. I will look into
>> this further over the weekend.
>>
>> Karl
>>
>>
>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko <[email protected]>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> thanks for your clarification.
>>>
>>> I’m not changing any document specification information. I just set
>>> “Scheduled time” and “Job invocation” on “Scheduling” tab, “Start method”
>>> on “Connection” tab and click “Save” button. That’s all.
>>>
>>> I tried to set all the scheduling information directly in Postgres
>>> database to be sure I didn’t change any document specification
>>> information and the result was the same, all documents were recrawled.
>>>
>>> One more thing I tried was to update “seedingversion” in “jobs” table
>>> but again all documents were recrawled.
>>>
>>> Thanks,
>>> Radko
>>>
>>>
>>>
>>> From: Karl Wright <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday 1 April 2016 at 14:30
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Scheduled ManifoldCF jobs
>>>
>>> Sorry, that response was *almost* incoherent. :-)
>>>
>>> Trying again:
>>>
>>> As far as how MCF computes incremental changes, it does not matter
>>> whether a job is run on schedule, or manually. But if you change certain
>>> aspects of the job, namely the document specification information, MCF
>>> "starts over" at the beginning of time. It needs to do that because you
>>> might well have made changes to the document specification that could
>>> change the way documents are indexed.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Radko,
>>>>
>>>> For computing how MCF does job crawling, it does not care whether the
>>>> job is run manually or by schedule.
>>>>
>>>> The issue is likely to be that you changed some other detail about the
>>>> job definition that might have affected how documents are indexed. In that
>>>> case, MCF would cause all documents to be recrawled because of that.
>>>> Changes to a job's document specification information will cause that to be
>>>> the case.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a few jobs crawling documents from Documentum. Some of these
>>>>> jobs are quite big and the first run of the job takes a few hours or a day
>>>>> to finish. Then, when I do a “minimal run” for updates, the job is usually
>>>>> done in a few minutes.
>>>>>
>>>>> I want to schedule these jobs for daily runs. I’m experiencing that
>>>>> the first scheduled run takes the same time as I ran the job for the first
>>>>> time manually. It seems it is recrawling all documents.
>>>>> Next scheduled runs are fast, a few minutes. Is it expected behaviour?
>>>>> I would expect the first scheduled run to be fast too because the job
>>>>> was already finished before by manual start. Is there a way to avoid
>>>>> recrawling all documents in this case? It’s a really time-consuming
>>>>> operation.
>>>>>
>>>>> My settings:
>>>>> Schedule type: Scan every document once
>>>>> Job invocation: Minimal
>>>>> Scheduled time: once a day
>>>>> Start method: Start when schedule window starts
>>>>>
>>>>> Thank you,
>>>>> Radko
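
Footnote to the thread: the five-minute overlap in the connector snippet quoted earlier can be tried out in isolation. What follows is only a minimal sketch, not the connector's code. The class SeedWindow and its methods are hypothetical names made up for illustration; only the 300000 ms constant and the "subtract, but never go below zero" rule come from the quoted snippet. It shows which modification timestamps fall into the next seeding window, which may help when judging whether the overlap is affecting a test.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class SeedWindow {
    // Overlap taken from the quoted connector code: 300000 ms = five minutes.
    static final long OVERLAP_MS = Duration.ofMinutes(5).toMillis();

    /** Lower bound of the r_modify_date window: last seed time minus the overlap, never below zero. */
    static long adjustedStart(long lastSeedTimeMs) {
        return lastSeedTimeMs > OVERLAP_MS ? lastSeedTimeMs - OVERLAP_MS : 0L;
    }

    /** True if a document modified at modifiedMs falls inside the window [adjusted start, seedTimeMs]. */
    static boolean isReseeded(long modifiedMs, long lastSeedTimeMs, long seedTimeMs) {
        return modifiedMs >= adjustedStart(lastSeedTimeMs) && modifiedMs <= seedTimeMs;
    }

    public static void main(String[] args) {
        long lastSeed = 1_000_000L; // previous seeding time, in ms (made-up value)
        long now = 2_000_000L;      // current seeding time, in ms (made-up value)
        System.out.println(adjustedStart(lastSeed));
        // Modified 200 s before the previous seed: inside the five-minute overlap.
        System.out.println(isReseeded(800_000L, lastSeed, now));
        // Modified 400 s before the previous seed: outside the overlap.
        System.out.println(isReseeded(600_000L, lastSeed, now));
    }
}
```

In other words, a document modified up to five minutes before the previous seeding time is picked up again on the next run, so a test that modifies documents right around a job run can see them seeded twice.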
