Even further downstream, it still all looks good:

>>>>>>
Jetty started.
Starting crawler...
Scheduled job start; requestMinimum = true
Starting job with requestMinimum = true
<<<<<<
So at the moment I am at a loss.

Karl

On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <[email protected]> wrote:

> Hi Radko,
>
> I set the same settings you did and instrumented the code. It records the
> minimum job request:
>
> >>>>>>
> Jetty started.
> Starting crawler...
> Scheduled job start; requestMinimum = true
> Starting job with requestMinimum = true
> <<<<<<
>
> This is the first run of the job, and the first time the schedule has been
> used, just in case you are convinced this has something to do with
> scheduled vs. non-scheduled job runs.
>
> I am going to add more instrumentation to see if there is any chance
> there's a problem further downstream.
>
> Karl
>
> On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko <[email protected]> wrote:
>
>> Thanks a lot, Karl!
>>
>> Here are the steps I took:
>>
>> 1. Ran the job manually – it took a few hours.
>> 2. Manually did a "minimal" run of the same job – it was done in a minute.
>> 3. Set up a scheduled "minimal" run – it again took a few hours, as in
>>    the first step.
>> 4. Scheduled runs on the following days were fast, as in step 2.
>>
>> Thanks for your comments; I'll continue with it on Monday.
>>
>> Have a nice weekend,
>> Radko
>>
>> From: Karl Wright <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Friday 8 April 2016 at 17:18
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Scheduled ManifoldCF jobs
>>
>> Also, going back in this thread a bit, let's make sure we are on the same
>> page:
>>
>> >>>>>>
>> I want to schedule these jobs for daily runs. I'm experiencing that the
>> first scheduled run takes the same time as when I ran the job for the
>> first time manually. It seems it is recrawling all documents. Subsequent
>> scheduled runs are fast, a few minutes. Is this expected behaviour?
>> <<<<<<
>>
>> If the first scheduled run is a complete crawl (meaning you did not
>> select the "Minimal" setting for the schedule record), you *can* expect the
>> job to look at all the documents. The reason is that Documentum does
>> not give us any information about document deletions. We have to figure
>> that out ourselves, and the only way to do it is to look at all the
>> individual documents. The documents do not have to actually be crawled,
>> but the connector *does* need to at least assemble its version identifier
>> string, which requires an interaction with Documentum.
>>
>> So unless you have "Minimal" crawls selected everywhere, which will never
>> detect deletions, you have to live with the time spent looking for
>> deletions. We recommend that you do this at least occasionally, but
>> certainly you wouldn't want to do it more than a couple of times a month,
>> I would think.
>>
>> Hope this helps.
>> Karl
>>
>> On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <[email protected]> wrote:
>>
>>> There's one slightly funky thing about the Documentum connector: it
>>> tries to compensate for clock skew as follows:
>>>
>>> >>>>>>
>>> // There seems to be some unexplained slop in the latest DCTM version.
>>> // It misses documents depending on how close to the r_modify_date you happen to be.
>>> // So, I've decreased the start time by a full five minutes, to insure overlap.
>>> if (startTime > 300000L)
>>>   startTime = startTime - 300000L;
>>> else
>>>   startTime = 0L;
>>> StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>>>   " and r_modify_date<=" + buildDateString(seedTime) +
>>>   " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>>> <<<<<<
>>>
>>> The 300000 ms adjustment is five minutes, which doesn't seem like a lot,
>>> but maybe it is affecting your testing?
>>>
>>> Karl
>>>
>>> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Radko,
>>>>
>>>> There's no magic here; the seedingversion from the database is passed
>>>> to the connector method which seeds documents.
>>>> The only way this version
>>>> gets cleared is if you save the job and the document specification changes.
>>>>
>>>> The only other possibility I can think of is that the Documentum
>>>> connector is ignoring the seedingversion information. I will look into
>>>> this further over the weekend.
>>>>
>>>> Karl
>>>>
>>>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> thanks for your clarification.
>>>>>
>>>>> I'm not changing any document specification information. I just set
>>>>> "Scheduled time" and "Job invocation" on the "Scheduling" tab, set
>>>>> "Start method" on the "Connection" tab, and click the "Save" button.
>>>>> That's all.
>>>>>
>>>>> I tried setting all the scheduling information directly in the Postgres
>>>>> database, to be sure I didn't change any document specification
>>>>> information, and the result was the same: all documents were recrawled.
>>>>>
>>>>> One more thing I tried was updating "seedingversion" in the "jobs"
>>>>> table, but again all documents were recrawled.
>>>>>
>>>>> Thanks,
>>>>> Radko
>>>>>
>>>>> From: Karl Wright <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Friday 1 April 2016 at 14:30
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>
>>>>> Sorry, that response was *almost* incoherent. :-)
>>>>>
>>>>> Trying again:
>>>>>
>>>>> As far as how MCF computes incremental changes, it does not matter
>>>>> whether a job is run on schedule or manually. But if you change certain
>>>>> aspects of the job, namely the document specification information, MCF
>>>>> "starts over" at the beginning of time. It needs to do that because you
>>>>> might well have made changes to the document specification that could
>>>>> change the way documents are indexed.
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Radko,
>>>>>>
>>>>>> As far as how MCF does job crawling is concerned, it does not care
>>>>>> whether the job is run manually or by schedule.
>>>>>>
>>>>>> The issue is likely to be that you changed some other detail of the
>>>>>> job definition that might have affected how documents are indexed. In
>>>>>> that case, MCF would recrawl all documents because of that. Changes to
>>>>>> a job's document specification information will cause that to be the
>>>>>> case.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a few jobs crawling documents from Documentum. Some of these
>>>>>>> jobs are quite big, and the first run of the job takes a few hours or
>>>>>>> a day to finish. Then, when I do a "minimal run" for updates, the job
>>>>>>> is usually done in a few minutes.
>>>>>>>
>>>>>>> I want to schedule these jobs for daily runs. I'm experiencing that
>>>>>>> the first scheduled run takes the same time as when I ran the job for
>>>>>>> the first time manually. It seems it is recrawling all documents.
>>>>>>> Subsequent scheduled runs are fast, a few minutes. Is this expected
>>>>>>> behaviour? I would expect the first scheduled run to be fast too,
>>>>>>> because the job had already been completed by a manual start. Is there
>>>>>>> a way to avoid recrawling all documents in this case? It's a really
>>>>>>> time-consuming operation.
>>>>>>>
>>>>>>> My settings:
>>>>>>> Schedule type: Scan every document once
>>>>>>> Job invocation: Minimal
>>>>>>> Scheduled time: once a day
>>>>>>> Start method: Start when schedule window starts
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Radko
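[Editor's sketch] The clock-skew compensation Karl quotes from the Documentum connector can be reduced to a standalone, testable form. This is not the actual connector source: `buildDateString` below is a simplified stand-in for the connector's real helper, and the clause omits the deletion/full-text conditions from the original snippet. It only illustrates the mechanic: the window start is pulled back by a full five minutes (300000 ms), clamped at the epoch, so that documents whose `r_modify_date` lands near the previous seeding time are not missed.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SeedWindowSketch {
  // Five minutes in milliseconds, matching the 300000L in the quoted snippet.
  static final long OVERLAP_MS = 300000L;

  // Pull the window start back by five minutes to compensate for clock
  // skew, clamping at the epoch exactly as the quoted code does.
  static long adjustStartTime(long startTime) {
    return (startTime > OVERLAP_MS) ? startTime - OVERLAP_MS : 0L;
  }

  // Simplified, hypothetical stand-in for the connector's buildDateString
  // helper; the real one emits a Documentum DQL date literal.
  static String buildDateString(long millis) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    return "date('" + fmt.format(new Date(millis)) + "','yyyy-mm-dd hh:mi:ss')";
  }

  // Assemble just the r_modify_date window portion of the seeding query.
  static String buildWindowClause(long startTime, long seedTime) {
    long adjusted = adjustStartTime(startTime);
    return " where r_modify_date>=" + buildDateString(adjusted)
         + " and r_modify_date<=" + buildDateString(seedTime);
  }

  public static void main(String[] args) {
    long seedTime = System.currentTimeMillis();
    long lastStart = seedTime - 3600000L; // pretend the last run was an hour ago
    System.out.println(buildWindowClause(lastStart, seedTime));
  }
}
```

A consequence of this design, relevant to the thread: any incremental run already re-examines a five-minute overlap window, so documents modified in that window are deliberately seen twice rather than risked being missed.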

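[Editor's sketch] The seedingversion behavior Karl describes (the stored version is handed back to the connector on the next run, and is cleared only when the job's document specification changes, forcing a full re-seed) can be modeled generically. This is an illustrative model only, not the ManifoldCF API; the class and method names are hypothetical, and the version is represented as a simple timestamp string.

```java
import java.util.HashMap;
import java.util.Map;

public class SeedingVersionSketch {
  // Framework-side state: one opaque seeding version per job, as the
  // thread describes for the "seedingversion" column in the "jobs" table.
  private final Map<String, String> seedingVersionByJob = new HashMap<>();

  // Connector-side decision: with no recorded version, seed from the
  // beginning of time; otherwise seed only documents modified since the
  // recorded version. The new version is recorded for the next run.
  public String seed(String jobId, long seedTime) {
    String lastVersion = seedingVersionByJob.get(jobId);
    String mode = (lastVersion == null)
        ? "FULL seed from beginning of time"
        : "INCREMENTAL seed since " + lastVersion;
    seedingVersionByJob.put(jobId, Long.toString(seedTime));
    return mode;
  }

  // A document specification change invalidates the stored version, so
  // the next run starts over -- the "starts over at the beginning of
  // time" behavior discussed in the thread.
  public void onSpecificationChange(String jobId) {
    seedingVersionByJob.remove(jobId);
  }
}
```

Under this model, Radko's symptom (a full recrawl on the first scheduled run despite an unchanged specification) would mean either the version was cleared unexpectedly or the connector ignored it, which is exactly the pair of hypotheses Karl is instrumenting for.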