Even further downstream, it still all looks good:

>>>>>>
Jetty started.
Starting crawler...
Scheduled job start; requestMinimum = true
Starting job with requestMinimum = true
<<<<<<
So at the moment I am at a loss.

Karl

On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <[email protected]> wrote:

> Hi Radko,
>
> I set the same settings you did and instrumented the code. It records the
> minimum job request:
>
> >>>>>>
> Jetty started.
> Starting crawler...
> Scheduled job start; requestMinimum = true
> Starting job with requestMinimum = true
> <<<<<<
>
> This is the first run of the job, and the first time the schedule has been
> used, just in case you are convinced this has something to do with
> scheduled vs. non-scheduled job runs.
>
> I am going to add more instrumentation to see if there is any chance
> there's a problem further downstream.
>
> Karl
>
> On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko <[email protected]> wrote:
>
>> Thanks a lot, Karl!
>>
>> Here are the steps I took:
>>
>> 1. Ran the job manually – it took a few hours.
>> 2. Manually did a "minimal" run of the same job – it was done in a minute.
>> 3. Set up a scheduled "minimal" run – it again took a few hours, as in
>>    the first step.
>> 4. Scheduled runs on the following days were fast, as in step 2.
>>
>> Thanks for your comments; I'll continue with it on Monday.
>>
>> Have a nice weekend,
>> Radko
>>
>> From: Karl Wright <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Friday 8 April 2016 at 17:18
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Scheduled ManifoldCF jobs
>>
>> Also, going back in this thread a bit, let's make sure we are on the same
>> page:
>>
>> >>>>>>
>> I want to schedule these jobs for daily runs. I'm experiencing that the
>> first scheduled run takes the same time as when I ran the job for the
>> first time manually. It seems it is recrawling all documents. Subsequent
>> scheduled runs are fast, a few minutes. Is this expected behaviour?
>> <<<<<<
>>
>> If the first scheduled run is a complete crawl (meaning you did not
>> select the "Minimal" setting for the schedule record), you *can* expect the
>> job to look at all the documents. The reason is that Documentum does
>> not give us any information about document deletions. We have to figure
>> that out ourselves, and the only way to do it is to look at all the
>> individual documents. The documents do not have to actually be crawled,
>> but the connector *does* need to at least assemble its version identifier
>> string, which requires an interaction with Documentum.
>>
>> So unless you have "Minimal" crawls selected everywhere, which will never
>> detect deletions, you have to live with the time spent looking for
>> deletions. We recommend that you do this at least occasionally, but
>> certainly you wouldn't want to do it more than a couple of times a month,
>> I would think.
>>
>> Hope this helps.
>> Karl
>>
>> On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <[email protected]> wrote:
>>
>>> There's one slightly funky thing about the Documentum connector: it
>>> tries to compensate for clock skew as follows:
>>>
>>> >>>>>>
>>> // There seems to be some unexplained slop in the latest DCTM version.
>>> // It misses documents depending on how close to the r_modify_date you happen to be.
>>> // So, I've decreased the start time by a full five minutes, to insure overlap.
>>> if (startTime > 300000L)
>>>   startTime = startTime - 300000L;
>>> else
>>>   startTime = 0L;
>>> StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>>>   " and r_modify_date<=" + buildDateString(seedTime) +
>>>   " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>>> <<<<<<
>>>
>>> The 300000 ms adjustment is five minutes, which doesn't seem like a lot,
>>> but maybe it is affecting your testing?
>>>
>>> Karl
>>>
>>> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Radko,
>>>>
>>>> There's no magic here; the seedingversion from the database is passed
>>>> to the connector method which seeds documents.
>>>> The only way this version
>>>> gets cleared is if you save the job and the document specification changes.
>>>>
>>>> The only other possibility I can think of is that the Documentum
>>>> connector is ignoring the seedingversion information. I will look into
>>>> this further over the weekend.
>>>>
>>>> Karl
>>>>
>>>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> thanks for your clarification.
>>>>>
>>>>> I'm not changing any document specification information. I just set
>>>>> "Scheduled time" and "Job invocation" on the "Scheduling" tab, set
>>>>> "Start method" on the "Connection" tab, and click the "Save" button.
>>>>> That's all.
>>>>>
>>>>> I tried setting all the scheduling information directly in the Postgres
>>>>> database, to be sure I didn't change any document specification
>>>>> information, and the result was the same: all documents were recrawled.
>>>>>
>>>>> One more thing I tried was updating "seedingversion" in the "jobs"
>>>>> table, but again all documents were recrawled.
>>>>>
>>>>> Thanks,
>>>>> Radko
>>>>>
>>>>> From: Karl Wright <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Friday 1 April 2016 at 14:30
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>
>>>>> Sorry, that response was *almost* incoherent. :-)
>>>>>
>>>>> Trying again:
>>>>>
>>>>> As far as how MCF computes incremental changes, it does not matter
>>>>> whether a job is run on schedule or manually. But if you change certain
>>>>> aspects of the job, namely the document specification information, MCF
>>>>> "starts over" at the beginning of time. It needs to do that because you
>>>>> might well have made changes to the document specification that could
>>>>> change the way documents are indexed.
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Radko,
>>>>>>
>>>>>> As far as how MCF does job crawling is concerned, it does not care
>>>>>> whether the job is run manually or by schedule.
>>>>>>
>>>>>> The issue is likely to be that you changed some other detail of the
>>>>>> job definition that might have affected how documents are indexed. In
>>>>>> that case, MCF would recrawl all documents because of that. Changes to
>>>>>> a job's document specification information will cause that to be the
>>>>>> case.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a few jobs crawling documents from Documentum. Some of these
>>>>>>> jobs are quite big, and the first run of the job takes a few hours or
>>>>>>> a day to finish. Then, when I do a "minimal run" for updates, the job
>>>>>>> is usually done in a few minutes.
>>>>>>>
>>>>>>> I want to schedule these jobs for daily runs. I'm experiencing that
>>>>>>> the first scheduled run takes the same time as when I ran the job for
>>>>>>> the first time manually. It seems it is recrawling all documents.
>>>>>>> Subsequent scheduled runs are fast, a few minutes. Is this expected
>>>>>>> behaviour? I would expect the first scheduled run to be fast too,
>>>>>>> because the job had already been completed by a manual start. Is there
>>>>>>> a way to avoid recrawling all documents in this case? It's a really
>>>>>>> time-consuming operation.
>>>>>>>
>>>>>>> My settings:
>>>>>>> Schedule type: Scan every document once
>>>>>>> Job invocation: Minimal
>>>>>>> Scheduled time: once a day
>>>>>>> Start method: Start when schedule window starts
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Radko
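[Editor's sketch] The clock-skew compensation Karl quotes from the Documentum connector can be reduced to a standalone, testable form. This is not the actual connector source: `buildDateString` below is a simplified stand-in for the connector's real helper, and the clause omits the deletion/full-text conditions from the original snippet. It only illustrates the mechanic: the window start is pulled back by a full five minutes (300000 ms), clamped at the epoch, so that documents whose `r_modify_date` lands near the previous seeding time are not missed.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SeedWindowSketch {
  // Five minutes in milliseconds, matching the 300000L in the quoted snippet.
  static final long OVERLAP_MS = 300000L;

  // Pull the window start back by five minutes to compensate for clock
  // skew, clamping at the epoch exactly as the quoted code does.
  static long adjustStartTime(long startTime) {
    return (startTime > OVERLAP_MS) ? startTime - OVERLAP_MS : 0L;
  }

  // Simplified, hypothetical stand-in for the connector's buildDateString
  // helper; the real one emits a Documentum DQL date literal.
  static String buildDateString(long millis) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    return "date('" + fmt.format(new Date(millis)) + "','yyyy-mm-dd hh:mi:ss')";
  }

  // Assemble just the r_modify_date window portion of the seeding query.
  static String buildWindowClause(long startTime, long seedTime) {
    long adjusted = adjustStartTime(startTime);
    return " where r_modify_date>=" + buildDateString(adjusted)
         + " and r_modify_date<=" + buildDateString(seedTime);
  }

  public static void main(String[] args) {
    long seedTime = System.currentTimeMillis();
    long lastStart = seedTime - 3600000L; // pretend the last run was an hour ago
    System.out.println(buildWindowClause(lastStart, seedTime));
  }
}
```

A consequence of this design, relevant to the thread: any incremental run already re-examines a five-minute overlap window, so documents modified in that window are deliberately seen twice rather than risked being missed.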

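[Editor's sketch] The seedingversion behavior Karl describes (the stored version is handed back to the connector on the next run, and is cleared only when the job's document specification changes, forcing a full re-seed) can be modeled generically. This is an illustrative model only, not the ManifoldCF API; the class and method names are hypothetical, and the version is represented as a simple timestamp string.

```java
import java.util.HashMap;
import java.util.Map;

public class SeedingVersionSketch {
  // Framework-side state: one opaque seeding version per job, as the
  // thread describes for the "seedingversion" column in the "jobs" table.
  private final Map<String, String> seedingVersionByJob = new HashMap<>();

  // Connector-side decision: with no recorded version, seed from the
  // beginning of time; otherwise seed only documents modified since the
  // recorded version. The new version is recorded for the next run.
  public String seed(String jobId, long seedTime) {
    String lastVersion = seedingVersionByJob.get(jobId);
    String mode = (lastVersion == null)
        ? "FULL seed from beginning of time"
        : "INCREMENTAL seed since " + lastVersion;
    seedingVersionByJob.put(jobId, Long.toString(seedTime));
    return mode;
  }

  // A document specification change invalidates the stored version, so
  // the next run starts over -- the "starts over at the beginning of
  // time" behavior discussed in the thread.
  public void onSpecificationChange(String jobId) {
    seedingVersionByJob.remove(jobId);
  }
}
```

Under this model, Radko's symptom (a full recrawl on the first scheduled run despite an unchanged specification) would mean either the version was cleared unexpectedly or the connector ignored it, which is exactly the pair of hypotheses Karl is instrumenting for.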