Hi Radko,
I was able to confirm that saving a job after switching it to "Start at
beginning of schedule window" does NOT reset the seeding version.
Neither does switching it back, nor changing the schedule:
>>>>>>
Jetty started.
Starting crawler...
NOT setting version field to null
NOT setting version field to null
NOT setting version field to null
<<<<<<
The code I used to test this was as follows:
>>>>>>
// Diagnostic wrapped around the seeding-version reset in JobManager.java:
if (!isSame)
{
  System.out.println("Setting version field to null");
  values.put(seedingVersionField,null);
}
else
{
  System.out.println("NOT setting version field to null");
}
<<<<<<
I don't know what to conclude from this. My code here seems to be working
perfectly.
Karl
On Wed, Apr 13, 2016 at 10:59 AM, Karl Wright <[email protected]> wrote:
> Hi Radko,
>
> >>>>>>
> thanks. I tried the proposed patch but it didn’t work for me. After a few
> more experiments I’ve found a workaround.
> <<<<<<
>
> Hmm. I did not send you a patch. I just offered to create a diagnostic
> one. So I don't know quite what you did here.
>
> >>>>>>
> If I set “Start method” on the “Connection” tab and save it, it results
> in a full recrawl. I don’t know why it behaves this way; I didn’t have
> enough time to look into the source code to see what happens when I
> click the save button.
> <<<<<<
>
> I don't see any code in there that could possibly cause this, but your
> report is specific enough that I can try to confirm it (or not).
>
> >>>>>>
> I noticed another interesting thing. I use the “Start at beginning of
> schedule window” method. If I set the scheduled time to 1am every day,
> and I make this change at 10am, I would expect the job to start at 1am
> the next day, but it starts immediately. I think it should work this way
> for “Start even inside a schedule window”, but for “Start at beginning
> of schedule window” the job should start at the exact time. Is that
> correct, or is my understanding of the start methods wrong?
> <<<<<<
>
> Your understanding is correct. But there are integration tests that
> verify this behavior, so once again I don't know why you are seeing this
> and nobody else is.
>
> Karl
>
>
>
>
> On Wed, Apr 13, 2016 at 10:49 AM, Najman, Radko <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> thanks. I tried the proposed patch but it didn’t work for me. After a few
>> more experiments I’ve found a workaround.
>>
>> It works as I expect if I:
>>
>> 1. set the schedule time on the “Scheduling” tab in the UI and save it
>> 2. set “Start method” by updating the Postgres “jobs” table directly
>> (update jobs set startmethod='B' where id=…)
>>
>> If I set “Start method” on the “Connection” tab and save it, it results
>> in a full recrawl. I don’t know why it behaves this way; I didn’t have
>> enough time to look into the source code to see what happens when I
>> click the save button.
>>
>> I noticed another interesting thing. I use the “Start at beginning of
>> schedule window” method. If I set the scheduled time to 1am every day,
>> and I make this change at 10am, I would expect the job to start at 1am
>> the next day, but it starts immediately. I think it should work this
>> way for “Start even inside a schedule window”, but for “Start at
>> beginning of schedule window” the job should start at the exact time.
>> Is that correct, or is my understanding of the start methods wrong?
>>
>> I’m running ManifoldCF 2.1.
>>
>> Thanks,
>> Radko
>>
>>
>>
>> From: Karl Wright <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday 11 April 2016 at 02:22
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Scheduled ManifoldCF jobs
>>
>> Here's the logic around job save (which is what would be called if you
>> updated the schedule):
>>
>> >>>>>>
>> boolean isSame = pipelineManager.compareRows(id,jobDescription);
>> if (!isSame)
>> {
>>   int currentStatus = stringToStatus((String)row.getValue(statusField));
>>   if (currentStatus == STATUS_ACTIVE || currentStatus == STATUS_ACTIVESEEDING ||
>>     currentStatus == STATUS_ACTIVE_UNINSTALLED ||
>>     currentStatus == STATUS_ACTIVESEEDING_UNINSTALLED)
>>     values.put(assessmentStateField,assessmentStateToString(ASSESSMENT_UNKNOWN));
>> }
>>
>> if (isSame)
>> {
>>   String oldDocSpecXML = (String)row.getValue(documentSpecField);
>>   if (!oldDocSpecXML.equals(newXML))
>>     isSame = false;
>> }
>>
>> if (isSame)
>>   isSame = hopFilterManager.compareRows(id,jobDescription);
>>
>> if (!isSame)
>>   values.put(seedingVersionField,null);
>> <<<<<<
>>
>> So changes to the job pipeline, changes to the document specification,
>> or changes to the hop filtering could all reset the seedingVersion
>> field, assuming it is the job save operation that is causing the full
>> crawl. At least, that is a good hypothesis. If you think none of these
>> should be firing, then we will have to figure out which one it is, and
>> why.
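>>
>> If it would help, a per-condition version of the diagnostic might look
>> like this (a sketch only; the log wording is mine, and the variables
>> are the ones from the snippet above):
>>
>> >>>>>>
>> boolean isSame = pipelineManager.compareRows(id,jobDescription);
>> if (!isSame)
>>   System.out.println("Job "+id+": pipeline rows differ");
>>
>> if (isSame)
>> {
>>   String oldDocSpecXML = (String)row.getValue(documentSpecField);
>>   if (!oldDocSpecXML.equals(newXML))
>>   {
>>     isSame = false;
>>     System.out.println("Job "+id+": document specification XML differs");
>>   }
>> }
>>
>> if (isSame && !hopFilterManager.compareRows(id,jobDescription))
>> {
>>   isSame = false;
>>   System.out.println("Job "+id+": hop filter rows differ");
>> }
>>
>> // Any of the three conditions above clears the stored seeding version,
>> // which forces the next run to seed from the beginning of time.
>> if (!isSame)
>>   values.put(seedingVersionField,null);
>> <<<<<<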
>>
>> Unfortunately I don't have a connector I can use locally that uses
>> versioning information. I could write a test connector given time but it
>> would not duplicate your pipeline environment etc. It may be easier for
>> you to just try it out in your environment with diagnostics in place. This
>> code is in JobManager.java, and I will need to know what version of MCF you
>> have deployed. I can create a ticket and attach a patch that has the
>> needed diagnostics. Please let me know if that will work for you.
>>
>> Thanks,
>> Karl
>>
>>
>> On Fri, Apr 8, 2016 at 2:31 PM, Karl Wright <[email protected]> wrote:
>>
>>> Even further downstream, it still all looks good:
>>>
>>> >>>>>>
>>> Jetty started.
>>> Starting crawler...
>>> Scheduled job start; requestMinimum = true
>>> Starting job with requestMinimum = true
>>> When starting the job, requestMinimum = true
>>> <<<<<<
>>>
>>> So at the moment I am at a loss.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Radko,
>>>>
>>>> I set the same settings you did and instrumented the code. The
>>>> instrumentation records the requestMinimum flag:
>>>>
>>>> >>>>>>
>>>> Jetty started.
>>>> Starting crawler...
>>>> Scheduled job start; requestMinimum = true
>>>> Starting job with requestMinimum = true
>>>> <<<<<<
>>>>
>>>> This is the first run of the job, and the first time the schedule has
>>>> been used, just in case you are convinced this has something to do with
>>>> scheduled vs. non-scheduled job runs.
>>>>
>>>> I am going to add more instrumentation to see if there is any chance
>>>> there's a problem further downstream.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko wrote:
>>>>
>>>>> Thanks a lot Karl!
>>>>>
>>>>> Here are the steps I took:
>>>>>
>>>>> 1. Ran the job manually – it took a few hours.
>>>>> 2. Did a manual “minimal” run of the same job – it was done in a minute.
>>>>> 3. Set up a scheduled “minimal” run – it again took a few hours, as in
>>>>> the first step.
>>>>> 4. Scheduled runs on the following days were fast, as in step 2.
>>>>>
>>>>> Thanks for your comments; I’ll continue with it on Monday.
>>>>>
>>>>> Have a nice weekend,
>>>>> Radko
>>>>>
>>>>>
>>>>>
>>>>> From: Karl Wright <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Friday 8 April 2016 at 17:18
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>
>>>>> Also, going back in this thread a bit, let's make sure we are on the
>>>>> same page:
>>>>>
>>>>> >>>>>>
>>>>> I want to schedule these jobs for daily runs. I’m experiencing that
>>>>> the first scheduled run takes the same time as the first manual run of
>>>>> the job. It seems to be recrawling all documents. Subsequent scheduled
>>>>> runs are fast, a few minutes. Is this expected behaviour?
>>>>> <<<<<<
>>>>>
>>>>> If the first scheduled run is a complete crawl (meaning you did not
>>>>> select the "Minimal" setting for the schedule record), you *can* expect
>>>>> the job to look at all the documents. The reason is that Documentum does
>>>>> not give us any information about document deletions. We have to figure
>>>>> that out ourselves, and the only way to do it is to look at all the
>>>>> individual documents. The documents do not have to actually be crawled,
>>>>> but the connector *does* need to at least assemble its version identifier
>>>>> string, which requires an interaction with Documentum.
>>>>>
>>>>> So unless you have "Minimal" crawls selected everywhere, which won't
>>>>> ever detect deletions, you have to live with the time spent looking for
>>>>> deletions. We recommend that you do this at least occasionally, but
>>>>> certainly you wouldn't want to do it more than a couple of times a
>>>>> month, I would think.
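>>>>>
>>>>> To make that concrete, here is a rough sketch of the connector side
>>>>> (a sketch only, not the actual Documentum connector;
>>>>> buildVersionString() is an assumed helper). Even when nothing has
>>>>> changed, every document gets a version-string check, and only
>>>>> documents whose string changed get refetched and reindexed:
>>>>>
>>>>> >>>>>>
>>>>> @Override
>>>>> public void processDocuments(String[] documentIdentifiers,
>>>>>   IExistingVersions statuses, Specification spec,
>>>>>   IProcessActivity activities, int jobMode, boolean usesDefaultAuthority)
>>>>>   throws ManifoldCFException, ServiceInterruption
>>>>> {
>>>>>   for (String docId : documentIdentifiers)
>>>>>   {
>>>>>     // This requires a round trip to the repository even if nothing changed.
>>>>>     String versionString = buildVersionString(docId);
>>>>>     if (!activities.checkDocumentNeedsReindexing(docId,versionString))
>>>>>       continue;  // Unchanged: skip the expensive fetch and index steps.
>>>>>     // ... fetch the content and hand it to activities for indexing ...
>>>>>   }
>>>>> }
>>>>> <<<<<<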
>>>>>
>>>>> Hope this helps.
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> There's one slightly funky thing about the Documentum connector: it
>>>>>> tries to compensate for clock skew, as follows:
>>>>>>
>>>>>> >>>>>>
>>>>>> // There seems to be some unexplained slop in the latest DCTM version.
>>>>>> // It misses documents depending on how close to the r_modify_date you
>>>>>> // happen to be. So, I've decreased the start time by a full five
>>>>>> // minutes, to insure overlap.
>>>>>> if (startTime > 300000L)
>>>>>>   startTime = startTime - 300000L;
>>>>>> else
>>>>>>   startTime = 0L;
>>>>>> StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>>>>>>   " and r_modify_date<=" + buildDateString(seedTime) +
>>>>>>   " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>>>>>> <<<<<<
>>>>>>
>>>>>> The 300000 ms adjustment is five minutes, which doesn't seem like a
>>>>>> lot, but maybe it is affecting your testing?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Radko,
>>>>>>>
>>>>>>> There's no magic here; the seedingversion from the database is passed
>>>>>>> to the connector method which seeds documents. The only way this
>>>>>>> version gets cleared is if you save the job and the document
>>>>>>> specification changes.
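>>>>>>>
>>>>>>> For reference, the hand-off on the connector side looks roughly like
>>>>>>> this (a sketch, not the actual Documentum connector;
>>>>>>> findDocumentsModifiedBetween() is an assumed helper):
>>>>>>>
>>>>>>> >>>>>>
>>>>>>> @Override
>>>>>>> public String addSeedDocuments(ISeedingActivity activities,
>>>>>>>   Specification spec, String lastSeedVersion, long seedTime, int jobMode)
>>>>>>>   throws ManifoldCFException, ServiceInterruption
>>>>>>> {
>>>>>>>   // A null lastSeedVersion means the stored seedingversion was
>>>>>>>   // cleared, so we seed from the beginning of time.
>>>>>>>   long startTime = (lastSeedVersion == null) ? 0L : Long.parseLong(lastSeedVersion);
>>>>>>>   for (String docId : findDocumentsModifiedBetween(startTime,seedTime))
>>>>>>>     activities.addSeedDocument(docId);
>>>>>>>   // Whatever we return is stored back as the job's seedingversion.
>>>>>>>   return Long.toString(seedTime);
>>>>>>> }
>>>>>>> <<<<<<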
>>>>>>>
>>>>>>> The only other possibility I can think of is that the Documentum
>>>>>>> connector is ignoring the seedingversion information. I will look into
>>>>>>> this further over the weekend.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> thanks for your clarification.
>>>>>>>>
>>>>>>>> I’m not changing any document specification information. I just set
>>>>>>>> “Scheduled time” and “Job invocation” on the “Scheduling” tab and
>>>>>>>> “Start method” on the “Connection” tab, and click the “Save” button.
>>>>>>>> That’s all.
>>>>>>>>
>>>>>>>> I tried setting all the scheduling information directly in the
>>>>>>>> Postgres database, to be sure I didn’t change any document
>>>>>>>> specification information, and the result was the same: all documents
>>>>>>>> were recrawled.
>>>>>>>>
>>>>>>>> One more thing I tried was updating “seedingversion” in the “jobs”
>>>>>>>> table, but again all documents were recrawled.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Radko
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> From: Karl Wright <[email protected]>
>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>> Date: Friday 1 April 2016 at 14:30
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>>>>
>>>>>>>> Sorry, that response was *almost* incoherent. :-)
>>>>>>>>
>>>>>>>> Trying again:
>>>>>>>>
>>>>>>>> As far as computing incremental changes is concerned, it does not
>>>>>>>> matter whether a job is run on a schedule or manually. But if you
>>>>>>>> change certain aspects of the job, namely the document specification
>>>>>>>> information, MCF "starts over" at the beginning of time. It needs to
>>>>>>>> do that because you might well have made changes to the document
>>>>>>>> specification that could change the way documents are indexed.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Radko,
>>>>>>>>>
>>>>>>>>> When MCF computes what to crawl, it does not care whether the job
>>>>>>>>> is run manually or on a schedule.
>>>>>>>>>
>>>>>>>>> The issue is likely to be that you changed some other detail of the
>>>>>>>>> job definition that might have affected how documents are indexed.
>>>>>>>>> In that case, MCF would recrawl all documents. In particular,
>>>>>>>>> changes to a job's document specification information will cause
>>>>>>>>> this.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I have a few jobs crawling documents from Documentum. Some of
>>>>>>>>>> these jobs are quite big, and the first run of the job takes a few
>>>>>>>>>> hours or a day to finish. Then, when I do a “minimal run” for
>>>>>>>>>> updates, the job is usually done in a few minutes.
>>>>>>>>>>
>>>>>>>>>> I want to schedule these jobs for daily runs. I’m experiencing
>>>>>>>>>> that the first scheduled run takes the same time as the first
>>>>>>>>>> manual run of the job. It seems to be recrawling all documents.
>>>>>>>>>> Subsequent scheduled runs are fast, a few minutes. Is this expected
>>>>>>>>>> behaviour? I would expect the first scheduled run to be fast too,
>>>>>>>>>> because the job had already finished via a manual start. Is there a
>>>>>>>>>> way to avoid recrawling all documents in this case? It’s a really
>>>>>>>>>> time-consuming operation.
>>>>>>>>>>
>>>>>>>>>> My settings:
>>>>>>>>>> Schedule type: Scan every document once
>>>>>>>>>> Job invocation: Minimal
>>>>>>>>>> Scheduled time: once a day
>>>>>>>>>> Start method: Start when schedule window starts
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Radko
>>>>>>>>>>