Hi Radko,

I'm wondering whether there are customizations in your version of MCF that
somehow changed the behavior.

If not, the best way forward is to take the system where we don't
understand the behavior, and try to figure out what is going on by
instrumenting the code.  That, of course, means you'd need to be operating
in a staging environment where you can perform diagnostic builds of this
kind.

MCF is unique in the Apache family in that we have to rely on users to help
us diagnose weird system integration problems.  And this kind of thing
happens all the time.  If you have the inclination and the time at some
point, maybe we can do some deeper research to figure out exactly what is
going on in your world.  I'm happy to assist if you want to take me up on
this offer.
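
For concreteness, the condition we'd be instrumenting can be sketched as
follows. This is a hypothetical distillation of the JobManager save logic
quoted further down in this thread; the class and method names here are
illustrative, not the actual MCF source:

```java
// Hypothetical sketch -- NOT the actual MCF source.  It condenses the
// JobManager save logic quoted later in this thread: the seeding version
// (and hence the incremental crawl state) is discarded only when the
// pipeline, document specification, or hop filters differ from the saved job.
public class SeedingVersionSketch {

  // Returns true when a job save would null out the seedingversion column,
  // forcing the next run to start over from the beginning of time.
  public static boolean shouldResetSeedingVersion(
      boolean pipelineSame, boolean docSpecSame, boolean hopFiltersSame) {
    boolean isSame = pipelineSame && docSpecSame && hopFiltersSame;
    if (!isSame)
      System.out.println("Setting version field to null");
    else
      System.out.println("NOT setting version field to null");
    return !isSame;
  }

  public static void main(String[] args) {
    // A schedule-only edit leaves all three comparisons equal, so the
    // seeding version survives and the next run stays incremental.
    shouldResetSeedingVersion(true, true, true);
  }
}
```

In other words, a schedule-only save should leave all three comparisons
equal; if your environment still recrawls everything, one of the comparisons
is unexpectedly reporting a difference, and a diagnostic build would show
which one.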

Thanks,
Karl


On Fri, Apr 15, 2016 at 8:07 AM, Najman, Radko <[email protected]>
wrote:

> Hi Karl,
>
> thanks again for your effort. There is probably something wrong on my side
> but I have no idea what it could be. Anyway, I was able to find a workaround
> and it is now working fine for me, so please don’t spend any more time on
> it. I know the feeling of trying to solve an issue reported by only one
> user when it works well for everyone else.
>
> Thanks,
> Radko
>
>
> From: Karl Wright <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday 13 April 2016 at 19:42
> To: "[email protected]" <[email protected]>
> Subject: Re: Scheduled ManifoldCF jobs
>
> Hi Radko,
>
> I was able to confirm that saving a job when turning on crawling from the
> start of a schedule window does NOT reset the seeding version.  Nor does
> turning it off or changing the schedule:
>
> >>>>>>
> Jetty started.
> Starting crawler...
> NOT setting version field to null
> NOT setting version field to null
> NOT setting version field to null
> <<<<<<
>
> The code I used to test this was as follows:
>
> >>>>>>
>                 if (!isSame) {
>                   System.out.println("Setting version field to null");
>                   values.put(seedingVersionField,null);
>                 } else {
>                   System.out.println("NOT setting version field to null");
>                 }
> <<<<<<
>
> I don't know what to conclude from this.  My code here seems to be working
> perfectly.
> Karl
>
>
>
>
> On Wed, Apr 13, 2016 at 10:59 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Radko,
>>
>> >>>>>>
>> thanks. I tried the proposed patch but it didn’t work for me. After a few
>> more experiments I’ve found a workaround.
>> <<<<<<
>>
>> Hmm.  I did not send you a patch.  I just offered to create a diagnostic
>> one.  So I don't know quite what you did here.
>>
>> >>>>>>
>> If I set the “Start method” on the “Connection” tab and save it, it
>> results in a full recrawl. I don’t know why it behaves this way; I didn’t
>> have enough time to look into the source code to see what happens when I
>> click the save button.
>> <<<<<<
>>
>> I don't see any code in there that could possibly cause this, but it is
>> specific enough that I can confirm it (or not).
>>
>> >>>>>>
>> I noticed another interesting thing. I use the “Start at beginning of
>> schedule window” method. If I set the scheduled time for every day at 1am
>> and I make this change at 10am, I would expect the job to start at 1am the
>> next day, but it starts immediately. I think it should work this way for
>> “Start even inside a schedule window”, but for “Start at beginning of
>> schedule window” the job should start at the exact time. Is that correct,
>> or is my understanding of the start methods wrong?
>> <<<<<<
>>
>> Your understanding is correct.  But there are integration tests that test
>> that this is working correctly, so once again I don't know why you are
>> seeing this and nobody else is.
>>
>> Karl
>>
>>
>>
>>
>> On Wed, Apr 13, 2016 at 10:49 AM, Najman, Radko wrote:
>>
>>> Hi Karl,
>>>
>>> thanks. I tried the proposed patch but it didn’t work for me. After a
>>> few more experiments I’ve found a workaround.
>>>
>>> It works as I expect if I:
>>>
>>>    1. set the schedule time on the “Scheduling” tab in the UI and save it
>>>    2. set the “Start method” by updating the Postgres “jobs” table
>>>    (update jobs set startmethod='B' where id=…)
>>>
>>> If I set the “Start method” on the “Connection” tab and save it, it
>>> results in a full recrawl. I don’t know why it behaves this way; I didn’t
>>> have enough time to look into the source code to see what happens when I
>>> click the save button.
>>>
>>> I noticed another interesting thing. I use the “Start at beginning of
>>> schedule window” method. If I set the scheduled time for every day at 1am
>>> and I make this change at 10am, I would expect the job to start at 1am
>>> the next day, but it starts immediately. I think it should work this way
>>> for “Start even inside a schedule window”, but for “Start at beginning of
>>> schedule window” the job should start at the exact time. Is that correct,
>>> or is my understanding of the start methods wrong?
>>>
>>> I’m running Manifold 2.1.
>>>
>>> Thanks,
>>> Radko
>>>
>>>
>>>
>>> From: Karl Wright <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Monday 11 April 2016 at 02:22
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Scheduled ManifoldCF jobs
>>>
>>> Here's the logic around job save (which is what would be called if you
>>> updated the schedule):
>>>
>>> >>>>>>
>>>                 boolean isSame = pipelineManager.compareRows(id,jobDescription);
>>>                 if (!isSame)
>>>                 {
>>>                   int currentStatus = stringToStatus((String)row.getValue(statusField));
>>>                   if (currentStatus == STATUS_ACTIVE || currentStatus == STATUS_ACTIVESEEDING ||
>>>                     currentStatus == STATUS_ACTIVE_UNINSTALLED || currentStatus == STATUS_ACTIVESEEDING_UNINSTALLED)
>>>                     values.put(assessmentStateField,assessmentStateToString(ASSESSMENT_UNKNOWN));
>>>                 }
>>>
>>>                 if (isSame)
>>>                 {
>>>                   String oldDocSpecXML = (String)row.getValue(documentSpecField);
>>>                   if (!oldDocSpecXML.equals(newXML))
>>>                     isSame = false;
>>>                 }
>>>
>>>                 if (isSame)
>>>                   isSame = hopFilterManager.compareRows(id,jobDescription);
>>>
>>>                 if (!isSame)
>>>                   values.put(seedingVersionField,null);
>>> <<<<<<
>>>
>>> So, changes to the job pipeline, or changes to the document
>>> specification, or changes to the hop filtering all could reset the
>>> seedingVersion field, assuming that it is the job save operation that is
>>> causing the full crawl.  At least, that is a good hypothesis.  If you think
>>> that none of these should be firing then we will have to figure out which
>>> one it is and why.
>>>
>>> Unfortunately I don't have a connector I can use locally that uses
>>> versioning information.  I could write a test connector given time but it
>>> would not duplicate your pipeline environment etc.  It may be easier for
>>> you to just try it out in your environment with diagnostics in place.  This
>>> code is in JobManager.java, and I will need to know what version of MCF you
>>> have deployed.  I can create a ticket and attach a patch that has the
>>> needed diagnostics.  Please let me know if that will work for you.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Apr 8, 2016 at 2:31 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Even further downstream, it still all looks good:
>>>>
>>>> >>>>>>
>>>> Jetty started.
>>>> Starting crawler...
>>>> Scheduled job start; requestMinimum = true
>>>> Starting job with requestMinimum = true
>>>> When starting the job, requestMinimum = true
>>>> <<<<<<
>>>>
>>>> So at the moment I am at a loss.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Radko,
>>>>>
>>>>> I set the same settings you did and instrumented the code.  It records
>>>>> the minimum job request:
>>>>>
>>>>> >>>>>>
>>>>> Jetty started.
>>>>> Starting crawler...
>>>>> Scheduled job start; requestMinimum = true
>>>>> Starting job with requestMinimum = true
>>>>> <<<<<<
>>>>>
>>>>> This is the first run of the job, and the first time the schedule has
>>>>> been used, just in case you are convinced this has something to do with
>>>>> scheduled vs. non-scheduled job runs.
>>>>>
>>>>> I am going to add more instrumentation to see if there is any chance
>>>>> there's a problem further downstream.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko wrote:
>>>>>
>>>>>> Thanks a lot Karl!
>>>>>>
>>>>>> Here are the steps I did:
>>>>>>
>>>>>>    1. Run the job manually – it took a few hours.
>>>>>>    2. Manually do a “minimal” run of the same job – it was done in a
>>>>>>    minute.
>>>>>>    3. Set up a scheduled “minimal” run – it again took a few hours, as
>>>>>>    in the first step.
>>>>>>    4. Scheduled runs on the following days were fast, as in step 2.
>>>>>>
>>>>>> Thanks for your comments, I’ll continue on it on Monday.
>>>>>>
>>>>>> Have a nice weekend,
>>>>>> Radko
>>>>>>
>>>>>>
>>>>>>
>>>>>> From: Karl Wright <[email protected]>
>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>> Date: Friday 8 April 2016 at 17:18
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>>
>>>>>> Also, going back in this thread a bit, let's make sure we are on the
>>>>>> same page:
>>>>>>
>>>>>> >>>>>>
>>>>>> I want to schedule these jobs for daily runs. I’m experiencing that
>>>>>> the first scheduled run takes the same time as when I ran the job for
>>>>>> the first time manually. It seems it is recrawling all documents.
>>>>>> Subsequent scheduled runs are fast, a few minutes. Is this expected
>>>>>> behaviour?
>>>>>> <<<<<<
>>>>>>
>>>>>> If the first scheduled run is a complete crawl (meaning you did not
>>>>>> select the "Minimal" setting for the schedule record), you *can* expect
>>>>>> the job to look at all the documents.  The reason is that Documentum
>>>>>> does not give us any information about document deletions.  We have to
>>>>>> figure that out ourselves, and the only way to do it is to look at all
>>>>>> the individual documents.  The documents do not have to actually be
>>>>>> crawled, but the connector *does* need to at least assemble its version
>>>>>> identifier string, which requires an interaction with Documentum.
>>>>>>
>>>>>> So unless you have "Minimal" crawls selected everywhere, which won't
>>>>>> ever detect deletions, you have to live with the time spent looking for
>>>>>> deletions.  We recommend that you do this at least occasionally, but
>>>>>> certainly you wouldn't want to do it more than a couple of times a
>>>>>> month, I would think.
>>>>>>
>>>>>> Hope this helps.
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> There's one slightly funky thing about the Documentum connector that
>>>>>>> tries to compensate for clock skew as follows:
>>>>>>>
>>>>>>> >>>>>>
>>>>>>>       // There seems to be some unexplained slop in the latest DCTM
>>>>>>>       // version.  It misses documents depending on how close to the
>>>>>>>       // r_modify_date you happen to be.
>>>>>>>       // So, I've decreased the start time by a full five minutes,
>>>>>>>       // to insure overlap.
>>>>>>>       if (startTime > 300000L)
>>>>>>>         startTime = startTime - 300000L;
>>>>>>>       else
>>>>>>>         startTime = 0L;
>>>>>>>       StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>>>>>>>         " and r_modify_date<=" + buildDateString(seedTime) +
>>>>>>>         " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>>>>>>>
>>>>>>> <<<<<<
>>>>>>>
>>>>>>> The 300000 ms adjustment is five minutes, which doesn't seem like a
>>>>>>> lot but maybe it is affecting your testing?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Radko,
>>>>>>>>
>>>>>>>> There's no magic here; the seedingversion from the database is
>>>>>>>> passed to the connector method which seeds documents.  The only way 
>>>>>>>> this
>>>>>>>> version gets cleared is if you save the job and the document 
>>>>>>>> specification
>>>>>>>> changes.
>>>>>>>>
>>>>>>>> The only other possibility I can think of is that the documentum
>>>>>>>> connector is ignoring the seedingversion information.  I will look into
>>>>>>>> this further over the weekend.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
>>>>>>>>
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> thanks for your clarification.
>>>>>>>>>
>>>>>>>>> I’m not changing any document specification information. I just
>>>>>>>>> set “Scheduled time” and “Job invocation” on “Scheduling” tab, “Start
>>>>>>>>> method” on “Connection” tab and click “Save” button. That’s all.
>>>>>>>>>
>>>>>>>>> I tried setting all the scheduling information directly in the
>>>>>>>>> Postgres database, to be sure I didn’t change any document
>>>>>>>>> specification information, but the result was the same: all
>>>>>>>>> documents were recrawled.
>>>>>>>>>
>>>>>>>>> One more thing I tried was to update the “seedingversion” field in
>>>>>>>>> the “jobs” table, but again all documents were recrawled.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Radko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> From: Karl Wright <[email protected]>
>>>>>>>>> Reply-To: "[email protected]" <[email protected]
>>>>>>>>> >
>>>>>>>>> Date: Friday 1 April 2016 at 14:30
>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>>>>>
>>>>>>>>> Sorry, that response was *almost* incoherent. :-)
>>>>>>>>>
>>>>>>>>> Trying again:
>>>>>>>>>
>>>>>>>>> As far as how MCF computes incremental changes, it does not matter
>>>>>>>>> whether a job is run on schedule, or manually.  But if you change 
>>>>>>>>> certain
>>>>>>>>> aspects of the job, namely the document specification information, MCF
>>>>>>>>> "starts over" at the beginning of time.  It needs to do that because 
>>>>>>>>> you
>>>>>>>>> might well have made changes to the document specification that could
>>>>>>>>> change the way documents are indexed.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Radko,
>>>>>>>>>>
>>>>>>>>>> For computing how MCF does job crawling, it does not care whether
>>>>>>>>>> the job is run manually or by schedule.
>>>>>>>>>>
>>>>>>>>>> The issue is likely to be that you changed some other detail
>>>>>>>>>> about the job definition that might have affected how documents are
>>>>>>>>>> indexed.  In that case, MCF would cause all documents to be recrawled
>>>>>>>>>> because of that.  Changes to a job's document specification 
>>>>>>>>>> information
>>>>>>>>>> will cause that to be the case.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> I have a few jobs crawling documents from Documentum. Some of
>>>>>>>>>>> these jobs are quite big and the first run of the job takes a few 
>>>>>>>>>>> hours or
>>>>>>>>>>> a day to finish. Then, when I do a “minimal run” for updates, the 
>>>>>>>>>>> job is
>>>>>>>>>>> usually done in a few minutes.
>>>>>>>>>>>
>>>>>>>>>>> I want to schedule these jobs for daily runs. I’m experiencing
>>>>>>>>>>> that the first scheduled run takes the same time as when I ran
>>>>>>>>>>> the job for the first time manually. It seems it is recrawling
>>>>>>>>>>> all documents. Subsequent scheduled runs are fast, a few minutes.
>>>>>>>>>>> Is this expected behaviour? I would expect the first scheduled
>>>>>>>>>>> run to be fast too, because the job had already finished before
>>>>>>>>>>> via a manual start. Is there a way to avoid recrawling all
>>>>>>>>>>> documents in this case? It’s a really time-consuming operation.
>>>>>>>>>>>
>>>>>>>>>>> My settings:
>>>>>>>>>>> Schedule type: Scan every document once
>>>>>>>>>>> Job invocation: Minimal
>>>>>>>>>>> Scheduled time: once a day
>>>>>>>>>>> Start method: Start when schedule window starts
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Radko
>>>>>>>>>>>
