Hi Karl, thanks. I tried the proposed patch but it didn’t work for me. After a few more experiments I’ve found a workaround.
It works as I expect if I:

1. Set the schedule time on the “Scheduling” tab in the UI and save it.
2. Set the “Start method” by updating the Postgres “jobs” table (update jobs set startmethod='B' where id=…).

If I set “Start method” on the “Connection” tab and save it, it results in a full recrawl. I don’t know why it behaves this way; I haven’t had time to look into the source code to see what happens when I click the Save button.

I noticed another interesting thing. I use the “Start at beginning of schedule window” method. If I set the scheduled time to every day at 1am, and I make this change at 10am, I would expect the job to start at 1am the next day, but it starts immediately. I think that is how “Start even inside a schedule window” should work, but with “Start at beginning of schedule window” the job should start at the exact time. Is that correct, or is my understanding of the start methods wrong?

I’m running ManifoldCF 2.1.

Thanks,
Radko

From: Karl Wright <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday 11 April 2016 at 02:22
To: "[email protected]" <[email protected]>
Subject: Re: Scheduled ManifoldCF jobs

Here's the logic around job save (which is what would be called if you updated the schedule):

>>>>>>
boolean isSame = pipelineManager.compareRows(id,jobDescription);
if (!isSame)
{
  int currentStatus = stringToStatus((String)row.getValue(statusField));
  if (currentStatus == STATUS_ACTIVE || currentStatus == STATUS_ACTIVESEEDING ||
    currentStatus == STATUS_ACTIVE_UNINSTALLED || currentStatus == STATUS_ACTIVESEEDING_UNINSTALLED)
    values.put(assessmentStateField,assessmentStateToString(ASSESSMENT_UNKNOWN));
}
if (isSame)
{
  String oldDocSpecXML = (String)row.getValue(documentSpecField);
  if (!oldDocSpecXML.equals(newXML))
    isSame = false;
}
if (isSame)
  isSame = hopFilterManager.compareRows(id,jobDescription);
if (!isSame)
  values.put(seedingVersionField,null);
<<<<<<

So changes to the job pipeline, changes to the document specification, or changes to the hop filtering could all reset the seedingVersion field, assuming that it is the job save operation that is causing the full crawl. At least, that is a good hypothesis. If you think that none of these should be firing, then we will have to figure out which one it is and why.

Unfortunately I don't have a connector I can use locally that uses versioning information. I could write a test connector given time, but it would not duplicate your pipeline environment etc. It may be easier for you to just try it out in your environment with diagnostics in place. This code is in JobManager.java, and I will need to know what version of MCF you have deployed. I can create a ticket and attach a patch that has the needed diagnostics. Please let me know if that will work for you.

Thanks,
Karl

On Fri, Apr 8, 2016 at 2:31 PM, Karl Wright <[email protected]> wrote:

Even further downstream, it still all looks good:

>>>>>>
Jetty started.
Starting crawler...
Scheduled job start; requestMinimum = true
Starting job with requestMinimum = true
When starting the job, requestMinimum = true
<<<<<<

So at the moment I am at a loss.

Karl

On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <[email protected]> wrote:

Hi Radko,

I set the same settings you did and instrumented the code. It records the minimum job request:

>>>>>>
Jetty started.
Starting crawler...
Scheduled job start; requestMinimum = true
Starting job with requestMinimum = true
<<<<<<

This is the first run of the job, and the first time the schedule has been used, just in case you are convinced this has something to do with scheduled vs. non-scheduled job runs. I am going to add more instrumentation to see if there is any chance there's a problem further downstream.

Karl

On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko wrote:

Thanks a lot Karl!
Here are the steps I did:

1. Run the job manually – it took a few hours.
2. Manually do a “minimal” run of the same job – it was done in a minute.
3. Set up a scheduled “minimal” run – it took a few hours again, as in the first step.
4. Scheduled runs on the following days were fast, as in step 2.

Thanks for your comments, I’ll continue with it on Monday.

Have a nice weekend,
Radko

From: Karl Wright <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday 8 April 2016 at 17:18
To: "[email protected]" <[email protected]>
Subject: Re: Scheduled ManifoldCF jobs

Also, going back in this thread a bit, let's make sure we are on the same page:

>>>>>>
I want to schedule these jobs for daily runs. I’m experiencing that the first scheduled run takes the same time as when I ran the job manually for the first time. It seems it is recrawling all documents. The next scheduled runs are fast, a few minutes. Is it expected behaviour?
<<<<<<

If the first scheduled run is a complete crawl (meaning you did not select the "Minimal" setting for the schedule record), you *can* expect the job to look at all the documents. The reason is that Documentum does not give us any information about document deletions. We have to figure that out ourselves, and the only way to do it is to look at all the individual documents. The documents do not have to actually be crawled, but the connector *does* need to at least assemble its version identifier string, which requires an interaction with Documentum.

So unless you have "Minimal" crawls selected everywhere, which won't ever detect deletions, you have to live with the time spent looking for deletions. We recommend that you do this at least occasionally, but certainly you wouldn't want to do it more than a couple of times a month, I would think.

Hope this helps.
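[Editor's note: the deletion-detection point above can be illustrated with a small standalone sketch. This is not ManifoldCF code; the class and method names below are invented for illustration. The idea is that a complete crawl enumerates everything currently in the repository, so anything previously indexed but no longer enumerated can be removed from the index, while a minimal crawl only visits changed documents and therefore can never notice a deletion.]

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only -- not ManifoldCF code.
// A complete crawl enumerates every document still in the repository, so the
// indexer can compute deletions as (previously indexed) minus (currently seen).
// A minimal crawl only sees changed documents, so deletions go undetected.
public class DeletionSketch {
  static Set<String> deletionsAfterCompleteCrawl(Set<String> indexed, Set<String> enumerated) {
    Set<String> toDelete = new HashSet<>(indexed);
    toDelete.removeAll(enumerated);   // indexed, but no longer present in the repository
    return toDelete;
  }

  public static void main(String[] args) {
    Set<String> indexed = new HashSet<>(Set.of("doc1", "doc2", "doc3"));
    Set<String> nowInRepo = new HashSet<>(Set.of("doc1", "doc3")); // doc2 was deleted
    System.out.println(deletionsAfterCompleteCrawl(indexed, nowInRepo)); // prints [doc2]
  }
}
```

This is why Karl recommends running a full (non-minimal) crawl at least occasionally even when the daily runs are minimal.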
Karl

On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <[email protected]> wrote:

There's one slightly funky thing about the Documentum connector, which tries to compensate for clock skew as follows:

>>>>>>
// There seems to be some unexplained slop in the latest DCTM version. It misses
// documents depending on how close to the r_modify_date you happen to be.
// So, I've decreased the start time by a full five minutes, to insure overlap.
if (startTime > 300000L)
  startTime = startTime - 300000L;
else
  startTime = 0L;
StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
  " and r_modify_date<=" + buildDateString(seedTime) +
  " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
<<<<<<

The 300000 ms adjustment is five minutes, which doesn't seem like a lot, but maybe it is affecting your testing?

Karl

On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <[email protected]> wrote:

Hi Radko,

There's no magic here; the seedingversion from the database is passed to the connector method which seeds documents. The only way this version gets cleared is if you save the job and the document specification changes. The only other possibility I can think of is that the Documentum connector is ignoring the seedingversion information. I will look into this further over the weekend.

Karl

On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:

Hi Karl,

thanks for your clarification. I’m not changing any document specification information. I just set “Scheduled time” and “Job invocation” on the “Scheduling” tab, set “Start method” on the “Connection” tab, and click the “Save” button. That’s all. I tried setting all the scheduling information directly in the Postgres database, to be sure I didn’t change any document specification information, and the result was the same: all documents were recrawled.
One more thing I tried was to update “seedingversion” in the “jobs” table, but again all documents were recrawled.

Thanks,
Radko

From: Karl Wright <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday 1 April 2016 at 14:30
To: "[email protected]" <[email protected]>
Subject: Re: Scheduled ManifoldCF jobs

Sorry, that response was *almost* incoherent. :-) Trying again:

As far as how MCF computes incremental changes, it does not matter whether a job is run on schedule or manually. But if you change certain aspects of the job, namely the document specification information, MCF "starts over" at the beginning of time. It needs to do that because you might well have made changes to the document specification that could change the way documents are indexed.

Thanks,
Karl

On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <[email protected]> wrote:

Hi Radko,

For computing how MCF does job crawling, it does not care whether the job is run manually or by schedule. The issue is likely to be that you changed some other detail about the job definition that might have affected how documents are indexed. In that case, MCF would cause all documents to be recrawled. Changes to a job's document specification information will cause that to be the case.

Thanks,
Karl

On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:

Hello,

I have a few jobs crawling documents from Documentum. Some of these jobs are quite big, and the first run of the job takes a few hours or a day to finish. Then, when I do a “minimal run” for updates, the job is usually done in a few minutes.

I want to schedule these jobs for daily runs. I’m experiencing that the first scheduled run takes the same time as when I ran the job manually for the first time. It seems it is recrawling all documents. The next scheduled runs are fast, a few minutes.
Is it expected behaviour? I would expect the first scheduled run to be fast too, because the job had already been completed by a manual start. Is there a way to avoid recrawling all documents in this case? It’s a really time-consuming operation.

My settings:
Schedule type: Scan every document once
Job invocation: Minimal
Scheduled time: once a day
Start method: Start when schedule window starts

Thank you,
Radko
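[Editor's note: for readers skimming the thread, the JobManager.java excerpt Karl quotes near the top boils down to a single decision: any detected difference in the pipeline, the document specification XML, or the hop filters clears the seedingVersion field, which forces the next run to behave like a full crawl. The sketch below restates that decision in standalone form; the class, method, and parameter names are invented stand-ins, not MCF APIs.]

```java
// Illustrative restatement of the job-save logic quoted from JobManager.java
// earlier in the thread. Names here are invented stand-ins, not MCF APIs.
public class SeedingResetSketch {
  // Returns true if saving the job should clear seedingVersion,
  // which forces the next run to behave like a full crawl.
  static boolean shouldResetSeedingVersion(boolean pipelineSame,
                                           String oldDocSpecXML,
                                           String newDocSpecXML,
                                           boolean hopFiltersSame) {
    boolean isSame = pipelineSame;
    if (isSame && !oldDocSpecXML.equals(newDocSpecXML))
      isSame = false;                 // document specification changed
    if (isSame)
      isSame = hopFiltersSame;        // hop filtering changed
    return !isSame;                   // any difference clears seedingVersion
  }

  public static void main(String[] args) {
    // Only the schedule changed: nothing compared here differs, so no reset.
    System.out.println(shouldResetSeedingVersion(true, "<spec/>", "<spec/>", true));  // false
    // Document specification changed: seedingVersion is cleared -> full recrawl.
    System.out.println(shouldResetSeedingVersion(true, "<spec/>", "<other/>", true)); // true
  }
}
```

Note that schedule fields are not among the inputs compared, which matches Karl's point that merely changing the schedule should not, by itself, trigger a full recrawl.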
