Hi Radko,

>>>>>>
thanks. I tried the proposed patch but it didn’t work for me. After a few more experiments I’ve found a workaround.
<<<<<<

Hmm. I did not send you a patch. I just offered to create a diagnostic one. So I don't know quite what you did here.

>>>>>>
If I set “Start method” on the “Connection” tab and save it, it results in a full recrawl. I don’t know why it is behaving this way; I didn’t have enough time to look into the source code to see what happens when I click the save button.
<<<<<<

I don't see any code in there that could possibly cause this, but it is specific enough that I can confirm it (or not).
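In the meantime, here is roughly the kind of diagnostic I had in mind. This is only an untested sketch against the job-save logic quoted further down in this thread (JobManager.java); it assumes the surrounding fields and managers from that excerpt and that the Logging.jobs logger is usable at that point, so treat the exact names as assumptions to verify against your 2.1 source rather than a finished patch:

>>>>>>
// Sketch only: log which comparison is clearing the seeding version on job save.
// Assumes the surrounding context shown in the excerpt below (pipelineManager,
// hopFilterManager, row, documentSpecField, seedingVersionField, newXML), and
// that org.apache.manifoldcf.crawler.system.Logging.jobs is available here.
boolean isSame = pipelineManager.compareRows(id,jobDescription);
if (!isSame)
  Logging.jobs.warn("Job "+id+": pipeline changed on save; seeding version will be reset");

if (isSame)
{
  String oldDocSpecXML = (String)row.getValue(documentSpecField);
  if (!oldDocSpecXML.equals(newXML))
  {
    isSame = false;
    Logging.jobs.warn("Job "+id+": document specification changed on save; seeding version will be reset");
  }
}

if (isSame && !hopFilterManager.compareRows(id,jobDescription))
{
  isSame = false;
  Logging.jobs.warn("Job "+id+": hop filters changed on save; seeding version will be reset");
}

if (!isSame)
  values.put(seedingVersionField,null);
<<<<<<

If none of those messages appear when you save from the “Connection” tab and you still get a full recrawl, then the seeding version is being cleared somewhere else and we will have to look further.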
>>>>>>
I noticed another interesting thing. I use the “Start at beginning of schedule window” method. If I set the scheduled time for every day at 1am and I make this change at 10am, I would expect the job to start at 1am the next day, but it starts immediately. I think it should work this way for “Start even inside a schedule window”, but for “Start at beginning of schedule window” the job should start at the exact time. Is that correct, or is my understanding of the start methods wrong?
<<<<<<

Your understanding is correct. But there are integration tests that verify this is working correctly, so once again I don't know why you are seeing this and nobody else is.
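For what it's worth, here is the behavior we both expect, written out as a toy check. This is not the actual MCF scheduler code -- the names and parameters are made up purely for illustration:

>>>>>>
// Illustration only (not MCF code): when a job with a schedule window may start.
// A window opens at windowStartMillis and stays open for windowLengthMillis.
static boolean mayStartNow(long nowMillis, long windowStartMillis, long windowLengthMillis,
  boolean startOnlyWhenWindowStarts, long schedulerTickMillis)
{
  boolean insideWindow = nowMillis >= windowStartMillis &&
    nowMillis < windowStartMillis + windowLengthMillis;
  if (!insideWindow)
    return false;                 // never start outside a schedule window
  if (startOnlyWhenWindowStarts)
    // "Start when schedule window starts": only start if the window has just opened,
    // so saving the job at 10am should NOT start it before the next 1am window.
    return nowMillis - windowStartMillis < schedulerTickMillis;
  // "Start even inside a schedule window": any time within the window is fine.
  return true;
}
<<<<<<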
Karl

On Wed, Apr 13, 2016 at 10:49 AM, Najman, Radko <[email protected]> wrote:

> Hi Karl,
>
> thanks. I tried the proposed patch but it didn’t work for me. After a few more experiments I’ve found a workaround.
>
> It works as I expect if:
>
> 1. set the schedule time on the “Scheduling” tab in the UI and save it
> 2. set the “Start method” by updating the Postgres “jobs” table (update jobs set startmethod='B' where id=…)
>
> If I set “Start method” on the “Connection” tab and save it, it results in a full recrawl. I don’t know why it is behaving this way; I didn’t have enough time to look into the source code to see what happens when I click the save button.
>
> I noticed another interesting thing. I use the “Start at beginning of schedule window” method. If I set the scheduled time for every day at 1am and I make this change at 10am, I would expect the job to start at 1am the next day, but it starts immediately. I think it should work this way for “Start even inside a schedule window”, but for “Start at beginning of schedule window” the job should start at the exact time. Is that correct, or is my understanding of the start methods wrong?
>
> I’m running Manifold 2.1.
>
> Thanks,
> Radko
>
>
> From: Karl Wright <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday 11 April 2016 at 02:22
> To: "[email protected]" <[email protected]>
> Subject: Re: Scheduled ManifoldCF jobs
>
> Here's the logic around job save (which is what would be called if you updated the schedule):
>
> >>>>>>
>       boolean isSame = pipelineManager.compareRows(id,jobDescription);
>       if (!isSame)
>       {
>         int currentStatus = stringToStatus((String)row.getValue(statusField));
>         if (currentStatus == STATUS_ACTIVE || currentStatus == STATUS_ACTIVESEEDING ||
>           currentStatus == STATUS_ACTIVE_UNINSTALLED || currentStatus == STATUS_ACTIVESEEDING_UNINSTALLED)
>           values.put(assessmentStateField,assessmentStateToString(ASSESSMENT_UNKNOWN));
>       }
>
>       if (isSame)
>       {
>         String oldDocSpecXML = (String)row.getValue(documentSpecField);
>         if (!oldDocSpecXML.equals(newXML))
>           isSame = false;
>       }
>
>       if (isSame)
>         isSame = hopFilterManager.compareRows(id,jobDescription);
>
>       if (!isSame)
>         values.put(seedingVersionField,null);
> <<<<<<
>
> So, changes to the job pipeline, or changes to the document specification, or changes to the hop filtering all could reset the seedingVersion field, assuming that it is the job save operation that is causing the full crawl. At least, that is a good hypothesis. If you think that none of these should be firing, then we will have to figure out which one it is and why.
>
> Unfortunately I don't have a connector I can use locally that uses versioning information. I could write a test connector given time, but it would not duplicate your pipeline environment etc. It may be easier for you to just try it out in your environment with diagnostics in place. This code is in JobManager.java, and I will need to know what version of MCF you have deployed. I can create a ticket and attach a patch that has the needed diagnostics. Please let me know if that will work for you.
>
> Thanks,
> Karl
>
>
> On Fri, Apr 8, 2016 at 2:31 PM, Karl Wright <[email protected]> wrote:
>
>> Even further downstream, it still all looks good:
>>
>> >>>>>>
>> Jetty started.
>> Starting crawler...
>> Scheduled job start; requestMinimum = true
>> Starting job with requestMinimum = true
>> When starting the job, requestMinimum = true
>> <<<<<<
>>
>> So at the moment I am at a loss.
>>
>> Karl
>>
>>
>> On Fri, Apr 8, 2016 at 2:20 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Radko,
>>>
>>> I set the same settings you did and instrumented the code. It records the minimum job request:
>>>
>>> >>>>>>
>>> Jetty started.
>>> Starting crawler...
>>> Scheduled job start; requestMinimum = true
>>> Starting job with requestMinimum = true
>>> <<<<<<
>>>
>>> This is the first run of the job, and the first time the schedule has been used, just in case you are convinced this has something to do with scheduled vs. non-scheduled job runs.
>>>
>>> I am going to add more instrumentation to see if there is any chance there's a problem further downstream.
>>>
>>> Karl
>>>
>>>
>>> On Fri, Apr 8, 2016 at 1:06 PM, Najman, Radko wrote:
>>>
>>>> Thanks a lot Karl!
>>>>
>>>> Here are the steps I did:
>>>>
>>>> 1. Run the job manually – it took a few hours.
>>>> 2. Manually “minimal” run the same job – it was done in a minute
>>>> 3. Set up a scheduled “minimal” run – it again took a few hours, as in the first step
>>>> 4. Scheduled runs on the other days were fast, as in step 2.
>>>>
>>>> Thanks for your comments, I’ll continue on it on Monday.
>>>>
>>>> Have a nice weekend,
>>>> Radko
>>>>
>>>>
>>>> From: Karl Wright <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Friday 8 April 2016 at 17:18
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>
>>>> Also, going back in this thread a bit, let's make sure we are on the same page:
>>>>
>>>> >>>>>>
>>>> I want to schedule these jobs for daily runs. I’m experiencing that the first scheduled run takes the same time as when I ran the job manually for the first time. It seems it is recrawling all documents. The next scheduled runs are fast, a few minutes. Is this expected behaviour?
>>>> <<<<<<
>>>>
>>>> If the first scheduled run is a complete crawl (meaning you did not select the "Minimal" setting for the schedule record), you *can* expect the job to look at all the documents. The reason is that Documentum does not give us any information about document deletions. We have to figure that out ourselves, and the only way to do it is to look at all the individual documents. The documents do not have to actually be crawled, but the connector *does* need to at least assemble its version identifier string, which requires an interaction with Documentum.
>>>>
>>>> So unless you have "Minimal" crawls selected everywhere, which won't ever detect deletions, you have to live with the time spent looking for deletions. We recommend that you do this at least occasionally, but certainly you wouldn't want to do it more than a couple of times a month, I would think.
>>>>
>>>> Hope this helps.
>>>> Karl
>>>>
>>>>
>>>> On Fri, Apr 8, 2016 at 10:54 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> There's one slightly funky thing about the Documentum connector that tries to compensate for clock skew as follows:
>>>>>
>>>>> >>>>>>
>>>>>       // There seems to be some unexplained slop in the latest DCTM version.  It misses documents depending on how close to the r_modify_date you happen to be.
>>>>>       // So, I've decreased the start time by a full five minutes, to insure overlap.
>>>>>       if (startTime > 300000L)
>>>>>         startTime = startTime - 300000L;
>>>>>       else
>>>>>         startTime = 0L;
>>>>>       StringBuilder strDQLend = new StringBuilder(" where r_modify_date >= " + buildDateString(startTime) +
>>>>>         " and r_modify_date<=" + buildDateString(seedTime) +
>>>>>         " AND (i_is_deleted=TRUE Or (i_is_deleted=FALSE AND a_full_text=TRUE AND r_content_size>0");
>>>>> <<<<<<
>>>>>
>>>>> The 300000 ms adjustment is five minutes, which doesn't seem like a lot, but maybe it is affecting your testing?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Fri, Apr 8, 2016 at 10:50 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Radko,
>>>>>>
>>>>>> There's no magic here; the seedingversion from the database is passed to the connector method which seeds documents. The only way this version gets cleared is if you save the job and the document specification changes.
>>>>>>
>>>>>> The only other possibility I can think of is that the Documentum connector is ignoring the seedingversion information. I will look into this further over the weekend.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 8, 2016 at 10:33 AM, Najman, Radko wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> thanks for your clarification.
>>>>>>>
>>>>>>> I’m not changing any document specification information. I just set “Scheduled time” and “Job invocation” on the “Scheduling” tab and “Start method” on the “Connection” tab, and click the “Save” button. That’s all.
>>>>>>>
>>>>>>> I tried setting all the scheduling information directly in the Postgres database, to be sure I didn’t change any document specification information, and the result was the same: all documents were recrawled.
>>>>>>>
>>>>>>> One more thing I tried was to update “seedingversion” in the “jobs” table, but again all documents were recrawled.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Radko
>>>>>>>
>>>>>>>
>>>>>>> From: Karl Wright <[email protected]>
>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>> Date: Friday 1 April 2016 at 14:30
>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>> Subject: Re: Scheduled ManifoldCF jobs
>>>>>>>
>>>>>>> Sorry, that response was *almost* incoherent. :-)
>>>>>>>
>>>>>>> Trying again:
>>>>>>>
>>>>>>> As far as how MCF computes incremental changes, it does not matter whether a job is run on schedule or manually. But if you change certain aspects of the job, namely the document specification information, MCF "starts over" at the beginning of time. It needs to do that because you might well have made changes to the document specification that could change the way documents are indexed.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 6:36 AM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Radko,
>>>>>>>>
>>>>>>>> When it comes to how MCF does job crawling, it does not care whether the job is run manually or by schedule.
>>>>>>>>
>>>>>>>> The issue is likely to be that you changed some other detail of the job definition that might have affected how documents are indexed. In that case, MCF would cause all documents to be recrawled because of that. Changes to a job's document specification information will cause that to be the case.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Apr 1, 2016 at 3:40 AM, Najman, Radko wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I have a few jobs crawling documents from Documentum. Some of these jobs are quite big, and the first run of the job takes a few hours or a day to finish. Then, when I do a “minimal run” for updates, the job is usually done in a few minutes.
>>>>>>>>>
>>>>>>>>> I want to schedule these jobs for daily runs. I’m experiencing that the first scheduled run takes the same time as when I ran the job manually for the first time. It seems it is recrawling all documents. The next scheduled runs are fast, a few minutes. Is this expected behaviour? I would expect the first scheduled run to be fast too, because the job was already finished before by a manual start. Is there a way to avoid recrawling all documents in this case? It’s a really time-consuming operation.
>>>>>>>>>
>>>>>>>>> My settings:
>>>>>>>>> Schedule type: Scan every document once
>>>>>>>>> Job invocation: Minimal
>>>>>>>>> Scheduled time: once a day
>>>>>>>>> Start method: Start when schedule window starts
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Radko
