No problem. I checked in what I think is the right fix this morning. Hope it looks ok to you?
Karl

On Sun, May 10, 2015 at 6:52 AM, Rafa Haro <[email protected]> wrote:

Hi Karl,

I was not meaning to request anything, just dumping some thoughts. I can take care of the rest.

Thanks!!!

On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:

Hi Rafa,

Two points.

First, Alessandro's case is arguably unsolvable by any mechanism that doesn't involve a sidecar process, because the credential tokens must be continually refreshed whether or not a job is running, or even whether mcf is up at all. I don't know how to solve that within the canon of mcf.

Second, connection management is really quite central to mcf. Independent connection instances are the only way you can hope to do connection throttling across a cluster, for instance. Given that, you'd have to have a pretty compelling case to request a rearchitecture, no?

So -- are you going to finish the work for CONNECTORS-1198, or should I?

Karl

Sent from my Windows Phone
------------------------------
From: Rafa Haro
Sent: 5/9/2015 6:35 AM
To: [email protected]
Cc: Rafa Haro; Timo Selvaraj
Subject: Re: File system continuous crawl settings

Hi Karl,

I understand. The thing is, as Alessandro also pointed out some days ago, it is not unusual to find situations where you might want to initialize resources only once per job execution. That seems to be impossible with the current architecture, but it also seems to make a lot of sense to have that possibility.

Should we consider including that functionality? Some initializations can be expensive, and it is not always possible to use a singleton.

Thanks Karl!

On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:

Hi Rafa,

The problem was twofold.

As stated before, the manifoldcf model for managing connections is that connection instances operate independently of each other. If what is required to set up the connection depends on the job, it defeats the whole manifoldcf pooling management strategy, since connections are swapped between jobs completely outside the control of the connector writer. So trying to be clever here buys you little.

The actual failure also involved the use of variables which were uninitialized.

In other connectors where pooling can be defined at levels other than just in mcf, the standard is to use a hardwired pool size of 1 for those cases. See the jira connector, for example. For searchblox, the only parameters other than pool size that you set this way are socket and connection timeout. In every other connector we have, these are connection parameters, not specification parameters. I don't see any reason searchblox should be different.

Karl
Sent from my Windows Phone
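[A minimal sketch of the pattern described above -- illustrative only, not the actual SearchBlox connector code. It assumes Apache HttpClient 4.x, and the field names are invented; the point is the hardwired pool size of 1 and the socket/connection timeouts coming from the connection configuration rather than from the job specification.]

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class SearchBloxSessionSketch {

  // Connection (not specification) parameters; the names are illustrative
  // and would be filled in when the connection configuration is read.
  private String socketTimeoutParam = "60000";
  private String connectionTimeoutParam = "30000";

  private CloseableHttpClient client = null;

  // Lazily build the HTTP session the first time it is needed.
  protected CloseableHttpClient getSession() {
    if (client == null) {
      int socketTimeout = Integer.parseInt(socketTimeoutParam);
      int connectionTimeout = Integer.parseInt(connectionTimeoutParam);

      // Hardwired pool size of 1: the framework already pools connector
      // instances, so each instance holds exactly one HTTP connection.
      PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
      cm.setMaxTotal(1);
      cm.setDefaultMaxPerRoute(1);

      RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(connectionTimeout)
        .setSocketTimeout(socketTimeout)
        .build();

      client = HttpClients.custom()
        .setConnectionManager(cm)
        .setDefaultRequestConfig(config)
        .build();
    }
    return client;
  }
}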
------------------------------
From: Rafa Haro
Sent: 5/9/2015 4:56 AM
To: [email protected]
Cc: Timo Selvaraj
Subject: Re: File system continuous crawl settings

Hi Karl and Tim,

Karl, you were too fast and didn't give me time to take a look at the issue after confirming that it was a connector issue. Thanks for addressing it anyway. I will take a look at your changes, but the job parameters seem to make more sense per job rather than in the connection configuration, because they customize the pool of HTTP connections to the SearchBlox server. This could be redundant with the manifold thread management, but the idea was for the threads to use that pool rather than each thread creating a single connection resource.

As we have observed before, we found it challenging to create shared resources for the whole job in the getSession method, and tried to work around it with class member variables as flags.

Where exactly was the problem with the session management?

Cheers,
Rafa
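[A hypothetical sketch of the per-job initialization trap under discussion -- not the real connector code, and the names are invented. If a member flag and a shared resource are only set on the code path that carries the job specification, a worker thread that reaches getSession() some other way, or after the framework has swapped the instance to another job, sees them uninitialized, which is consistent with the NullPointerException reported further down the thread.]

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class PerJobStateSketch {

  private boolean initializedForJob = false;
  private CloseableHttpClient pool = null;

  // Called only from the code path that receives the job specification.
  public void initFromJobSpecification() {
    pool = HttpClients.createDefault();
    initializedForJob = true;
  }

  // Called from every worker thread, including paths (such as document
  // deletion) that never see the job specification, and after the framework
  // has handed this connection instance to a different job.
  protected CloseableHttpClient getSession() {
    // No null guard: if the initializing path never ran for this instance,
    // pool is still null and the caller eventually hits a
    // NullPointerException like the one in the worker-thread log below.
    return pool;
  }
}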
On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:

Hi Timo,

I've taken a deep look at the SearchBlox code and found a significant problem. I've created a patch for you to address it, although it is not the final fix. The patch should work on either 2.1 or 1.9. See CONNECTORS-1198 for complete details.

Please let me know ASAP if the patch does not solve your immediate problem, since I will be making other changes to the connector to bring it in line with ManifoldCF standards.

Karl

On Fri, May 8, 2015 at 8:01 PM, Karl Wright <[email protected]> wrote:

That error is what I was afraid of.

We need the complete exception trace. Can you find that and create a ticket, including the complete trace?

My apologies; the searchblox connector is a contribution which obviously still has bugs. With the trace, though, I should be able to get you a patch.

Karl

Sent from my Windows Phone
------------------------------
From: Timo Selvaraj
Sent: 5/8/2015 6:46 PM
To: Karl Wright
Cc: [email protected]
Subject: Re: File system continuous crawl settings

Hi Karl,

The only error message which seems to be continuously thrown in the manifold log is:

FATAL 2015-05-08 18:42:47,043 (Worker thread '40') - Error tossed: null
java.lang.NullPointerException

I do notice that the file that needs to be deleted is shown under the Queue Status report and keeps jumping between "Processing" and "About to Process" statuses every 30 seconds.

Timo

On May 8, 2015, at 1:40 PM, Karl Wright <[email protected]> wrote:

Hi Timo,

As I said, I don't think your configuration is the source of the delete issue. I suspect the searchblox connector.

In the absence of a thread dump, can you look for exceptions in the manifoldcf log?

Karl

Sent from my Windows Phone
------------------------------
From: Timo Selvaraj
Sent: 5/8/2015 10:06 AM
To: [email protected]
Subject: Re: File system continuous crawl settings

When I change the settings to the following, updated or modified documents are now indexed, but deleting the documents that are removed is still an issue:

Schedule type: Rescan documents dynamically
Minimum recrawl interval: 5 minutes
Maximum recrawl interval: 10 minutes
Expiration interval: Infinity
Reseed interval: 60 minutes
No scheduled run times
Maximum hop count for link type 'child': Unlimited
Hop count mode: Delete unreachable documents

Do I need to set the reseed interval to Infinity?

Any thoughts?

On May 8, 2015, at 6:18 AM, Karl Wright <[email protected]> wrote:

I just tried your configuration here. A deleted document in the file system was indeed picked up as expected.

I did notice that your "expiration" setting is, essentially, cleaning out documents at a rapid clip. With this setting, documents will be expired before they are recrawled. You probably want one strategy or the other, but not both.

As for why a deleted document is "stuck" in Processing: the only thing I can think of is that the output connection you've chosen is having trouble deleting the document from the index. What output connector are you using?

Karl

On Fri, May 8, 2015 at 4:36 AM, Timo Selvaraj <[email protected]> wrote:

Hi,

We are testing the continuous crawl feature for the file system connector on a small folder, to check whether new documents added to the folder, documents removed from it, and modified documents are all handled by the continuous crawl job.

Here are the settings we use:

Schedule type: Rescan documents dynamically
Minimum recrawl interval: 5 minutes
Maximum recrawl interval: 10 minutes
Expiration interval: 5 minutes
Reseed interval: 10 minutes
No scheduled run times
Maximum hop count for link type 'child': Unlimited
Hop count mode: Delete unreachable documents

New documents seem to be getting picked up by the job; however, removal of a document or an update to a document is not being picked up.

Am I missing any settings for the deletions or updates? I do see that the document that has been removed is showing as Processing under Queue Status, and others are showing as Waiting for Processing.

Any idea what setting is missing for the deletes/updates to be recognized and re-indexed?

Thanks,
Timo
