Hi Karl, I didn't mean to request anything, just to dump some thoughts. I can take care of the rest.
Thanks!!!

On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:

> Hi Rafa,
>
> Two points.
>
> First, Alessandro's case is arguably unsolvable by any mechanism that doesn't involve a sidecar process, because the credential tokens must be continually refreshed whether or not a job is running, or MCF is even up. I don't know how to solve that within the canon of MCF.
>
> Second, connection management is really quite central to MCF. Independent connection instances are the only way you can hope to do connection throttling across a cluster, for instance. Given that, you'd have to have a pretty compelling case to request a rearchitecture, no?
>
> So -- are you going to finish the work for CONNECTORS-1198, or should I?
>
> Karl
>
> Sent from my Windows Phone
> ------------------------------
> From: Rafa Haro
> Sent: 5/9/2015 6:35 AM
> To: [email protected]
> Cc: Rafa Haro; Timo Selvaraj
> Subject: Re: File system continuous crawl settings
>
> Hi Karl,
>
> I understand. The thing is, as Alessandro also pointed out some days ago, it is not unusual to find situations where you want to initialize resources only once per job execution. That seems to be impossible with the current architecture, but having that possibility also seems to make a lot of sense.
>
> Should we consider including that functionality? Some initializations can be expensive, and it is not always possible to use a singleton.
>
> Thanks Karl!
>
> On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:
>
>> Hi Rafa,
>>
>> The problem was twofold.
>>
>> As stated before, the ManifoldCF model for managing connections is that connection instances operate independently of each other. If what is required to set up the connection depends on the job, it defeats the whole ManifoldCF pooling management strategy, since connections are swapped between jobs completely outside the control of the connector writer. So trying to be clever here buys you little.
>>
>> The actual failure also involved the use of variables that were uninitialized.
>>
>> In other connectors where pooling can be defined at levels other than just in MCF, the standard is to use a hardwired pool size of 1 for those cases. See the Jira connector, for example. For SearchBlox, the only parameters other than pool size that you set this way are the socket and connection timeouts. In every other connector we have, these are connection parameters, not specification parameters. I don't see any reason SearchBlox should be different.
>>
>> Karl
>> Sent from my Windows Phone
>> ------------------------------
>> From: Rafa Haro
>> Sent: 5/9/2015 4:56 AM
>> To: [email protected]
>> Cc: Timo Selvaraj
>> Subject: Re: File system continuous crawl settings
>>
>> Hi Karl and Tim,
>>
>> Karl, you were too fast and didn't give me time to take a look at the issue after confirming that it was a connector issue. Thanks for addressing it anyway. I will take a look at your changes, but the job parameters would make more sense per job rather than at connection configuration, because they customize the pool of HTTP connections to the SearchBlox server. This could be redundant with the ManifoldCF thread management, but the idea was for the threads to use that pool rather than creating a single connection resource per thread.
>>
>> As we have observed before, we found it challenging to create shared resources for the whole job in the getSession method, and tried to trick it with class member variables as flags.
>>
>> Where exactly was the problem with the session management?
>>
>> Cheers,
>> Rafa
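For anyone following along, here is a minimal sketch of the pattern being discussed, with hypothetical class, field, and method names (this is not the actual SearchBlox connector code). It shows why a per-job "initialized" flag inside a getSession-style method is fragile once ManifoldCF starts swapping pooled connection instances between jobs:

    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    // Illustrative only: a connector that tries to build an expensive
    // shared resource "once per job" using a member flag.
    public class ExampleConnector {
      private CloseableHttpClient client = null;
      private boolean initializedForThisJob = false;  // the "flag" trick

      protected void getSession() {
        // This looks like one-time-per-job initialization, but ManifoldCF
        // pools connection instances and hands them to whichever job needs
        // one, entirely outside the connector's control. The flag may already
        // be true when a different job receives this instance, or the
        // instance may be recycled mid-job, so per-job state kept here is
        // unreliable.
        if (!initializedForThisJob) {
          client = HttpClients.custom()
              .setMaxConnTotal(10)      // pool sizing: per Karl, this belongs in the
              .setMaxConnPerRoute(10)   // connection configuration, not in the job
              .build();
          initializedForThisJob = true;
        }
      }
    }

In other words, anything that must be set up exactly once per job execution has no safe home in a pooled connection instance, which is why Karl suggests keeping pool size and timeouts as connection parameters (hardwired to 1 where pooling is handled elsewhere) rather than job/specification parameters.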
>>
>> On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:
>>
>>> Hi Timo,
>>>
>>> I've taken a deep look at the SearchBlox code and found a significant problem. I've created a patch for you to address it, although it is not the final fix. The patch should work on either 2.1 or 1.9. See CONNECTORS-1198 for complete details.
>>>
>>> Please let me know ASAP if the patch does not solve your immediate problem, since I will be making other changes to the connector to bring it in line with ManifoldCF standards.
>>>
>>> Karl
>>>
>>> On Fri, May 8, 2015 at 8:01 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> That error is what I was afraid of.
>>>>
>>>> We need the complete exception trace. Can you find that and create a ticket, including the complete trace?
>>>>
>>>> My apologies; the SearchBlox connector is a contribution which obviously still has bugs. With the trace, though, I should be able to get you a patch.
>>>>
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: Timo Selvaraj
>>>> Sent: 5/8/2015 6:46 PM
>>>> To: Karl Wright
>>>> Cc: [email protected]
>>>> Subject: Re: File system continuous crawl settings
>>>>
>>>> Hi Karl,
>>>>
>>>> The only error message which seems to be continuously thrown in the ManifoldCF log is:
>>>>
>>>> FATAL 2015-05-08 18:42:47,043 (Worker thread '40') - Error tossed: null
>>>> java.lang.NullPointerException
>>>>
>>>> I do notice that the file that needs to be deleted is shown under the Queue Status report and keeps jumping between “Processing” and “About to Process” statuses every 30 seconds.
>>>>
>>>> Timo
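That "Error tossed: null" is consistent with what Karl mentions above about uninitialized variables. Purely as an illustration, continuing the hypothetical ExampleConnector sketch from earlier in the thread (again, not the real SearchBlox code or the actual bug):

    import org.apache.http.client.methods.HttpDelete;

    // Illustrative only: if getSession() was never invoked for this pooled
    // instance, or the flag got set without the client actually being built,
    // 'client' is still null here and the worker thread logs
    // "Error tossed: null" with a java.lang.NullPointerException.
    public void deleteFromIndex(String documentURI) throws Exception {
      client.execute(new HttpDelete(documentURI));  // NPE when client == null

      // Defensive version: establish the session unconditionally first.
      // getSession();
      // if (client == null)
      //   throw new IllegalStateException("SearchBlox session not initialized");
    }

The actual fix is whatever the CONNECTORS-1198 patch does; the above only shows the general shape of that kind of failure.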
>>>> On May 8, 2015, at 1:40 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>> Hi Timo,
>>>>
>>>> As I said, I don't think your configuration is the source of the delete issue. I suspect the SearchBlox connector.
>>>>
>>>> In the absence of a thread dump, can you look for exceptions in the ManifoldCF log?
>>>>
>>>> Karl
>>>>
>>>> Sent from my Windows Phone
>>>> ------------------------------
>>>> From: Timo Selvaraj
>>>> Sent: 5/8/2015 10:06 AM
>>>> To: [email protected]
>>>> Subject: Re: File system continuous crawl settings
>>>>
>>>> When I change the settings to the following, updated or modified documents are now indexed, but deleting documents that have been removed is still an issue:
>>>>
>>>> Schedule type: Rescan documents dynamically
>>>> Minimum recrawl interval: 5 minutes
>>>> Maximum recrawl interval: 10 minutes
>>>> Expiration interval: Infinity
>>>> Reseed interval: 60 minutes
>>>> No scheduled run times
>>>> Maximum hop count for link type 'child': Unlimited
>>>> Hop count mode: Delete unreachable documents
>>>>
>>>> Do I need to set the reseed interval to Infinity?
>>>>
>>>> Any thoughts?
>>>>
>>>> On May 8, 2015, at 6:18 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>> I just tried your configuration here. A deleted document in the file system was indeed picked up as expected.
>>>>
>>>> I did notice that your "expiration" setting is, essentially, cleaning out documents at a rapid clip. With this setting, documents will be expired before they are recrawled. You probably want one strategy or the other, but not both.
>>>>
>>>> As for why a deleted document is "stuck" in Processing: the only thing I can think of is that the output connection you've chosen is having trouble deleting the document from the index. What output connector are you using?
>>>>
>>>> Karl
>>>>
>>>> On Fri, May 8, 2015 at 4:36 AM, Timo Selvaraj <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are testing the continuous crawl feature for the file system connector on a small folder, to verify that new documents added to the folder, documents removed from it, and modified documents are all handled by the continuous crawl job.
>>>>>
>>>>> Here are the settings we use:
>>>>>
>>>>> Schedule type: Rescan documents dynamically
>>>>> Minimum recrawl interval: 5 minutes
>>>>> Maximum recrawl interval: 10 minutes
>>>>> Expiration interval: 5 minutes
>>>>> Reseed interval: 10 minutes
>>>>> No scheduled run times
>>>>> Maximum hop count for link type 'child': Unlimited
>>>>> Hop count mode: Delete unreachable documents
>>>>>
>>>>> Adding new documents seems to be getting picked up by the job; however, removal of a document or an update to a document is not being picked up.
>>>>>
>>>>> Am I missing any settings for the deletions or updates? I do see that the document that has been removed is showing as Processing under Queue Status, and others are showing as Waiting for Processing.
>>>>>
>>>>> Any idea what setting is missing for the deletes/updates to be recognized and re-indexed?
>>>>>
>>>>> Thanks,
>>>>> Timo
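For reference, the interaction Karl points out can be read straight off the original intervals (expiration 5 minutes, recrawl 5-10 minutes). A rough timeline for a single already-indexed document, assuming the intervals are measured from its last processing time:

    t = 0 min      document crawled and indexed
    t = 5 min      expiration interval reached -> document eligible to be expired
    t = 5-10 min   document would become due for its dynamic recrawl

So with an expiration interval of 5 minutes, the document can be expired before it is ever recrawled, which is why the revised settings (Expiration interval: Infinity) keep only one of the two strategies, as Karl recommends.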
