No problem. I checked in what I think is the right fix this morning. Hope it looks ok to you?
Karl

On Sun, May 10, 2015 at 6:52 AM, Rafa Haro <[email protected]> wrote:

Hi Karl,

I was not meaning to request anything, just dumping some thoughts. I can take care of the rest.

Thanks!!!

On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:

Hi Rafa,

Two points.

First, Alessandro's case is arguably unsolvable by any mechanism that doesn't involve a sidecar process, because the credential tokens must be continually refreshed whether or not a job is running, or even whether mcf is up at all. I don't know how to solve that within the canon of mcf.

Second, connection management is really quite central to mcf. Independent connection instances are the only way you can hope to do connection throttling across a cluster, for instance. Given that, you'd have to have a pretty compelling case to request a rearchitecture, no?

So -- are you going to finish the work for CONNECTORS-1198, or should I?

Karl

Sent from my Windows Phone
------------------------------
From: Rafa Haro
Sent: 5/9/2015 6:35 AM
To: [email protected]
Cc: Rafa Haro; Timo Selvaraj
Subject: Re: File system continuous crawl settings

Hi Karl,

I understand. The thing is, as Alessandro also pointed out some days ago, it is not unusual to find situations where you might want to initialize resources only once per job execution. That seems to be impossible with the current architecture, but it also seems to make a lot of sense to have that possibility.

Should we consider including that functionality? Some initializations can be expensive, and it is not always possible to use a singleton.

Thanks Karl!

On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:

Hi Rafa,

The problem was twofold.

As stated before, the manifoldcf model for managing connections is that connection instances operate independently of each other. If what is required to set up the connection depends on the job, it defeats the whole manifoldcf pooling management strategy, since connections are swapped between jobs completely outside the control of the connector writer. So trying to be clever here buys you little.

The actual failure also involved the use of variables which were uninitialized.

In other connectors where pooling can be defined at levels other than just in mcf, the standard is to use a hardwired pool size of 1 for those cases. See the jira connector, for example. For searchblox, the only parameters other than pool size that you set this way are socket and connection timeout. In every other connector we have, these are connection parameters, not specification parameters. I don't see any reason searchblox should be different.

Karl
Sent from my Windows Phone
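[A minimal sketch of the pattern described above -- illustrative only, not the actual SearchBlox connector code. It assumes Apache HttpClient 4.x, and the field names are invented; the point is the hardwired pool size of 1 and the socket/connection timeouts coming from the connection configuration rather than from the job specification.]

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class SearchBloxSessionSketch {

  // Connection (not specification) parameters; the names are illustrative
  // and would be filled in when the connection configuration is read.
  private String socketTimeoutParam = "60000";
  private String connectionTimeoutParam = "30000";

  private CloseableHttpClient client = null;

  // Lazily build the HTTP session the first time it is needed.
  protected CloseableHttpClient getSession() {
    if (client == null) {
      int socketTimeout = Integer.parseInt(socketTimeoutParam);
      int connectionTimeout = Integer.parseInt(connectionTimeoutParam);

      // Hardwired pool size of 1: the framework already pools connector
      // instances, so each instance holds exactly one HTTP connection.
      PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
      cm.setMaxTotal(1);
      cm.setDefaultMaxPerRoute(1);

      RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(connectionTimeout)
        .setSocketTimeout(socketTimeout)
        .build();

      client = HttpClients.custom()
        .setConnectionManager(cm)
        .setDefaultRequestConfig(config)
        .build();
    }
    return client;
  }
}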
------------------------------
From: Rafa Haro
Sent: 5/9/2015 4:56 AM
To: [email protected]
Cc: Timo Selvaraj
Subject: Re: File system continuous crawl settings

Hi Karl and Tim,

Karl, you were too fast and didn't give me time to take a look at the issue after confirming that it was a connector issue. Thanks for addressing it anyway. I will take a look at your changes, but the job parameters seem to make more sense per job rather than in the connection configuration, because they customize the pool of HTTP connections to the SearchBlox server. This could be redundant with the manifold thread management, but the idea was for the threads to use that pool rather than each thread creating a single connection resource.

As we have observed before, we found it challenging to create shared resources for the whole job in the getSession method, and tried to work around it with class member variables as flags.

Where exactly was the problem with the session management?

Cheers,
Rafa
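[A hypothetical sketch of the per-job initialization trap under discussion -- not the real connector code, and the names are invented. If a member flag and a shared resource are only set on the code path that carries the job specification, a worker thread that reaches getSession() some other way, or after the framework has swapped the instance to another job, sees them uninitialized, which is consistent with the NullPointerException reported further down the thread.]

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class PerJobStateSketch {

  private boolean initializedForJob = false;
  private CloseableHttpClient pool = null;

  // Called only from the code path that receives the job specification.
  public void initFromJobSpecification() {
    pool = HttpClients.createDefault();
    initializedForJob = true;
  }

  // Called from every worker thread, including paths (such as document
  // deletion) that never see the job specification, and after the framework
  // has handed this connection instance to a different job.
  protected CloseableHttpClient getSession() {
    // No null guard: if the initializing path never ran for this instance,
    // pool is still null and the caller eventually hits a
    // NullPointerException like the one in the worker-thread log below.
    return pool;
  }
}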
On Saturday, May 9, 2015, Karl Wright <[email protected]> wrote:

Hi Timo,

I've taken a deep look at the SearchBlox code and found a significant problem. I've created a patch for you to address it, although it is not the final fix. The patch should work on either 2.1 or 1.9. See CONNECTORS-1198 for complete details.

Please let me know ASAP if the patch does not solve your immediate problem, since I will be making other changes to the connector to bring it in line with ManifoldCF standards.

Karl

On Fri, May 8, 2015 at 8:01 PM, Karl Wright <[email protected]> wrote:

That error is what I was afraid of.

We need the complete exception trace. Can you find that and create a ticket, including the complete trace?

My apologies; the searchblox connector is a contribution which obviously still has bugs. With the trace, though, I should be able to get you a patch.

Karl

Sent from my Windows Phone
------------------------------
From: Timo Selvaraj
Sent: 5/8/2015 6:46 PM
To: Karl Wright
Cc: [email protected]
Subject: Re: File system continuous crawl settings

Hi Karl,

The only error message which seems to be continuously thrown in the manifold log is:

FATAL 2015-05-08 18:42:47,043 (Worker thread '40') - Error tossed: null
java.lang.NullPointerException

I do notice that the file that needs to be deleted is shown under the Queue Status report and keeps jumping between "Processing" and "About to Process" statuses every 30 seconds.

Timo

On May 8, 2015, at 1:40 PM, Karl Wright <[email protected]> wrote:

Hi Timo,

As I said, I don't think your configuration is the source of the delete issue. I suspect the searchblox connector.

In the absence of a thread dump, can you look for exceptions in the manifoldcf log?

Karl

Sent from my Windows Phone
------------------------------
From: Timo Selvaraj
Sent: 5/8/2015 10:06 AM
To: [email protected]
Subject: Re: File system continuous crawl settings

When I change the settings to the following, updated or modified documents are now indexed, but deleting the documents that are removed is still an issue:

Schedule type: Rescan documents dynamically
Minimum recrawl interval: 5 minutes
Maximum recrawl interval: 10 minutes
Expiration interval: Infinity
Reseed interval: 60 minutes
No scheduled run times
Maximum hop count for link type 'child': Unlimited
Hop count mode: Delete unreachable documents

Do I need to set the reseed interval to Infinity?

Any thoughts?

On May 8, 2015, at 6:18 AM, Karl Wright <[email protected]> wrote:

I just tried your configuration here. A deleted document in the file system was indeed picked up as expected.

I did notice that your "expiration" setting is, essentially, cleaning out documents at a rapid clip. With this setting, documents will be expired before they are recrawled. You probably want one strategy or the other, but not both.

As for why a deleted document is "stuck" in Processing: the only thing I can think of is that the output connection you've chosen is having trouble deleting the document from the index. What output connector are you using?

Karl

On Fri, May 8, 2015 at 4:36 AM, Timo Selvaraj <[email protected]> wrote:

Hi,

We are testing the continuous crawl feature for the file system connector on a small folder, to check whether new documents added to the folder, documents removed from it, and modified documents are all handled by the continuous crawl job.

Here are the settings we use:

Schedule type: Rescan documents dynamically
Minimum recrawl interval: 5 minutes
Maximum recrawl interval: 10 minutes
Expiration interval: 5 minutes
Reseed interval: 10 minutes
No scheduled run times
Maximum hop count for link type 'child': Unlimited
Hop count mode: Delete unreachable documents

New documents seem to be getting picked up by the job; however, removal of a document or an update to a document is not being picked up.

Am I missing any settings for the deletions or updates? I do see that the document that has been removed is showing as Processing under Queue Status, and others are showing as Waiting for Processing.

Any idea what setting is missing for the deletes/updates to be recognized and re-indexed?

Thanks,
Timo
