I have tried what you told me to do, and, as you expected, the crawling resumed. What about the regular expressions? How can I build complex regular expressions in the job's Paths tab?
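[A note for readers of this thread: ManifoldCF is written in Java, so where a connector's path filters accept regular expressions, Java regex syntax applies. In Java there is no `/i` suffix; case-insensitivity is expressed with the inline `(?i)` flag, and alternation with `|` works as usual. A minimal sketch, assuming Java regex semantics for the filter fields — the share path and folder names below are hypothetical:]

```java
import java.util.regex.Pattern;

public class PathFilterDemo {
    public static void main(String[] args) {
        String path = "\\\\share\\docs\\SON_report.docx"; // hypothetical UNC path

        // Case-sensitive: lowercase "son" does not match uppercase "SON".
        System.out.println(Pattern.compile("son").matcher(path).find());      // false

        // (?i) is Java's inline ignore-case flag, the equivalent of /i.
        System.out.println(Pattern.compile("(?i)son").matcher(path).find());  // true

        // '|' alternation, e.g. excluding two folders at once
        // (folder names are hypothetical):
        Pattern exclude = Pattern.compile("(?i).*\\\\(archive|tmp)\\\\.*");
        System.out.println(exclude.matcher("\\\\share\\Archive\\old.doc").find()); // true
    }
}
```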
Thank you very much for your help.

Othman.

On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]> wrote:

> Ok, I will try it right away and let you know if it works.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]> wrote:
>
>> Oh, and you also may need to edit your options.env files to include
>> them in the classpath for startup.
>>
>> Karl
>>
>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]> wrote:
>>
>>> If you are amenable, there is another workaround you could try.
>>> Specifically:
>>>
>>> (1) Shut down all MCF processes.
>>> (2) Move the following two files from connector-common-lib to lib:
>>>
>>>     xmlbeans-2.6.0.jar
>>>     poi-ooxml-schemas-3.15.jar
>>>
>>> (3) Restart everything and see if your crawl resumes.
>>>
>>> Please let me know what happens.
>>>
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> I created a ticket for this: CONNECTORS-1450.
>>>>
>>>> One simple workaround is to use the external Tika server transformer
>>>> rather than the embedded Tika Extractor. I'm still looking into why
>>>> the jar is not being found.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>
>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>> stuck on that specific file. The job status is still Running; you
>>>>> can see it in the attached file. For your information, the job
>>>>> started yesterday.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Othman
>>>>>
>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> It looks like a dependency of Apache POI is missing. I think we
>>>>>> will need a ticket to address this, if you are indeed using the
>>>>>> binary distribution.
>>>>>>
>>>>>> Thanks!
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>
>>>>>>> I'm actually using the binary version.
>>>>>>> For security reasons, I can't send any files from my computer. I
>>>>>>> have copied the stack trace and scanned it with my cellphone. I
>>>>>>> hope it will be helpful. Meanwhile, I have read the documentation
>>>>>>> about how to restrict the crawling, and I don't think the '|' works
>>>>>>> in the specification. For instance, I would like to restrict the
>>>>>>> crawling to the documents that contain the word 'sound'. I proceed
>>>>>>> as follows: *(SON)*. The document name is in capital letters, and I
>>>>>>> noticed that it didn't take it into consideration.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Othman
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Othman,
>>>>>>>>
>>>>>>>> The way you restrict documents with the Windows Share connector
>>>>>>>> is by specifying information on the "Paths" tab in jobs that
>>>>>>>> crawl Windows shares. There is end-user documentation, both
>>>>>>>> online and distributed with all binary distributions, that
>>>>>>>> describes how to do this. Have you found it?
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello Karl,
>>>>>>>>>
>>>>>>>>> Thank you for your response. I will start using ZooKeeper and I
>>>>>>>>> will let you know if it works. I have another question to ask.
>>>>>>>>> Actually, I need to apply some filters while crawling: I don't
>>>>>>>>> want to crawl some files and some folders. Could you give me an
>>>>>>>>> example of how to use the regex? Does the regex allow using /i
>>>>>>>>> to ignore case?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Othman
>>>>>>>>>
>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Beelz,
>>>>>>>>>>
>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>> problems getting file permissions right and do not understand
>>>>>>>>>> how to shut processes down cleanly; ZooKeeper is resilient
>>>>>>>>>> against that. I highly recommend using ZooKeeper sync.
>>>>>>>>>>
>>>>>>>>>> ManifoldCF is engineered not to pull whole files into memory,
>>>>>>>>>> so you do not need huge amounts of memory. The default values
>>>>>>>>>> are more than enough for 35,000 files, which is a pretty small
>>>>>>>>>> job for ManifoldCF.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm actually not using ZooKeeper. I want to know how ZooKeeper
>>>>>>>>>>> is different from file-based sync. I also need guidance on how
>>>>>>>>>>> to manage my PC's memory: how many GB should I allocate for
>>>>>>>>>>> the agents process of ManifoldCF? Is 4 GB enough in order to
>>>>>>>>>>> crawl 35K files?
>>>>>>>>>>>
>>>>>>>>>>> Othman.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>
>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>
>>>>>>>>>>>> (1) Use ZooKeeper for sync instead of file-based sync.
>>>>>>>>>>>> (2) See whether you still get failures after that.
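[On the memory question above: in the multiprocess binary distribution, the JVM heap for each ManifoldCF process is set in its options.env.win / options.env.unix file, one JVM option per line. A hedged sketch of raising the heap ceiling to 4 GB — the exact file contents and default values vary by release, and as Karl notes the defaults are normally sufficient for a 35K-file crawl:]

```
-Xms1024m
-Xmx4096m
```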
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your quick response. I have looked into the
>>>>>>>>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting
>>>>>>>>>>>>> down process; locks may be left dangling. You must cleanup
>>>>>>>>>>>>> before restarting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> "ES (lowercase) synapses" is the Elasticsearch output
>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata
>>>>>>>>>>>>> and a file system as a repository connection. During the
>>>>>>>>>>>>> job, I don't extract the content of the documents. I was
>>>>>>>>>>>>> wondering if the issue comes from Elasticsearch?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like
>>>>>>>>>>>>>> it might go away on retry, but does not. It can be either
>>>>>>>>>>>>>> on the repository side or on the output side. If you look
>>>>>>>>>>>>>> at the Simple History in the UI, or at the manifoldcf.log
>>>>>>>>>>>>>> file, you should be able to get a better sense of what went
>>>>>>>>>>>>>> wrong. Without further information, I can't say any more.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale
>>>>>>>>>>>>>>> in France. I'm currently using your recent version,
>>>>>>>>>>>>>>> ManifoldCF 2.8. I'm working on an internal search engine,
>>>>>>>>>>>>>>> and for this reason I'm using ManifoldCF to index documents
>>>>>>>>>>>>>>> on Windows shares. I encountered a serious problem while
>>>>>>>>>>>>>>> crawling 35K documents. Most of the time, when ManifoldCF
>>>>>>>>>>>>>>> starts crawling a big document (19 MB, for example), it
>>>>>>>>>>>>>>> ends the job with the following error: "repeated service
>>>>>>>>>>>>>>> interruptions - failure processing document: software
>>>>>>>>>>>>>>> caused connection abort: socket write error". Can you give
>>>>>>>>>>>>>>> me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman BELHAJ
