Ok, I will try it right away and let you know if it works. Othman.
On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]> wrote:

Oh, and you may also need to edit your options.env files to include them in the classpath for startup.

Karl

On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]> wrote:

If you are amenable, there is another workaround you could try. Specifically:

(1) Shut down all MCF processes.
(2) Move the following two files from connector-common-lib to lib:

    xmlbeans-2.6.0.jar
    poi-ooxml-schemas-3.15.jar

(3) Restart everything and see if your crawl resumes.

Please let me know what happens.

Karl
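For reference, a minimal sketch of step (2) on Windows (the D:\ lock paths later in this thread suggest a Windows install). The directory name here is illustrative; run the moves from wherever you unpacked the binary distribution:

    cd D:\path\to\apache-manifoldcf-2.8
    move connector-common-lib\xmlbeans-2.6.0.jar lib\
    move connector-common-lib\poi-ooxml-schemas-3.15.jar lib\

The same two moves with mv work on Unix-like systems.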
On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]> wrote:

I created a ticket for this: CONNECTORS-1450.

One simple workaround is to use the external Tika server transformer rather than the embedded Tika Extractor. I'm still looking into why the jar is not being found.

Karl

On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]> wrote:

Yes, I'm actually using the latest binary version, and my job got stuck on that specific file. The job status is still Running. You can see it in the attached file. For your information, the job started yesterday.

Thanks,

Othman

On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:

It looks like a dependency of Apache POI is missing. I think we will need a ticket to address this, if you are indeed using the binary distribution.

Thanks!
Karl

On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:

I'm actually using the binary version. For security reasons, I can't send any files from my computer. I have copied the stack trace and scanned it with my cellphone. I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict the crawling, and I don't think the '|' works in the specification. For instance, I would like to restrict the crawling to documents that contain the word 'sound'. I proceed as follows: *(SON)*. The document name is in capital letters, and I noticed that it didn't take that into consideration.

Thanks,
Othman

On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:

Hi Othman,

The way you restrict documents with the Windows share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?

Karl

On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:

Hello Karl,

Thank you for your response. I will start using ZooKeeper and will let you know if it works. I have another question: I need to apply some filters while crawling, because I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex allow using /i to ignore case?

Thanks,
Othman
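A note on the /i question above: the trailing /i modifier is Perl and JavaScript syntax, and Java-style regular expressions (ManifoldCF is a Java application) do not accept it; the equivalent is the embedded (?i) flag at the start of the expression. A minimal sketch, assuming the connector's specification fields take standard Java regexes (the file names are made up for illustration):

    import java.util.regex.Pattern;

    public class CaseInsensitiveMatch {
        public static void main(String[] args) {
            // (?i) turns on case-insensitive matching for the rest of the
            // expression, so "son" also matches "SON", "Son", and so on.
            Pattern p = Pattern.compile("(?i).*son.*");
            System.out.println(p.matcher("RAPPORT_SON.docx").matches());  // true
            System.out.println(p.matcher("rapport_son.docx").matches());  // true
            System.out.println(p.matcher("budget.xlsx").matches());       // false
        }
    }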
On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:

Hi Beelz,

File-based sync is deprecated because people often have problems getting file permissions right and do not understand how to shut processes down cleanly; ZooKeeper is resilient against that. I highly recommend using ZooKeeper sync.

ManifoldCF is engineered not to pull files into memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.

Thanks,
Karl

On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:

I'm actually not using ZooKeeper. I want to know: how is ZooKeeper different from file-based sync? I also need guidance on how to manage my PC's memory. How many GB should I allocate for the ManifoldCF agents process? Is 4 GB enough to crawl 35K files?

Othman.

On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:

Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.

I would suggest two things:

(1) Use ZooKeeper for sync instead of file-based sync.
(2) Have a look at whether you still get failures after that.

Thanks,
Karl

On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:

Hi Mr Karl,

Thank you for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:

- Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\syncharea\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed : Access is denied.

- Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must cleanup before restarting.

'ES (Lowercase) Synapses' is the Elasticsearch output connection. Moreover, the job uses Tika to extract metadata, and a file system as a repository connection. During the job, I don't extract the content of the documents. I was wondering if the issue comes from Elasticsearch?

Othman.

On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:

Hi Othman,

ManifoldCF aborts a job if there's an error that looks like it might go away on retry, but does not. It can be either on the repository side or on the output side. If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.

Thanks,
Karl

On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:

Hello,

I'm Othman Belhaj, a software engineer from Société Générale in France. I'm currently using the recent version of ManifoldCF, 2.8, to build an internal search engine; for this reason, I'm using ManifoldCF to index documents on Windows shares. I encountered a serious problem while crawling 35K documents. Most of the time, when ManifoldCF starts crawling a large document (19 MB, for example), it ends the job with the following error: "repeated service interruptions - failure processing document: software caused connection abort: socket write error". Can you give me some tips on how to solve this problem, please?

I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
I'm looking forward to your response.

Best regards,

Othman BELHAJ
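A footnote on Karl's ZooKeeper recommendation above: switching away from file-based sync means running the multiprocess-zk-example processes instead of the multiprocess-file-example ones, and pointing properties.xml at a ZooKeeper ensemble. A rough sketch of the relevant entries (property names as I recall them from the 2.8 multiprocess-zk-example; verify against the properties.xml shipped there, and the connect string is illustrative):

    <property name="org.apache.manifoldcf.lockmanagerclass"
              value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
    <property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
    <property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>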

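And on the 4 GB question: in the multiprocess examples, the agents-process JVM options live in the options.env.win / options.env.unix files, one option per line, so the heap ceiling is whatever -Xmx line appears there. A sketch only, with illustrative values (per Karl's answer above, the shipped defaults already cover a 35K-document crawl, since ManifoldCF streams documents rather than holding them in memory):

    -Xms512m
    -Xmx512m

Leave the other shipped options in those files (such as the -D define pointing at properties.xml) as they are.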