Re: Question about ManifoldCF 2.8

Beelz Ryuzaki Thu, 31 Aug 2017 06:16:02 -0700

Oh, actually it didn't solve the problem. I looked into the log file and
saw the following error:


Error tossed : org/apache/poi/POIXMLTypeLoader
java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.

Maybe another jar is missing?
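
One way to check this is to search the jar files under the install directory for the class named in the error. Below is a minimal sketch in Python, not part of ManifoldCF. The function name `find_class_in_jars` and the root directory `.` are hypothetical. It relies only on the fact that a jar is a zip archive that stores each class as an entry such as `org/apache/poi/POIXMLTypeLoader.class`.

```python
import zipfile
from pathlib import Path


def find_class_in_jars(root, class_name):
    """Return the paths of all .jar files under root that contain class_name."""
    # Accept both dotted and slash-separated names, e.g.
    # "org.apache.poi.POIXMLTypeLoader" or "org/apache/poi/POIXMLTypeLoader".
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(root).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if entry in archive.namelist():
                    hits.append(str(jar))
        except zipfile.BadZipFile:
            # Skip files that have a .jar extension but are not valid archives.
            continue
    return hits


if __name__ == "__main__":
    # Assumption: the script is run from the root of the ManifoldCF install.
    for path in find_class_in_jars(".", "org.apache.poi.POIXMLTypeLoader"):
        print(path)
```

If the search prints nothing, no jar in the tree provides the class. If it prints a jar that sits outside the directories on the startup classpath, that would be consistent with the class not being found at run time.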

Othman.

On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <[email protected]> wrote:

> I have tried what you told me to do and, as you expected, the crawling
> resumed. What about the regular expressions? How can I write complex
> regular expressions in the job's Paths tab?
>
> Thank you very much for your help.
>
> Othman.
>
>
> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]> wrote:
>
>> Ok, I will try it right away and let you know if it works.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]> wrote:
>>
>>> Oh, and you also may need to edit your options.env files to include them
>>> in the classpath for startup.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> If you are amenable, there is another workaround you could try.
>>>> Specifically:
>>>>
>>>> (1) Shut down all MCF processes.
>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>
>>>> xmlbeans-2.6.0.jar
>>>> poi-ooxml-schemas-3.15.jar
>>>>
>>>> (3) Restart everything and see if your crawl resumes.
>>>>
>>>> Please let me know what happens.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>
>>>>> One simple workaround is to use the external Tika server transformer
>>>>> rather than the embedded Tika Extractor.  I'm still looking into why the
>>>>> jar is not being found.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>>> stuck on that specific file.
>>>>>> The job status is still Running. You can see it in the attached file.
>>>>>> For your information, the job started yesterday.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Othman
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>> I think we will need a ticket to address this, if you are indeed
>>>>>>> using the binary distribution.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm actually using the binary version. For security reasons, I
>>>>>>>> can't send any files from my computer. I have copied the stack trace
>>>>>>>> and scanned it with my cellphone. I hope it will be helpful.
>>>>>>>> Meanwhile, I have read the documentation about how to restrict the
>>>>>>>> crawling, and I don't think the '|' works in the specification. For
>>>>>>>> instance, I would like to restrict the crawling to the documents
>>>>>>>> that contain the word 'sound'. I proceed as follows: *(SON)* .
>>>>>>>> The document is in capital letters, and I noticed that the filter
>>>>>>>> didn't take it into consideration.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Othman
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Othman,
>>>>>>>>>
>>>>>>>>> The way you restrict documents with the Windows share connector is
>>>>>>>>> by specifying information on the "Paths" tab in jobs that crawl
>>>>>>>>> Windows shares.  There is end-user documentation, both online and
>>>>>>>>> distributed with all binary distributions, that describes how to do
>>>>>>>>> this.  Have you found it?
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Karl,
>>>>>>>>>>
>>>>>>>>>> Thank you for your response. I will start using ZooKeeper and I
>>>>>>>>>> will let you know if it works. I have another question. I need to
>>>>>>>>>> apply some filters while crawling: I don't want to crawl certain
>>>>>>>>>> files and folders. Could you give me an example of how to use the
>>>>>>>>>> regex? Does the regex allow /i to ignore case?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Othman
>>>>>>>>>>
>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>
>>>>>>>>>>> File-based sync is deprecated because people often have problems
>>>>>>>>>>> getting file permissions right, and they do not understand how to
>>>>>>>>>>> shut processes down cleanly, and ZooKeeper is resilient against
>>>>>>>>>>> that.  I highly recommend using ZooKeeper sync.
>>>>>>>>>>>
>>>>>>>>>>> ManifoldCF is engineered to not put files into memory, so you do
>>>>>>>>>>> not need huge amounts of memory.  The default values are more than
>>>>>>>>>>> enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm actually not using ZooKeeper. How is ZooKeeper different from
>>>>>>>>>>>> file-based sync? I also need some guidance on how to manage my
>>>>>>>>>>>> PC's memory. How many GB should I allocate for the start-agent of
>>>>>>>>>>>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>
>>>>>>>>>>>> Othman.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have looked into
>>>>>>>>>>>>>> the ManifoldCF log file and extracted the following warnings :
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>>>>>>>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting
>>>>>>>>>>>>>> down process; locks may be left dangling. You must cleanup before
>>>>>>>>>>>>>> restarting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "ES (Lowercase) Synapses" is the Elasticsearch output
>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata and
>>>>>>>>>>>>>> a file system as the repository connection. During the job, I
>>>>>>>>>>>>>> don't extract the content of the documents. I was wondering
>>>>>>>>>>>>>> whether the issue comes from Elasticsearch?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like
>>>>>>>>>>>>>>> it might go away on retry, but does not.  The error can be on
>>>>>>>>>>>>>>> either the repository side or the output side.  If you look at
>>>>>>>>>>>>>>> the Simple History in the UI, or at the manifoldcf.log file, you
>>>>>>>>>>>>>>> should be able to get a better sense of what went wrong.  Without
>>>>>>>>>>>>>>> further information, I can't say any more.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale
>>>>>>>>>>>>>>>> in France. I'm using your recent release, ManifoldCF 2.8, to
>>>>>>>>>>>>>>>> build an internal search engine that indexes documents on
>>>>>>>>>>>>>>>> Windows shares. I encountered a serious problem while crawling
>>>>>>>>>>>>>>>> 35K documents. Most of the time, when ManifoldCF starts crawling
>>>>>>>>>>>>>>>> a large document (19 MB, for example), it ends the job with the
>>>>>>>>>>>>>>>> following error: repeated service interruptions - failure
>>>>>>>>>>>>>>>> processing document : software caused connection abort: socket
>>>>>>>>>>>>>>>> write error.
>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem,
>>>>>>>>>>>>>>>> please?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
