Hi Othman,

Yes, this shows that the jar we moved calls back into another jar, which will also need to be moved. *That* jar has yet another dependency too.
The list of jars is thus extended to include:

poi-ooxml-3.15.jar
dom4j-1.6.1.jar

Karl

On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <[email protected]> wrote:

> You will find attached the stack trace. My apologies for the bad quality of the image; I'm doing my best to send you the stack trace, as I don't have the right to send documents outside the company.
>
> Thank you for your time,
>
> Othman
>
> On Thu, 31 Aug 2017 at 15:16, Karl Wright <[email protected]> wrote:
>
>> Once again, I need a stack trace to diagnose what the problem is.
>>
>> Thanks,
>> Karl
>>
>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <[email protected]> wrote:
>>
>>> Oh, actually it didn't solve the problem. I looked into the log file and saw the following error:
>>>
>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
>>>
>>> Maybe another jar is missing?
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <[email protected]> wrote:
>>>
>>>> I have tried what you told me to do, and as you expected, the crawling resumed. How about the regular expressions? How can I make complex regular expressions in the job's Paths tab?
>>>>
>>>> Thank you very much for your help.
>>>>
>>>> Othman.
>>>>
>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]> wrote:
>>>>
>>>>> Ok, I will try it right away and let you know if it works.
>>>>>
>>>>> Othman.
>>>>>
>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Oh, and you may also need to edit your options.env files to include them in the classpath for startup.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> If you are amenable, there is another workaround you could try. Specifically:
>>>>>>>
>>>>>>> (1) Shut down all MCF processes.
>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>
>>>>>>> xmlbeans-2.6.0.jar
>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>
>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>
>>>>>>> Please let me know what happens.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>
>>>>>>>> One simple workaround is to use the external Tika server transformer rather than the embedded Tika Extractor. I'm still looking into why the jar is not being found.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, I'm actually using the latest binary version, and my job got stuck on that specific file. The job status is still Running; you can see it in the attached file. For your information, the job started yesterday.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Othman
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> It looks like a dependency of Apache POI is missing. I think we will need a ticket to address this, if you are indeed using the binary distribution.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm actually using the binary version. For security reasons, I can't send any files from my computer, so I have copied the stack trace and scanned it with my cellphone; I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict the crawling, and I don't think the '|' works in the specification. For instance, I would like to restrict the crawling to documents containing the word 'sound'. I proceed as follows: *(SON)*. The document name is in capital letters, and I noticed that the filter didn't take it into consideration.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Othman
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>
>>>>>>>>>>>> The way you restrict documents with the Windows Share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your response. I will start using zookeeper and will let you know if it works. I have another question: I need to apply some filters while crawling, because I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex allow using /i to ignore case?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> File-based sync is deprecated because people often have problems getting file permissions right and do not understand how to shut processes down cleanly; zookeeper is resilient against that. I highly recommend using zookeeper sync.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF is engineered not to hold files in memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper is different from file-based sync. I also need guidance on how to manage my PC's memory. How many GB should I allocate for the agents process of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>> (2) See whether you still get failures after that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must clean up before restarting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 'ES (lowercase) synapses' being the elasticsearch output connection. Moreover, the job uses Tika to extract metadata and a file system as the repository connection. During the job, I don't extract the content of the documents. I was wondering if the issue comes from elasticsearch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job when there's an error that looks like it might go away on retry but does not. It can be either on the repository side or on the output side.
>>>>>>>>>>>>>>>>>> If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale in France. I'm actually using your recent version, ManifoldCF 2.8. I'm working on an internal search engine; for this reason, I'm using manifoldcf to index documents on Windows shares. I encountered a serious problem while crawling 35K documents. Most of the time, when manifoldcf starts crawling a big document (19 MB, for example), it ends the job with the following error: repeated service interruptions - failure processing document: software caused connection abort: socket write error. Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Othman BELHAJ
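A note on the regex questions raised in the thread ('|' alternation, a /i flag, and the *(SON)* pattern): assuming the Paths-tab filters are evaluated as standard Java regular expressions (an assumption worth verifying against the connector's end-user documentation), there is no trailing /i modifier as in JavaScript or Perl. Case-insensitive matching comes from the inline (?i) flag or Pattern.CASE_INSENSITIVE, and '|' does work for alternation inside a group. A minimal sketch, with made-up paths:

```java
import java.util.regex.Pattern;

// Sketch: case-insensitive matching with Java regexes.
public class PathFilterSketch {
    public static void main(String[] args) {
        // (?i) turns on case-insensitive matching; '|' is alternation.
        Pattern filter = Pattern.compile("(?i).*(son|sound).*");

        System.out.println(filter.matcher("\\\\server\\share\\SON\\report.docx").matches());  // true
        System.out.println(filter.matcher("\\\\server\\share\\Sound\\notes.txt").matches());  // true
        System.out.println(filter.matcher("\\\\server\\share\\misc\\other.txt").matches());   // false

        // Equivalent, using the flag constant instead of the inline (?i):
        Pattern same = Pattern.compile(".*(son|sound).*", Pattern.CASE_INSENSITIVE);
        System.out.println(same.matcher("SON").matches());  // true
    }
}
```

This would explain the *(SON)* symptom: Java regex matching is case-sensitive by default, so an all-caps pattern silently misses lowercase names unless (?i) is prepended.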

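On the memory question in the thread ("Go"/"Mo" are the French abbreviations for GB/MB): as stated above, ManifoldCF streams document content rather than holding whole files in memory, so the defaults are more than enough for a 35K-file crawl. If you still want to raise the agents-process heap, the options.env files shipped with the multiprocess examples hold JVM options, one per line; a sketch with illustrative values only (check the exact file names and defaults in your distribution):

```
-Xms512m
-Xmx4096m
```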