Re: Question about ManifoldCF 2.8

Karl Wright Thu, 31 Aug 2017 08:30:43 -0700

It's not related at all to elasticsearch.
Karl


On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <[email protected]> wrote:

> Could it be a problem with the Elasticsearch version? I'm currently using
> 2.1.0, which is pretty old for this new version of ManifoldCF.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki <[email protected]> wrote:
>
>> I moved back both the jars you mentioned, and a different error is showing.
>> You will find the stack trace attached.
>>
>> Thanks,
>> Othman
>>
>> On Thu, 31 Aug 2017 at 17:09, Karl Wright <[email protected]> wrote:
>>
>>> I've looked at the dependencies; you should not have moved
>>> poi-3.15.jar.  Please move that back, and commons-collections4-4.1.jar too.
>>>
>>> You *will* need to move curvesapi-1.04.jar though.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <[email protected]>
>>> wrote:
>>>
>>>> If you include poi.jar, then all dependencies of poi.jar must also be
>>>> included.  This would mean that curvesapi-1.04.jar and
>>>> commons-collections4-4.1.jar should also be included.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I added the two jars that you mentioned and another one:
>>>>> poi-3.15.jar. Unfortunately, there is another error showing. This time,
>>>>> it concerns Excel files. You will find the stack trace attached.
>>>>>
>>>>> Othman.
>>>>>
>>>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Othman,
>>>>>>
>>>>>> Yes, this shows that the jar we moved calls back into another jar,
>>>>>> which will also need to be moved.  *That* jar has yet another dependency
>>>>>> too.
>>>>>>
>>>>>> The list of jars is thus extended to include:
>>>>>>
>>>>>> poi-ooxml-3.15.jar
>>>>>> dom4j-1.6.1.jar
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> You will find attached the stack trace. My apologies for the bad
>>>>>>> quality of the image, I'm doing my best to send you the stack trace as I
>>>>>>> don't have the right to send documents outside the company.
>>>>>>>
>>>>>>> Thank you for your time,
>>>>>>>
>>>>>>> Othman
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Oh, actually it didn't solve the problem. I looked into the log
>>>>>>>>> file and saw the following error:
>>>>>>>>>
>>>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>>>>
>>>>>>>>> Maybe another jar is missing ?
>>>>>>>>>
>>>>>>>>> Othman.
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have tried what you told me to do and, as you expected, the
>>>>>>>>>> crawling resumed. How about the regular expressions? How can I
>>>>>>>>>> write complex regular expressions in the job's Paths tab?
>>>>>>>>>>
>>>>>>>>>> Thank you very much for your help.
>>>>>>>>>>
>>>>>>>>>> Othman.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>>>>
>>>>>>>>>>> Othman.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Oh, and you also may need to edit your options.env files to
>>>>>>>>>>>> include them in the classpath for startup.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you are amenable, there is another workaround you could
>>>>>>>>>>>>> try.  Specifically:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>>>> (2) Move the following two files from connector-common-lib to
>>>>>>>>>>>>> lib:
>>>>>>>>>>>>>
>>>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>>>
>>>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>>>>>> transformer rather than the embedded Tika Extractor.  I'm
>>>>>>>>>>>>>> still looking into why the jar is not being found.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my
>>>>>>>>>>>>>>> job got stuck on that specific file.
>>>>>>>>>>>>>>> The job status is still Running. You can see it in the
>>>>>>>>>>>>>>> attached file. For your information, the job started yesterday.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm actually using the binary version. For security
>>>>>>>>>>>>>>>>> reasons, I can't send any files from my computer. I have
>>>>>>>>>>>>>>>>> copied the stack trace and scanned it with my cellphone. I
>>>>>>>>>>>>>>>>> hope it will be helpful. Meanwhile, I have read the
>>>>>>>>>>>>>>>>> documentation about how to restrict the crawling, and I
>>>>>>>>>>>>>>>>> don't think the '|' works in the specified expression. For
>>>>>>>>>>>>>>>>> instance, I would like to restrict the crawling to the
>>>>>>>>>>>>>>>>> documents that contain the word 'sound'. I proceed as
>>>>>>>>>>>>>>>>> follows: *(SON)* . The document is in capital letters, and
>>>>>>>>>>>>>>>>> I noticed that it was not taken into consideration.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The way you restrict documents with the windows share
>>>>>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab
>>>>>>>>>>>>>>>>>> in jobs that crawl windows shares.  There is end-user
>>>>>>>>>>>>>>>>>> documentation, both online and distributed with all binary
>>>>>>>>>>>>>>>>>> distributions, that describes how to do this.
>>>>>>>>>>>>>>>>>> Have you found it?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for your response. I will start using
>>>>>>>>>>>>>>>>>>> Zookeeper and I will let you know if it works. I have
>>>>>>>>>>>>>>>>>>> another question to ask. I need to apply some filters
>>>>>>>>>>>>>>>>>>> while crawling: I don't want to crawl some files and some
>>>>>>>>>>>>>>>>>>> folders. Could you give me an example of how to use the
>>>>>>>>>>>>>>>>>>> regex? Does the regex syntax allow /i to ignore case?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>>>>>>>>>> problems with getting file permissions right, and they
>>>>>>>>>>>>>>>>>>>> do not understand how to shut processes down cleanly,
>>>>>>>>>>>>>>>>>>>> and zookeeper is resilient against that.  I highly
>>>>>>>>>>>>>>>>>>>> recommend using zookeeper sync.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory
>>>>>>>>>>>>>>>>>>>> so you do not need huge amounts of memory.  The default
>>>>>>>>>>>>>>>>>>>> values are more than enough for 35,000 files, which is
>>>>>>>>>>>>>>>>>>>> a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm actually not using Zookeeper. I want to know how
>>>>>>>>>>>>>>>>>>>>> Zookeeper is different from file-based sync. I also
>>>>>>>>>>>>>>>>>>>>> need guidance on how to manage my PC's memory. How many
>>>>>>>>>>>>>>>>>>>>> GB should I allocate for the start-agent of ManifoldCF?
>>>>>>>>>>>>>>>>>>>>> Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have
>>>>>>>>>>>>>>>>>>>>>>> looked into the ManifoldCF log file and extracted the
>>>>>>>>>>>>>>>>>>>>>>> following warnings:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\syncharea\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock'
>>>>>>>>>>>>>>>>>>>>>>> failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling.
>>>>>>>>>>>>>>>>>>>>>>> You must cleanup before restarting.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> "ES (Lowercase) Synapses" is the Elasticsearch
>>>>>>>>>>>>>>>>>>>>>>> output connection. Moreover, the job uses Tika to
>>>>>>>>>>>>>>>>>>>>>>> extract metadata and a file system as a repository
>>>>>>>>>>>>>>>>>>>>>>> connection. During the job, I don't extract the
>>>>>>>>>>>>>>>>>>>>>>> content of the documents. I was wondering if the
>>>>>>>>>>>>>>>>>>>>>>> issue comes from Elasticsearch?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that
>>>>>>>>>>>>>>>>>>>>>>>> looks like it might go away on retry, but does not.
>>>>>>>>>>>>>>>>>>>>>>>> It can be either on the repository side or on the
>>>>>>>>>>>>>>>>>>>>>>>> output side.  If you look at the Simple History in
>>>>>>>>>>>>>>>>>>>>>>>> the UI, or at the manifoldcf.log file, you should be
>>>>>>>>>>>>>>>>>>>>>>>> able to get a better sense of what went wrong.
>>>>>>>>>>>>>>>>>>>>>>>> Without further information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from
>>>>>>>>>>>>>>>>>>>>>>>>> Société Générale in France. I'm currently using
>>>>>>>>>>>>>>>>>>>>>>>>> your recent version, ManifoldCF 2.8. I'm working
>>>>>>>>>>>>>>>>>>>>>>>>> on an internal search engine. For this reason, I'm
>>>>>>>>>>>>>>>>>>>>>>>>> using ManifoldCF to index documents on Windows
>>>>>>>>>>>>>>>>>>>>>>>>> shares. I encountered a serious problem while
>>>>>>>>>>>>>>>>>>>>>>>>> crawling 35K documents. Most of the time, when
>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF starts crawling a large document (19 MB,
>>>>>>>>>>>>>>>>>>>>>>>>> for example), it ends the job with the following
>>>>>>>>>>>>>>>>>>>>>>>>> error: repeated service interruptions - failure
>>>>>>>>>>>>>>>>>>>>>>>>> processing document : software caused connection
>>>>>>>>>>>>>>>>>>>>>>>>> abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
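
For anyone following the jar-moving workaround in this thread, here is a minimal Python sketch of the move from connector-common-lib to lib. The jar list is collected from the messages above (the two original jars, the two added later, and curvesapi-1.04.jar); the thread does not confirm that this set fully resolves the errors, so treat it as an assumption. The function name `move_jars` and the `dry_run` flag are hypothetical, made up for illustration. As the thread says, shut down all MCF processes before moving anything, and the options.env files may also need editing afterwards.

```python
import shutil
from pathlib import Path

# Jar names as mentioned in the thread. Whether this list is complete
# is NOT confirmed there; it is an assumption of this sketch.
JARS_TO_MOVE = [
    "xmlbeans-2.6.0.jar",
    "poi-ooxml-schemas-3.15.jar",
    "poi-ooxml-3.15.jar",
    "dom4j-1.6.1.jar",
    "curvesapi-1.04.jar",
]


def move_jars(src_dir, dst_dir, names=JARS_TO_MOVE, dry_run=True):
    """Move the named jars from src_dir to dst_dir.

    Returns (moved, missing): names that were found (and moved unless
    dry_run is set), and names that were not present in src_dir.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    moved, missing = [], []
    for name in names:
        source = src / name
        if not source.is_file():
            missing.append(name)
            continue
        if not dry_run:
            # Create the target directory on demand, then move the file.
            dst.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(dst / name))
        moved.append(name)
    return moved, missing
```

Run it first with the default `dry_run=True`, for example `move_jars("connector-common-lib", "lib")`, to see which jars would be moved and which are missing, and only then call it again with `dry_run=False`.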
