Re: Question about ManifoldCF 2.8

Karl Wright Fri, 01 Sep 2017 02:20:03 -0700

I've made the specified jar environment changes here, and indeed committed
them to trunk and to the release-2.8 branch in preparation for creating a
point release to deal with this issue.  I ran a job locally that
successfully indexed a couple of Visio files as well as some xlsx, docx,
and PDF files.  The system did run out of memory while extracting a large
PDF, so I'll also be increasing the standard memory limits for the agents
process so that it works better with Tika extraction.


Thanks,
Karl
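
For readers applying the quoted steps below by hand, the jar moves in steps (1) through (5) can be sketched in Python. This is only an illustrative sketch under stated assumptions, not part of ManifoldCF: the function name `move_jars`, its `mcf_home` argument, and the `dry_run` flag are hypothetical, and the glob patterns are taken from the quoted steps.

```python
import shutil
from pathlib import Path

# Glob patterns taken from steps (1)-(5) of the quoted message.
# Everything else in this sketch (names, arguments) is hypothetical.
JAR_PATTERNS = [
    "poi*.jar",
    "dom4j*.jar",
    "commons-collections4*.jar",
    "xmlbeans*.jar",
    "curvesapi*.jar",
]


def move_jars(mcf_home, patterns=JAR_PATTERNS, dry_run=True):
    """Plan (and optionally perform) the jar moves from
    connector-common-lib to lib under the given install directory.

    Returns a list of (source, destination) Path pairs.
    """
    src = Path(mcf_home) / "connector-common-lib"
    dst = Path(mcf_home) / "lib"
    moves = []
    for pattern in patterns:
        for jar in sorted(src.glob(pattern)):
            moves.append((jar, dst / jar.name))
    if not dry_run:
        dst.mkdir(exist_ok=True)
        for old, new in moves:
            shutil.move(str(old), str(new))
    return moves
```

With `dry_run=True` (the default) the function only returns the planned moves, so they can be reviewed first. It does not stop or start any processes and does not edit options.env, so steps (0), (6), and (7) still have to be done by hand.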


On Thu, Aug 31, 2017 at 12:19 PM, Karl Wright <[email protected]> wrote:

> Please do the following:
>
> (0) Shut down all ManifoldCF processes.
> (1) Move poi*.jar from connector-common-lib to lib.
> (2) Move dom4j*.jar from connector-common-lib to lib.
> (3) Move commons-collections4*.jar from connector-common-lib to lib.
> (4) Move xmlbeans*.jar from connector-common-lib to lib.
> (5) Move curvesapi*.jar from connector-common-lib to lib.
> (6) Modify your options.env to include all of the jars you moved.
> (7) Start up all ManifoldCF processes.
> (8) If you still get stack traces, please send them to me.
>
> Karl
>
>
> On Thu, Aug 31, 2017 at 12:12 PM, Beelz Ryuzaki <[email protected]>
> wrote:
>
>> Hi Karl,
>>
>> By 'other place', do you mean the \lib directory? If so, then I
>> have already tried it and it didn't work.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 18:07, Karl Wright <[email protected]> wrote:
>>
>>> Hi Othman,
>>>
>>> I used the java dependency inspector to see what the issue is and it
>>> turns out that poi-ooxml.jar does refer back to poi.jar in the class that
>>> is failing.  So you will need to move poi-3.15.jar and
>>> commons-collections4-4.1.jar to the other place as well.
>>>
>>> Let's hope that finally fixes this issue.
>>>
>>> I'm very unhappy about the quality of the POI project code; it is
>>> definitely not using reasonable engineering practices, and I will be
>>> opening a ticket with them.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Thu, Aug 31, 2017 at 11:57 AM, Beelz Ryuzaki <[email protected]>
>>> wrote:
>>>
>>>> I'm using the file-based example, and I applied all the changes you told
>>>> me to make there. I'll try to install zookeeper and use the zookeeper
>>>> example. Will I need to do any configuration in order to run the
>>>> zookeeper example?
>>>>
>>>> Othman.
>>>>
>>>> On Thu, 31 Aug 2017 at 17:46, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Are you using the zookeeper example, or the file-based example?
>>>>>
>>>>> If these jars have all been moved, and the options.env includes them,
>>>>> then I have to conclude that Apache POI's pom.xml is incorrect too.  It
>>>>> will take a while to figure out what's missing that poi-ooxml.jar needs
>>>>> that is not listed.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Aug 31, 2017 at 11:39 AM, Beelz Ryuzaki <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> All the dependencies you mentioned have already been added in the
>>>>>> options.env.win file in the multiprocess-file-example directory.
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 17:33, Beelz Ryuzaki <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, I added it in the options.env.win file. Should it be the one in
>>>>>>> the multiprocess-zk-example directory or multiprocess-file-example?
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 17:30, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> It's not related at all to elasticsearch.
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Could it be a problem with the Elasticsearch version? I'm currently
>>>>>>>>> using 2.1.0, which is pretty old for this new version of ManifoldCF.
>>>>>>>>>
>>>>>>>>> Othman.
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I moved back both the jars you mentioned and a different error is
>>>>>>>>>> showing. You will find the stack trace attached.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Othman
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Aug 2017 at 17:09, Karl Wright <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've looked at the dependencies; you should not have moved
>>>>>>>>>>> poi-3.15.jar.  Please move that back, and 
>>>>>>>>>>> commons-collections4-4.1.jar too.
>>>>>>>>>>>
>>>>>>>>>>> You *will* need to move curvesapi-1.04.jar though.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If you include poi.jar, then all dependencies of poi.jar must
>>>>>>>>>>>> also be included.  This would mean that curvesapi-1.04.jar and
>>>>>>>>>>>> commons-collections4-4.1.jar should also be included.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I added the two jars that you mentioned and another one:
>>>>>>>>>>>>> poi-3.15.jar. Unfortunately, there is another error showing. This
>>>>>>>>>>>>> time, it concerns Excel files. You will find the stack trace attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, this shows that the jar we moved calls back into another
>>>>>>>>>>>>>> jar, which will also need to be moved.  *That* jar has yet another
>>>>>>>>>>>>>> dependency too.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The list of jars is thus extended to include:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> poi-ooxml-3.15.jar
>>>>>>>>>>>>>> dom4j-1.6.1.jar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You will find the stack trace attached. My apologies for the
>>>>>>>>>>>>>>> bad quality of the image; I'm doing my best to send you the stack
>>>>>>>>>>>>>>> trace, as I don't have the right to send documents outside the
>>>>>>>>>>>>>>> company.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your time,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Once again, I need a stack trace to diagnose what the
>>>>>>>>>>>>>>>> problem is.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Oh, actually it didn't solve the problem. I looked into
>>>>>>>>>>>>>>>>> the log file and saw the following error:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>>>>>>>>>>>> java.lang.NoClassDefFoundError:
>>>>>>>>>>>>>>>>> org/apache/poi/POIXMLTypeLoader.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maybe another jar is missing ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have tried what you told me to do and, as you expected, the
>>>>>>>>>>>>>>>>>> crawl resumed. How about the regular expressions? How can I
>>>>>>>>>>>>>>>>>> write complex regular expressions in the job's Paths tab?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you very much for your help.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ok, I will try it right away and let you know if it
>>>>>>>>>>>>>>>>>>> works.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Oh, and you also may need to edit your options.env
>>>>>>>>>>>>>>>>>>>> files to include them in the classpath for startup.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If you are amenable, there is another workaround you
>>>>>>>>>>>>>>>>>>>>> could try.  Specifically:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>>>>>>>>>>>> (2) Move the following two files from
>>>>>>>>>>>>>>>>>>>>> connector-common-lib to lib:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> One simple workaround is to use the external Tika
>>>>>>>>>>>>>>>>>>>>>> server transformer rather than the embedded Tika
>>>>>>>>>>>>>>>>>>>>>> Extractor.  I'm still looking into why the jar is not
>>>>>>>>>>>>>>>>>>>>>> being found.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Yes, I'm actually using the latest binary version,
>>>>>>>>>>>>>>>>>>>>>>> and my job got stuck on that specific file.
>>>>>>>>>>>>>>>>>>>>>>> The job status is still Running. You can see it in
>>>>>>>>>>>>>>>>>>>>>>> the attached file. For your information, the job
>>>>>>>>>>>>>>>>>>>>>>> started yesterday.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>>>>>>>>>>>> I think we will need a ticket to address this, if
>>>>>>>>>>>>>>>>>>>>>>>> you are indeed using the binary distribution.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'm actually using the binary version. For
>>>>>>>>>>>>>>>>>>>>>>>>> security reasons, I can't send any files from my
>>>>>>>>>>>>>>>>>>>>>>>>> computer. I have copied the stack trace and scanned
>>>>>>>>>>>>>>>>>>>>>>>>> it with my cellphone. I hope it will be helpful.
>>>>>>>>>>>>>>>>>>>>>>>>> Meanwhile, I have read the documentation about how
>>>>>>>>>>>>>>>>>>>>>>>>> to restrict the crawl, and I don't think the '|'
>>>>>>>>>>>>>>>>>>>>>>>>> works as specified. For instance, I would like to
>>>>>>>>>>>>>>>>>>>>>>>>> restrict the crawl to the documents that contain the
>>>>>>>>>>>>>>>>>>>>>>>>> word 'sound'. I proceed as follows: *(SON)*. The
>>>>>>>>>>>>>>>>>>>>>>>>> document is in capital letters, and I noticed that
>>>>>>>>>>>>>>>>>>>>>>>>> it wasn't taken into consideration.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The way you restrict documents with the windows
>>>>>>>>>>>>>>>>>>>>>>>>>> share connector is by specifying information on the
>>>>>>>>>>>>>>>>>>>>>>>>>> "Paths" tab in jobs that crawl windows shares.
>>>>>>>>>>>>>>>>>>>>>>>>>> There is end-user documentation, both online and
>>>>>>>>>>>>>>>>>>>>>>>>>> distributed with all binary distributions, that
>>>>>>>>>>>>>>>>>>>>>>>>>> describes how to do this.
>>>>>>>>>>>>>>>>>>>>>>>>>> Have you found it?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for your response. I will start using
>>>>>>>>>>>>>>>>>>>>>>>>>>> zookeeper and I will let you know if it works. I
>>>>>>>>>>>>>>>>>>>>>>>>>>> have another question. I need to apply some filters
>>>>>>>>>>>>>>>>>>>>>>>>>>> while crawling: I don't want to crawl some files
>>>>>>>>>>>>>>>>>>>>>>>>>>> and some folders. Could you give me an example of
>>>>>>>>>>>>>>>>>>>>>>>>>>> how to use the regex? Does the regex allow /i to
>>>>>>>>>>>>>>>>>>>>>>>>>>> ignore case?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people
>>>>>>>>>>>>>>>>>>>>>>>>>>>> often have problems with getting file permissions
>>>>>>>>>>>>>>>>>>>>>>>>>>>> right, and they do not understand how to shut
>>>>>>>>>>>>>>>>>>>>>>>>>>>> processes down cleanly, and zookeeper is resilient
>>>>>>>>>>>>>>>>>>>>>>>>>>>> against that.  I highly recommend using zookeeper
>>>>>>>>>>>>>>>>>>>>>>>>>>>> sync.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into
>>>>>>>>>>>>>>>>>>>>>>>>>>>> memory so you do not need huge amounts of memory.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The default values are more than enough for 35,000
>>>>>>>>>>>>>>>>>>>>>>>>>>>> files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki
>>>>>>>>>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> know how zookeeper is different from file-based
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sync. I also need guidance on how to manage my
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PC's memory. How many GB should I allocate for
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the start-agent of ManifoldCF? Is 4 GB enough to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> crawl 35K files?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and that's interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> file-based sync.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> after that.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you for your quick response. I have
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> looked into the ManifoldCF log file and extracted
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the following warnings:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> \multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (Lowercase) Synapses.lock' failed : Access is 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> denied.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> full. Shutting down process; locks may be left 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> dangling. You must cleanup
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> before restarting.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ES (lowercase) synapses is the elasticsearch
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> output connection. Moreover, the job uses Tika to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> extract metadata and a file system as a repository
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> connection. During the job, I don't extract the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> content of the documents. I was wondering if the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue comes from elasticsearch?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that looks like it might go away on retry, but
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not.  It can be either on the repository
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> side or on the output side.  If you look at the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Simple History in the UI, or at the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> manifoldcf.log file, you should be able to get a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> better sense of what went wrong.  Without further
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> from Société Générale in France. I'm currently
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using your recent version of ManifoldCF, 2.8.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm working on an internal search engine. For
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this reason, I'm using ManifoldCF to index
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> documents on Windows shares. I encountered a
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> serious problem while crawling 35K documents.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Most of the time, when ManifoldCF starts
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> crawling a large document (19 MB, for example),
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it ends the job with the following error:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> repeated service interruptions - failure
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> processing document : software caused connection
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this problem, please?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>
>>>
>
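
The NoClassDefFoundError for org/apache/poi/POIXMLTypeLoader reported in the thread above raises the practical question of which jar in a given directory supplies a class. A minimal Python sketch of one way to check is shown here; it relies only on the fact that jar files are zip archives. The function name `find_class` is hypothetical, this is not the dependency inspector mentioned in the thread, and it does not follow transitive dependencies or say anything about which class loader will see the jar.

```python
import zipfile
from pathlib import Path


def find_class(jar_dir, class_name):
    """Return the names of jars directly under jar_dir that contain class_name.

    class_name may be written with dots or slashes, e.g.
    "org.apache.poi.POIXMLTypeLoader" or "org/apache/poi/POIXMLTypeLoader".
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid zip archives.
            continue
    return hits
```

Running it once against connector-common-lib and once against lib shows on which side a class currently lives, which can help when deciding which jars have to be moved together.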
