If you are amenable, there is another workaround you could try. Specifically:
(1) Shut down all MCF processes.
(2) Move the following two files from connector-common-lib to lib: xmlbeans-2.6.0.jar, poi-ooxml-schemas-3.15.jar
(3) Restart everything and see if your crawl resumes.

Please let me know what happens.

Karl

On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]> wrote:

> I created a ticket for this: CONNECTORS-1450.
>
> One simple workaround is to use the external Tika server transformer rather than the embedded Tika extractor. I'm still looking into why the jar is not being found.
>
> Karl
>
> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]> wrote:
>
>> Yes, I'm actually using the latest binary version, and my job got stuck on that specific file. The job status is still "Running"; you can see it in the attached file. For your information, the job started yesterday.
>>
>> Thanks,
>> Othman
>>
>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:
>>
>>> It looks like a dependency of Apache POI is missing. I think we will need a ticket to address this, if you are indeed using the binary distribution.
>>>
>>> Thanks!
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>
>>>> I'm actually using the binary version. For security reasons, I can't send any files from my computer, so I have copied the stack trace and scanned it with my cellphone; I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict crawling, and I don't think the '|' works in the specification. For instance, I would like to restrict the crawl to documents that contain the word 'sound'. I proceed as follows: *(SON)*. The document name is in capital letters, and I noticed that the match didn't take it into consideration.
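A note on the case-sensitivity question above: ManifoldCF is a Java application, and Java regular expressions have no trailing /i modifier; case-insensitive matching is requested either with the inline (?i) flag inside the pattern or with Pattern.CASE_INSENSITIVE at compile time. A minimal sketch (the file name is made up for illustration):

```java
import java.util.regex.Pattern;

public class RegexCaseDemo {
    public static void main(String[] args) {
        // Hypothetical document name, upper-case like the one in the thread.
        String fileName = "RAPPORT_SON_2017.docx";

        // Case-sensitive: a lower-case pattern does not match the upper-case name.
        System.out.println(Pattern.compile(".*son.*").matcher(fileName).matches());  // false

        // Inline (?i) flag makes the whole expression case-insensitive.
        System.out.println(Pattern.compile("(?i).*son.*").matcher(fileName).matches());  // true

        // Equivalent: compile with the CASE_INSENSITIVE flag.
        System.out.println(Pattern.compile(".*son.*", Pattern.CASE_INSENSITIVE)
                .matcher(fileName).matches());  // true
    }
}
```

Whether a given connector's include/exclude fields accept full Java regex syntax (including inline flags) varies by connector, so this is worth testing against the specific job configuration.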
>>>> Thanks,
>>>> Othman
>>>>
>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Othman,
>>>>>
>>>>> The way you restrict documents with the Windows Share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>
>>>>>> Hello Karl,
>>>>>>
>>>>>> Thank you for your response. I will start using ZooKeeper and will let you know if it works. I have another question: I need to apply some filters while crawling, because I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex allow /i to ignore case?
>>>>>>
>>>>>> Thanks,
>>>>>> Othman
>>>>>>
>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Beelz,
>>>>>>>
>>>>>>> File-based sync is deprecated because people often have problems getting file permissions right and do not understand how to shut processes down cleanly; ZooKeeper is resilient against that. I highly recommend using ZooKeeper sync.
>>>>>>>
>>>>>>> ManifoldCF is engineered not to pull whole files into memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm actually not using ZooKeeper. I want to know: how is ZooKeeper different from file-based sync?
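For reference on the ZooKeeper question above: in the multiprocess setup, switching from file-based sync to ZooKeeper sync is a configuration change in properties.xml (the binary distribution ships a multiprocess-zk-example illustrating it). A sketch of the relevant properties; the exact property names and the localhost connect string are assumptions to verify against the example shipped with your distribution:

```xml
<!-- properties.xml: coordinate locks via a ZooKeeper ensemble
     instead of a shared synchronization directory on disk. -->
<property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
<property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>
```

The practical difference: file-based sync coordinates processes through lock files in a shared directory (which is what fails with "Access is denied" if permissions are wrong, and leaves dangling locks after an unclean shutdown), while ZooKeeper holds the locks in its own service and releases them automatically when a process's session ends.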
>>>>>>>> I also need guidance on how to manage my PC's memory. How many GB should I allocate for the ManifoldCF start-agents process? Is 4 GB enough to crawl 35K files?
>>>>>>>>
>>>>>>>> Othman.
>>>>>>>>
>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.
>>>>>>>>>
>>>>>>>>> I would suggest two things:
>>>>>>>>>
>>>>>>>>> (1) Use ZooKeeper for sync instead of file-based sync.
>>>>>>>>> (2) See whether you still get failures after that.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>
>>>>>>>>>> Thank you for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>
>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>>>>>>
>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must clean up before restarting.
>>>>>>>>>>
>>>>>>>>>> "ES (Lowercase) Synapses" is the Elasticsearch output connection. The job uses Tika to extract metadata and a file system repository connection. During the job, I don't extract the content of the documents. I was wondering if the issue comes from Elasticsearch?
>>>>>>>>>>
>>>>>>>>>> Othman.
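On the heap-sizing question above: the multiprocess example scripts read JVM options from a small options file next to the start scripts (options.env.win on Windows, options.env.unix elsewhere), where the heap is set with standard JVM flags. Given Karl's point that ManifoldCF streams documents rather than buffering them in memory, a moderate heap is typically plenty for 35K files; the numbers below are illustrative, not a recommendation:

```
-Xms512m
-Xmx1024m
```

If the agents process ever does hit OutOfMemoryError, raising -Xmx there is the knob to turn; 4 GB is far more than this workload should need.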
>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>
>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like it might go away on retry but does not. It can be either on the repository side or on the output side. If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale in France. I'm currently using your recent version, ManifoldCF 2.8, to build an internal search engine, indexing documents on Windows shares. I have encountered a serious problem while crawling 35K documents. Most of the time, when ManifoldCF starts crawling a large document (19 MB, for example), it ends the job with the following error: "repeated service interruptions - failure processing document: software caused connection abort: socket write error". Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>
>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Othman BELHAJ
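Circling back to the workaround at the top of the thread: step (2) is an ordinary file move from one directory of the binary distribution to another. A sketch in Java (the relative paths assume you run it from the MCF install root, with all processes stopped first as step (1) requires):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveJars {
    // Move the named jars from src to dst, replacing any existing copies.
    static void moveJars(Path src, Path dst, String... jars) throws IOException {
        for (String jar : jars) {
            Files.move(src.resolve(jar), dst.resolve(jar), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Relocate the POI dependencies so the main classpath can see them.
        moveJars(Paths.get("connector-common-lib"), Paths.get("lib"),
                 "xmlbeans-2.6.0.jar", "poi-ooxml-schemas-3.15.jar");
    }
}
```

Doing it by hand in Explorer or with `move` on the command line is exactly equivalent; the point is only that the two jars end up in lib before the processes are restarted.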
