Oh, actually it didn't solve the problem. I looked into the log file and saw the following error:
Error tossed : org/apache/poi/POIXMLTypeLoader
java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader

Maybe another jar is missing?

Othman.

On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <[email protected]> wrote:

> I have tried what you told me to do and, as you expected, the crawling
> resumed. How about the regular expressions? How can I write complex
> regular expressions in the job's Paths tab?
>
> Thank you very much for your help.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]> wrote:
>
>> Ok, I will try it right away and let you know if it works.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]> wrote:
>>
>>> Oh, and you also may need to edit your options.env files to include
>>> them in the classpath for startup.
>>>
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> If you are amenable, there is another workaround you could try.
>>>> Specifically:
>>>>
>>>> (1) Shut down all MCF processes.
>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>
>>>> xmlbeans-2.6.0.jar
>>>> poi-ooxml-schemas-3.15.jar
>>>>
>>>> (3) Restart everything and see if your crawl resumes.
>>>>
>>>> Please let me know what happens.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>
>>>>> One simple workaround is to use the external Tika server transformer
>>>>> rather than the embedded Tika Extractor. I'm still looking into why
>>>>> the jar is not being found.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>
>>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>>> stuck on that specific file.
>>>>>> The job status is still Running. You can see it in the attached file.
>>>>>> For your information, the job started yesterday.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Othman
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>> I think we will need a ticket to address this, if you are indeed
>>>>>>> using the binary distribution.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm actually using the binary version. For security reasons, I
>>>>>>>> can't send any files from my computer. I have copied the stack
>>>>>>>> trace and scanned it with my cellphone. I hope it will be helpful.
>>>>>>>> Meanwhile, I have read the documentation about how to restrict the
>>>>>>>> crawling and I don't think the '|' works in the specified. For
>>>>>>>> instance, I would like to restrict the crawling to the documents
>>>>>>>> that contain the word 'sound'. I proceed as follows: *(SON)*. The
>>>>>>>> document is in capital letters and I noticed that it didn't take
>>>>>>>> it into consideration.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Othman
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Othman,
>>>>>>>>>
>>>>>>>>> The way you restrict documents with the windows share connector
>>>>>>>>> is by specifying information on the "Paths" tab in jobs that
>>>>>>>>> crawl windows shares. There is end-user documentation both
>>>>>>>>> online and distributed with all binary distributions that
>>>>>>>>> describes how to do this. Have you found it?
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Karl,
>>>>>>>>>>
>>>>>>>>>> Thank you for your response. I will start using zookeeper and I
>>>>>>>>>> will let you know if it works. I have another question to ask.
>>>>>>>>>> Actually, I need to apply some filters while crawling. I don't
>>>>>>>>>> want to crawl some files and some folders. Could you give me an
>>>>>>>>>> example of how to use the regex? Does the regex allow the use
>>>>>>>>>> of /i to ignore case?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Othman
>>>>>>>>>>
>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>
>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>> problems with getting file permissions right, and they do not
>>>>>>>>>>> understand how to shut processes down cleanly, and zookeeper
>>>>>>>>>>> is resilient against that. I highly recommend using zookeeper
>>>>>>>>>>> sync.
>>>>>>>>>>>
>>>>>>>>>>> ManifoldCF is engineered to not put files into memory so you
>>>>>>>>>>> do not need huge amounts of memory. The default values are
>>>>>>>>>>> more than enough for 35,000 files, which is a pretty small job
>>>>>>>>>>> for ManifoldCF.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how
>>>>>>>>>>>> zookeeper is different from file-based sync. I also need
>>>>>>>>>>>> guidance on how to manage my PC's memory. How many GB should
>>>>>>>>>>>> I allocate for the start-agent of ManifoldCF? Is 4 GB enough
>>>>>>>>>>>> to crawl 35K files?
>>>>>>>>>>>>
>>>>>>>>>>>> Othman.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Mr Karl for your quick response. I have looked
>>>>>>>>>>>>>> into the ManifoldCF log file and extracted the following
>>>>>>>>>>>>>> warnings:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>>>>>>>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting
>>>>>>>>>>>>>> down process; locks may be left dangling. You must cleanup
>>>>>>>>>>>>>> before restarting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ES (lowercase) synapses being the elasticsearch output
>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata
>>>>>>>>>>>>>> and a file system as a repository connection. During the
>>>>>>>>>>>>>> job, I don't extract the content of the documents. I was
>>>>>>>>>>>>>> wondering if the issue comes from elasticsearch?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks
>>>>>>>>>>>>>>> like it might go away on retry, but does not. It can be
>>>>>>>>>>>>>>> either on the repository side or on the output side. If
>>>>>>>>>>>>>>> you look at the Simple History in the UI, or at the
>>>>>>>>>>>>>>> manifoldcf.log file, you should be able to get a better
>>>>>>>>>>>>>>> sense of what went wrong. Without further information, I
>>>>>>>>>>>>>>> can't say any more.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from Société
>>>>>>>>>>>>>>>> Générale in France. I'm actually using your recent
>>>>>>>>>>>>>>>> version of ManifoldCF, 2.8. I'm working on an internal
>>>>>>>>>>>>>>>> search engine. For this reason, I'm using ManifoldCF to
>>>>>>>>>>>>>>>> index documents on windows shares. I encountered a
>>>>>>>>>>>>>>>> serious problem while crawling 35K documents. Most of the
>>>>>>>>>>>>>>>> time, when ManifoldCF starts crawling a large document
>>>>>>>>>>>>>>>> (19 MB for example), it ends the job with the following
>>>>>>>>>>>>>>>> error: repeated service interruptions - failure
>>>>>>>>>>>>>>>> processing document : software caused connection abort:
>>>>>>>>>>>>>>>> socket write error.
>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this problem,
>>>>>>>>>>>>>>>> please?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Othman BELHAJ
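
On the quoted question about /i and '|': a minimal sketch, assuming the pattern is evaluated by java.util.regex. Whether the job's Paths tab accepts full Java regular expressions is not confirmed anywhere in this thread, so treat this only as an illustration of the Java syntax. Java has no trailing /i modifier; case-insensitive matching is requested with the inline flag (?i) (or Pattern.CASE_INSENSITIVE), and '|' is ordinary alternation. The class name PathFilterSketch and the sample paths are hypothetical.

```java
import java.util.regex.Pattern;

public class PathFilterSketch {
    // Hypothetical filter: (?i) makes the whole pattern case-insensitive,
    // and (son|sound) accepts either word anywhere in the path.
    private static final Pattern SOUND = Pattern.compile("(?i).*(son|sound).*");

    // Returns true when the full path matches the pattern above.
    public static boolean matches(String path) {
        return SOUND.matcher(path).matches();
    }

    public static void main(String[] args) {
        // Sample paths are made up for illustration.
        System.out.println(matches("/share/docs/RAPPORT_SON_2017.docx"));
        System.out.println(matches("/share/docs/budget.xlsx"));
    }
}
```

Trying a candidate pattern this way, outside the crawler, separates "the expression is wrong" from "the tab does not interpret it as a regular expression".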

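
On the NoClassDefFoundError at the top of this message: one way to check whether a class is visible on a given classpath is a small standalone probe like the one below. This is a sketch, not part of ManifoldCF; the class name ClassCheckSketch is hypothetical, and the probe only says something useful when it is run with the same classpath as the process that failed. Note that the log shows the name with slashes, while Class.forName expects dots.

```java
public class ClassCheckSketch {
    // Returns true when the named class can be located by this class's loader.
    // The class is not initialized (second argument is false).
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClassCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, so both failure modes land here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the error; pass another name as an argument.
        String name = args.length > 0 ? args[0] : "org.apache.poi.POIXMLTypeLoader";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```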