Ok, I will try it right away and let you know if it works. Othman.
On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]> wrote:

Oh, and you may also need to edit your options.env files to include them in the classpath for startup.

Karl

On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]> wrote:

If you are amenable, there is another workaround you could try. Specifically:

(1) Shut down all MCF processes.
(2) Move the following two files from connector-common-lib to lib:

    xmlbeans-2.6.0.jar
    poi-ooxml-schemas-3.15.jar

(3) Restart everything and see if your crawl resumes.

Please let me know what happens.

Karl
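For reference, a minimal sketch of step (2) on Windows (the D:\ lock paths later in this thread suggest a Windows install). The directory name here is illustrative; run the moves from wherever you unpacked the binary distribution:

    cd D:\path\to\apache-manifoldcf-2.8
    move connector-common-lib\xmlbeans-2.6.0.jar lib\
    move connector-common-lib\poi-ooxml-schemas-3.15.jar lib\

The same two moves with mv work on Unix-like systems.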
On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]> wrote:

I created a ticket for this: CONNECTORS-1450.

One simple workaround is to use the external Tika server transformer rather than the embedded Tika Extractor. I'm still looking into why the jar is not being found.

Karl

On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]> wrote:

Yes, I'm actually using the latest binary version, and my job got stuck on that specific file. The job status is still Running. You can see it in the attached file. For your information, the job started yesterday.

Thanks,

Othman

On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:

It looks like a dependency of Apache POI is missing. I think we will need a ticket to address this, if you are indeed using the binary distribution.

Thanks!
Karl

On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:

I'm actually using the binary version. For security reasons, I can't send any files from my computer. I have copied the stack trace and scanned it with my cellphone. I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict the crawling, and I don't think the '|' works in the specification. For instance, I would like to restrict the crawling to documents that contain the word 'sound'. I proceed as follows: *(SON)*. The document name is in capital letters, and I noticed that it didn't take that into consideration.

Thanks,
Othman

On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:

Hi Othman,

The way you restrict documents with the Windows share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?

Karl

On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:

Hello Karl,

Thank you for your response. I will start using ZooKeeper and will let you know if it works. I have another question: I need to apply some filters while crawling, because I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex allow using /i to ignore case?

Thanks,
Othman
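A note on the /i question above: the trailing /i modifier is Perl and JavaScript syntax, and Java-style regular expressions (ManifoldCF is a Java application) do not accept it; the equivalent is the embedded (?i) flag at the start of the expression. A minimal sketch, assuming the connector's specification fields take standard Java regexes (the file names are made up for illustration):

    import java.util.regex.Pattern;

    public class CaseInsensitiveMatch {
        public static void main(String[] args) {
            // (?i) turns on case-insensitive matching for the rest of the
            // expression, so "son" also matches "SON", "Son", and so on.
            Pattern p = Pattern.compile("(?i).*son.*");
            System.out.println(p.matcher("RAPPORT_SON.docx").matches());  // true
            System.out.println(p.matcher("rapport_son.docx").matches());  // true
            System.out.println(p.matcher("budget.xlsx").matches());       // false
        }
    }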
On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:

Hi Beelz,

File-based sync is deprecated because people often have problems getting file permissions right and do not understand how to shut processes down cleanly; ZooKeeper is resilient against that. I highly recommend using ZooKeeper sync.

ManifoldCF is engineered not to pull files into memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.

Thanks,
Karl

On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:

I'm actually not using ZooKeeper. I want to know: how is ZooKeeper different from file-based sync? I also need guidance on how to manage my PC's memory. How many GB should I allocate for the ManifoldCF agents process? Is 4 GB enough to crawl 35K files?

Othman.

On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:

Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.

I would suggest two things:

(1) Use ZooKeeper for sync instead of file-based sync.
(2) Have a look at whether you still get failures after that.

Thanks,
Karl

On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:

Hi Mr Karl,

Thank you for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:

- Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\syncharea\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed : Access is denied.

- Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must cleanup before restarting.

'ES (Lowercase) Synapses' is the Elasticsearch output connection. Moreover, the job uses Tika to extract metadata, and a file system as a repository connection. During the job, I don't extract the content of the documents. I was wondering if the issue comes from Elasticsearch?

Othman.

On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:

Hi Othman,

ManifoldCF aborts a job if there's an error that looks like it might go away on retry, but does not. It can be either on the repository side or on the output side. If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.

Thanks,
Karl

On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:

Hello,

I'm Othman Belhaj, a software engineer from Société Générale in France. I'm currently using the recent version of ManifoldCF, 2.8, to build an internal search engine; for this reason, I'm using ManifoldCF to index documents on Windows shares. I encountered a serious problem while crawling 35K documents. Most of the time, when ManifoldCF starts crawling a large document (19 MB, for example), it ends the job with the following error: "repeated service interruptions - failure processing document: software caused connection abort: socket write error". Can you give me some tips on how to solve this problem, please?

I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
I'm looking forward to your response.

Best regards,

Othman BELHAJ
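A footnote on Karl's ZooKeeper recommendation above: switching away from file-based sync means running the multiprocess-zk-example processes instead of the multiprocess-file-example ones, and pointing properties.xml at a ZooKeeper ensemble. A rough sketch of the relevant entries (property names as I recall them from the 2.8 multiprocess-zk-example; verify against the properties.xml shipped there, and the connect string is illustrative):

    <property name="org.apache.manifoldcf.lockmanagerclass"
              value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
    <property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
    <property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>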

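And on the 4 GB question: in the multiprocess examples, the agents-process JVM options live in the options.env.win / options.env.unix files, one option per line, so the heap ceiling is whatever -Xmx line appears there. A sketch only, with illustrative values (per Karl's answer above, the shipped defaults already cover a 35K-document crawl, since ManifoldCF streams documents rather than holding them in memory):

    -Xms512m
    -Xmx512m

Leave the other shipped options in those files (such as the -D define pointing at properties.xml) as they are.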