Hello Karl,

Thank you for your response. I will start using ZooKeeper and let you know if it works.

I have another question. I need to apply some filters while crawling: I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex support /i to ignore case?

Thanks,
Othman

On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:

> Hi Beelz,
>
> File-based sync is deprecated because people often have problems with
> getting file permissions right, and they do not understand how to shut
> processes down cleanly, and ZooKeeper is resilient against that. I highly
> recommend using ZooKeeper sync.
>
> ManifoldCF is engineered to not put files into memory, so you do not need
> huge amounts of memory. The default values are more than enough for 35,000
> files, which is a pretty small job for ManifoldCF.
>
> Thanks,
> Karl
>
>
> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]>
> wrote:
>
>> I'm actually not using ZooKeeper. I want to know how ZooKeeper differs
>> from file-based sync. I also need guidance on how to manage my PC's
>> memory. How many GB should I allocate for the start-agent of
>> ManifoldCF? Is 4 GB enough to crawl 35K files?
>>
>> Othman.
>>
>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>
>>> Your disk is not writable for some reason, and that's interfering with
>>> ManifoldCF 2.8 locking.
>>>
>>> I would suggest two things:
>>>
>>> (1) Use ZooKeeper for sync instead of file-based sync.
>>> (2) Have a look if you still get failures after that.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]>
>>> wrote:
>>>
>>>> Hi Mr Karl,
>>>>
>>>> Thank you for your quick response. I have looked into the
>>>> ManifoldCF log file and extracted the following warnings:
>>>>
>>>> - Attempt to set file lock
>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase)
>>>> Synapses.lock' failed : Access is denied.
>>>>
>>>> - Couldn't write to lock file; disk may be full. Shutting down process;
>>>> locks may be left dangling. You must cleanup before restarting.
>>>>
>>>> "ES (lowercase) synapses" is the Elasticsearch output connection.
>>>> The job uses Tika to extract metadata and a file system as the
>>>> repository connection. During the job, I don't extract the content of
>>>> the documents. I was wondering if the issue comes from Elasticsearch?
>>>>
>>>> Othman.
>>>>
>>>>
>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Othman,
>>>>>
>>>>> ManifoldCF aborts a job if there's an error that looks like it might
>>>>> go away on retry, but does not. It can be either on the repository
>>>>> side or on the output side. If you look at the Simple History in the
>>>>> UI, or at the manifoldcf.log file, you should be able to get a better
>>>>> sense of what went wrong. Without further information, I can't say
>>>>> any more.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm Othman Belhaj, a software engineer at Société Générale in
>>>>>> France. I'm using your recent version, ManifoldCF 2.8, to build an
>>>>>> internal search engine that indexes documents on Windows shares.
>>>>>> I encountered a serious problem while crawling 35K documents. Most
>>>>>> of the time, when ManifoldCF starts crawling a large document
>>>>>> (19 MB, for example), it ends the job with the following error:
>>>>>> repeated service interruptions - failure processing document :
>>>>>> software caused connection abort: socket write error.
>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>
>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>> I'm looking forward to your response.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Othman BELHAJ
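
On the regex question at the top of this message, here is a minimal sketch of case-insensitive exclusion patterns. It assumes the patterns are evaluated by Java's java.util.regex (ManifoldCF is written in Java); whether a given connector's include/exclude fields accept regular expressions or only simple wildcards is not confirmed here and should be checked in that connector's documentation. Java has no /i suffix: case-insensitive matching is requested with the inline flag (?i) or with Pattern.CASE_INSENSITIVE. The class name, the isExcluded helper, and both patterns are hypothetical and only illustrate the syntax.

```java
import java.util.regex.Pattern;

public class CrawlFilterSketch {
    // Hypothetical exclusion rules, for illustration only.
    // (?i) is Java's inline case-insensitive flag; it replaces Perl-style /i.
    private static final Pattern[] EXCLUDES = {
        // Skip files ending in .tmp or .bak, in any letter case.
        Pattern.compile("(?i).*\\.(tmp|bak)$"),
        // Skip a folder named "archive" (any case) and everything below it.
        Pattern.compile("(?i)(^|.*/)archive(/.*)?$")
    };

    // Returns true if the path matches at least one exclusion pattern.
    public static boolean isExcluded(String path) {
        // Normalize Windows separators so one pattern covers both styles.
        String normalized = path.replace('\\', '/');
        for (Pattern p : EXCLUDES) {
            if (p.matcher(normalized).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "D:\\share\\Archive\\report.pdf",
            "D:\\share\\docs\\notes.TMP",
            "D:\\share\\docs\\report.pdf"
        };
        for (String s : samples) {
            System.out.println(s + " excluded=" + isExcluded(s));
        }
    }
}
```

By default (?i) only folds ASCII letters; (?iu) would extend case folding to Unicode, which may matter for accented file names.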
