Hi Othman,

The way you restrict documents with the Windows share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?
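(A note on the regex question quoted below: ManifoldCF is a Java application, so its regular expressions follow java.util.regex semantics. Java has no Perl-style /i suffix; case-insensitive matching uses the inline (?i) flag or Pattern.CASE_INSENSITIVE. A minimal sketch; the .tmp exclusion pattern and the paths are made-up examples, not connector defaults:)

```java
import java.util.regex.Pattern;

public class RegexFilterDemo {
    public static void main(String[] args) {
        // Java has no /i modifier; the inline (?i) flag makes the whole
        // pattern case-insensitive.
        Pattern exclude = Pattern.compile("(?i).*\\.tmp$");

        // Hypothetical share paths, purely for illustration.
        System.out.println(exclude.matcher("/share/docs/report.TMP").matches()); // true
        System.out.println(exclude.matcher("/share/docs/report.pdf").matches()); // false

        // Equivalent form, using the flags argument instead of (?i):
        Pattern excludeFlag = Pattern.compile(".*\\.tmp$", Pattern.CASE_INSENSITIVE);
        System.out.println(excludeFlag.matcher("backup.Tmp").matches()); // true
    }
}
```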
Karl

On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:

> Hello Karl,
>
> Thank you for your response. I will start using ZooKeeper and will let
> you know if it works. I have another question: I need to apply some
> filters while crawling, because I don't want to crawl certain files and
> folders. Could you give me an example of how to use the regex? Does the
> regex allow /i to ignore case?
>
> Thanks,
> Othman
>
> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:
>
>> Hi Beelz,
>>
>> File-based sync is deprecated because people often have problems
>> getting file permissions right and do not understand how to shut
>> processes down cleanly; ZooKeeper is resilient against that. I highly
>> recommend using ZooKeeper sync.
>>
>> ManifoldCF is engineered not to load files into memory, so you do not
>> need huge amounts of memory. The default values are more than enough
>> for 35,000 files, which is a pretty small job for ManifoldCF.
>>
>> Thanks,
>> Karl
>>
>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:
>>
>>> I'm actually not using ZooKeeper. I want to know how ZooKeeper is
>>> different from file-based sync. I also need guidance on how to manage
>>> my PC's memory. How much memory should I allocate for the start-agents
>>> process of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>
>>> Othman.
>>>
>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>>
>>>> Your disk is not writable for some reason, and that's interfering
>>>> with ManifoldCF 2.8 locking.
>>>>
>>>> I would suggest two things:
>>>>
>>>> (1) Use ZooKeeper for sync instead of file-based sync.
>>>> (2) See whether you still get failures after that.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>
>>>>> Hi Mr Karl,
>>>>>
>>>>> Thank you for your quick response.
>>>>> I have looked into the ManifoldCF log file and extracted the
>>>>> following warnings:
>>>>>
>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>
>>>>> - Couldn't write to lock file; disk may be full. Shutting down
>>>>> process; locks may be left dangling. You must cleanup before
>>>>> restarting.
>>>>>
>>>>> "ES (Lowercase) Synapses" is the Elasticsearch output connection.
>>>>> Moreover, the job uses Tika to extract metadata and a file system
>>>>> repository connection. During the job, I don't extract the content of
>>>>> the documents. I was wondering if the issue comes from Elasticsearch?
>>>>>
>>>>> Othman.
>>>>>
>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Othman,
>>>>>>
>>>>>> ManifoldCF aborts a job if there's an error that looks like it
>>>>>> might go away on retry, but does not. It can be either on the
>>>>>> repository side or on the output side. If you look at the Simple
>>>>>> History in the UI, or at the manifoldcf.log file, you should be able
>>>>>> to get a better sense of what went wrong. Without further
>>>>>> information, I can't say any more.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale in
>>>>>>> France. I'm currently using your recent version, ManifoldCF 2.8.
>>>>>>> I'm working on an internal search engine, and for this reason I'm
>>>>>>> using ManifoldCF to index documents on Windows shares. I
>>>>>>> encountered a serious problem while crawling 35K documents.
>>>>>>> Most of the time, when ManifoldCF starts crawling a large document
>>>>>>> (19 MB, for example), it ends the job with the following error:
>>>>>>> "repeated service interruptions - failure processing document:
>>>>>>> software caused connection abort: socket write error".
>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>
>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>> I'm looking forward to your response.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Othman BELHAJ
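(A side note on the memory discussion earlier in the thread: Karl's point that ManifoldCF does not load whole files into memory is the standard streaming pattern, where document bytes pass through a small fixed buffer so heap use stays constant regardless of document size. A generic Java sketch of that idea follows; it is not ManifoldCF's actual code, and the class name and buffer size are made up for illustration:)

```java
import java.io.*;

public class StreamCopy {
    // Copy an input stream to an output through a small fixed buffer.
    // Memory use stays constant no matter how large the document is.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Pretend this byte array is a large crawled document.
        byte[] data = new byte[1_000_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println(copied); // prints 1000000
    }
}
```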
