It looks like a dependency of Apache POI is missing. I think we will need a ticket to address this, if you are indeed using the binary distribution.
Thanks!
Karl

On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:

I'm actually using the binary version. For security reasons, I can't send any files from my computer. I have copied the stack trace and scanned it with my cellphone; I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict the crawling, and I don't think the '|' works as specified. For instance, I would like to restrict the crawl to documents that contain the word 'son' ('sound'). I proceed as follows: *(SON)*. The document name is in capital letters, and I noticed that the filter didn't take it into consideration.

Thanks,
Othman

On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:

Hi Othman,

The way you restrict documents with the Windows Share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?

Karl

On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:

Hello Karl,

Thank you for your response. I will start using ZooKeeper and will let you know if it works. I have another question to ask: I need to apply some filters while crawling, since I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex allow /i to ignore case?

Thanks,
Othman

On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:

Hi Beelz,

File-based sync is deprecated because people often have problems getting file permissions right, and they do not understand how to shut processes down cleanly; ZooKeeper is resilient against that. I highly recommend using ZooKeeper sync.
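[On the /i question above: ManifoldCF is Java-based, so assuming the Paths-tab regexes are evaluated with Java's `java.util.regex`, there is no Perl-style trailing `/i`; case-insensitivity is expressed with the inline `(?i)` flag or the `Pattern.CASE_INSENSITIVE` compile flag instead. A minimal sketch — the file name below is hypothetical:]

```java
import java.util.regex.Pattern;

public class CaseInsensitiveMatch {
    public static void main(String[] args) {
        // Inline flag form: (?i) makes the rest of the pattern case-insensitive.
        Pattern inline = Pattern.compile("(?i)son");
        // Equivalent programmatic form using a compile flag.
        Pattern flagged = Pattern.compile("son", Pattern.CASE_INSENSITIVE);

        String fileName = "SONG_REPORT.docx"; // hypothetical all-caps document name

        System.out.println(inline.matcher(fileName).find());  // true
        System.out.println(flagged.matcher(fileName).find()); // true
        // Without the flag, 'son' does not match the upper-case name:
        System.out.println(Pattern.compile("son").matcher(fileName).find()); // false
    }
}
```

[So a specification like `(?i)(SON)` would match regardless of case, where a bare `(SON)` would not.]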
ManifoldCF is engineered to not put files into memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.

Thanks,
Karl

On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:

I'm actually not using ZooKeeper. I want to know how ZooKeeper differs from file-based sync. I also need guidance on how to manage my PC's memory: how many GB should I allocate for the start-agents process of ManifoldCF? Is 4 GB enough to crawl 35K files?

Othman.

On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:

Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.

I would suggest two things:

(1) Use ZooKeeper for sync instead of file-based sync.
(2) See whether you still get failures after that.

Thanks,
Karl

On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:

Hi Mr Karl,

Thank you for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:

- Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.

- Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must cleanup before restarting.

'ES (lowercase) synapses' is the Elasticsearch output connection. Moreover, the job uses Tika to extract metadata and a file system as a repository connection.
During the job, I don't extract the content of the documents. I was wondering if the issue comes from Elasticsearch?

Othman.

On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:

Hi Othman,

ManifoldCF aborts a job if there's an error that looks like it might go away on retry, but does not. It can be either on the repository side or on the output side. If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.

Thanks,
Karl

On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:

Hello,

I'm Othman Belhaj, a software engineer at Société Générale in France. I'm currently using your recent version, ManifoldCF 2.8, to build an internal search engine; for this reason, I'm using ManifoldCF to index documents on Windows shares. I encountered a serious problem while crawling 35K documents. Most of the time, when ManifoldCF starts crawling a large document (19 MB, for example), it ends the job with the following error: repeated service interruptions - failure processing document: software caused connection abort: socket write error. Can you give me some tips on how to solve this problem, please?

I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
I'm looking forward to your response.

Best regards,

Othman BELHAJ
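[Regarding the "Access is denied" lock warnings earlier in the thread: one quick way to confirm Karl's diagnosis that the synch area is not writable, before restarting the agents, is to probe the directory with a throwaway file. A minimal sketch — the path below is hypothetical; substitute the synch directory configured in your properties.xml:]

```java
import java.io.File;
import java.io.IOException;

public class SynchAreaProbe {
    public static void main(String[] args) {
        // Hypothetical location; use the synch directory configured for
        // your multiprocess-file-example instance.
        File synchArea = new File("D:\\apache_manifoldcf-2.8\\multiprocess-file-example\\synch area");
        File probe = new File(synchArea, "probe.lock");
        try {
            if (probe.createNewFile()) {
                System.out.println("Synch area is writable.");
            } else {
                System.out.println("Probe file already exists; stale locks may be present.");
            }
        } catch (IOException e) {
            // An "Access is denied" message here matches the warning
            // seen in manifoldcf.log, pointing at permissions rather
            // than at Elasticsearch.
            System.out.println("Cannot write to synch area: " + e.getMessage());
        } finally {
            probe.delete();
        }
    }
}
```

[If the probe fails, fixing the directory's ACLs (or moving to ZooKeeper sync, as recommended above) is the path forward; the error is local, not on the Elasticsearch side.]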
