I have tried what you told me to do, and, as you expected, the crawling resumed. What about the regular expressions? How can I build complex regular expressions in the job's Paths tab?
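[A note for readers of this thread: ManifoldCF is written in Java, so where a connector's path filters accept regular expressions, Java regex syntax applies. In Java there is no `/i` suffix; case-insensitivity is expressed with the inline `(?i)` flag, and alternation with `|` works as usual. A minimal sketch, assuming Java regex semantics for the filter fields — the share path and folder names below are hypothetical:]

```java
import java.util.regex.Pattern;

public class PathFilterDemo {
    public static void main(String[] args) {
        String path = "\\\\share\\docs\\SON_report.docx"; // hypothetical UNC path

        // Case-sensitive: lowercase "son" does not match uppercase "SON".
        System.out.println(Pattern.compile("son").matcher(path).find());      // false

        // (?i) is Java's inline ignore-case flag, the equivalent of /i.
        System.out.println(Pattern.compile("(?i)son").matcher(path).find());  // true

        // '|' alternation, e.g. excluding two folders at once
        // (folder names are hypothetical):
        Pattern exclude = Pattern.compile("(?i).*\\\\(archive|tmp)\\\\.*");
        System.out.println(exclude.matcher("\\\\share\\Archive\\old.doc").find()); // true
    }
}
```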
Thank you very much for your help.

Othman.

On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]> wrote:

> Ok, I will try it right away and let you know if it works.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]> wrote:
>
>> Oh, and you also may need to edit your options.env files to include
>> them in the classpath for startup.
>>
>> Karl
>>
>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]> wrote:
>>
>>> If you are amenable, there is another workaround you could try.
>>> Specifically:
>>>
>>> (1) Shut down all MCF processes.
>>> (2) Move the following two files from connector-common-lib to lib:
>>>
>>>     xmlbeans-2.6.0.jar
>>>     poi-ooxml-schemas-3.15.jar
>>>
>>> (3) Restart everything and see if your crawl resumes.
>>>
>>> Please let me know what happens.
>>>
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> I created a ticket for this: CONNECTORS-1450.
>>>>
>>>> One simple workaround is to use the external Tika server transformer
>>>> rather than the embedded Tika Extractor. I'm still looking into why
>>>> the jar is not being found.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>
>>>>> Yes, I'm actually using the latest binary version, and my job got
>>>>> stuck on that specific file. The job status is still Running; you
>>>>> can see it in the attached file. For your information, the job
>>>>> started yesterday.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Othman
>>>>>
>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> It looks like a dependency of Apache POI is missing. I think we
>>>>>> will need a ticket to address this, if you are indeed using the
>>>>>> binary distribution.
>>>>>>
>>>>>> Thanks!
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>
>>>>>>> I'm actually using the binary version.
>>>>>>> For security reasons, I can't send any files from my computer. I
>>>>>>> have copied the stack trace and scanned it with my cellphone. I
>>>>>>> hope it will be helpful. Meanwhile, I have read the documentation
>>>>>>> about how to restrict the crawling, and I don't think the '|' works
>>>>>>> in the specification. For instance, I would like to restrict the
>>>>>>> crawling to the documents that contain the word 'sound'. I proceed
>>>>>>> as follows: *(SON)*. The document name is in capital letters, and I
>>>>>>> noticed that it didn't take it into consideration.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Othman
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Othman,
>>>>>>>>
>>>>>>>> The way you restrict documents with the Windows Share connector
>>>>>>>> is by specifying information on the "Paths" tab in jobs that
>>>>>>>> crawl Windows shares. There is end-user documentation, both
>>>>>>>> online and distributed with all binary distributions, that
>>>>>>>> describes how to do this. Have you found it?
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello Karl,
>>>>>>>>>
>>>>>>>>> Thank you for your response. I will start using ZooKeeper and I
>>>>>>>>> will let you know if it works. I have another question to ask.
>>>>>>>>> Actually, I need to apply some filters while crawling: I don't
>>>>>>>>> want to crawl some files and some folders. Could you give me an
>>>>>>>>> example of how to use the regex? Does the regex allow using /i
>>>>>>>>> to ignore case?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Othman
>>>>>>>>>
>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Beelz,
>>>>>>>>>>
>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>> problems getting file permissions right and do not understand
>>>>>>>>>> how to shut processes down cleanly; ZooKeeper is resilient
>>>>>>>>>> against that. I highly recommend using ZooKeeper sync.
>>>>>>>>>>
>>>>>>>>>> ManifoldCF is engineered not to pull whole files into memory,
>>>>>>>>>> so you do not need huge amounts of memory. The default values
>>>>>>>>>> are more than enough for 35,000 files, which is a pretty small
>>>>>>>>>> job for ManifoldCF.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm actually not using ZooKeeper. I want to know how ZooKeeper
>>>>>>>>>>> is different from file-based sync. I also need guidance on how
>>>>>>>>>>> to manage my PC's memory: how many GB should I allocate for
>>>>>>>>>>> the agents process of ManifoldCF? Is 4 GB enough in order to
>>>>>>>>>>> crawl 35K files?
>>>>>>>>>>>
>>>>>>>>>>> Othman.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>
>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>
>>>>>>>>>>>> (1) Use ZooKeeper for sync instead of file-based sync.
>>>>>>>>>>>> (2) See whether you still get failures after that.
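[On the memory question above: in the multiprocess binary distribution, the JVM heap for each ManifoldCF process is set in its options.env.win / options.env.unix file, one JVM option per line. A hedged sketch of raising the heap ceiling to 4 GB — the exact file contents and default values vary by release, and as Karl notes the defaults are normally sufficient for a 35K-file crawl:]

```
-Xms1024m
-Xmx4096m
```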
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your quick response. I have looked into the
>>>>>>>>>>>>> ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting
>>>>>>>>>>>>> down process; locks may be left dangling. You must cleanup
>>>>>>>>>>>>> before restarting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> "ES (lowercase) synapses" is the Elasticsearch output
>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract metadata
>>>>>>>>>>>>> and a file system as a repository connection. During the
>>>>>>>>>>>>> job, I don't extract the content of the documents. I was
>>>>>>>>>>>>> wondering if the issue comes from Elasticsearch?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that looks like
>>>>>>>>>>>>>> it might go away on retry, but does not. It can be either
>>>>>>>>>>>>>> on the repository side or on the output side. If you look
>>>>>>>>>>>>>> at the Simple History in the UI, or at the manifoldcf.log
>>>>>>>>>>>>>> file, you should be able to get a better sense of what went
>>>>>>>>>>>>>> wrong. Without further information, I can't say any more.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale
>>>>>>>>>>>>>>> in France. I'm currently using your recent version,
>>>>>>>>>>>>>>> ManifoldCF 2.8. I'm working on an internal search engine,
>>>>>>>>>>>>>>> and for this reason I'm using ManifoldCF to index documents
>>>>>>>>>>>>>>> on Windows shares. I encountered a serious problem while
>>>>>>>>>>>>>>> crawling 35K documents. Most of the time, when ManifoldCF
>>>>>>>>>>>>>>> starts crawling a big document (19 MB, for example), it
>>>>>>>>>>>>>>> ends the job with the following error: "repeated service
>>>>>>>>>>>>>>> interruptions - failure processing document: software
>>>>>>>>>>>>>>> caused connection abort: socket write error". Can you give
>>>>>>>>>>>>>>> me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman BELHAJ
