Hi Othman,

Yes, this shows that the jar we moved calls back into another jar, which will also need to be moved. *That* jar has yet another dependency too.
The list of jars is thus extended to include:

poi-ooxml-3.15.jar
dom4j-1.6.1.jar

Karl

On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <[email protected]> wrote:

> You will find attached the stack trace. My apologies for the bad quality of the image; I'm doing my best to send you the stack trace, as I don't have the right to send documents outside the company.
>
> Thank you for your time,
>
> Othman
>
> On Thu, 31 Aug 2017 at 15:16, Karl Wright <[email protected]> wrote:
>
>> Once again, I need a stack trace to diagnose what the problem is.
>>
>> Thanks,
>> Karl
>>
>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <[email protected]> wrote:
>>
>>> Oh, actually it didn't solve the problem. I looked into the log file and saw the following error:
>>>
>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
>>>
>>> Maybe another jar is missing?
>>>
>>> Othman.
>>>
>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <[email protected]> wrote:
>>>
>>>> I have tried what you told me to do, and as you expected, the crawling resumed. How about the regular expressions? How can I make complex regular expressions in the job's Paths tab?
>>>>
>>>> Thank you very much for your help.
>>>>
>>>> Othman.
>>>>
>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]> wrote:
>>>>
>>>>> Ok, I will try it right away and let you know if it works.
>>>>>
>>>>> Othman.
>>>>>
>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Oh, and you may also need to edit your options.env files to include them in the classpath for startup.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> If you are amenable, there is another workaround you could try. Specifically:
>>>>>>>
>>>>>>> (1) Shut down all MCF processes.
>>>>>>> (2) Move the following two files from connector-common-lib to lib:
>>>>>>>
>>>>>>> xmlbeans-2.6.0.jar
>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>
>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>
>>>>>>> Please let me know what happens.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>
>>>>>>>> One simple workaround is to use the external Tika server transformer rather than the embedded Tika Extractor. I'm still looking into why the jar is not being found.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, I'm actually using the latest binary version, and my job got stuck on that specific file. The job status is still Running; you can see it in the attached file. For your information, the job started yesterday.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Othman
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> It looks like a dependency of Apache POI is missing. I think we will need a ticket to address this, if you are indeed using the binary distribution.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm actually using the binary version. For security reasons, I can't send any files from my computer, so I have copied the stack trace and scanned it with my cellphone; I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict the crawling, and I don't think the '|' works in the specification. For instance, I would like to restrict the crawling to documents containing the word 'sound'. I proceed as follows: *(SON)*. The document name is in capital letters, and I noticed that the filter didn't take it into consideration.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Othman
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>
>>>>>>>>>>>> The way you restrict documents with the Windows Share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your response. I will start using zookeeper and will let you know if it works. I have another question: I need to apply some filters while crawling, because I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex allow using /i to ignore case?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> File-based sync is deprecated because people often have problems getting file permissions right and do not understand how to shut processes down cleanly; zookeeper is resilient against that. I highly recommend using zookeeper sync.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ManifoldCF is engineered not to hold files in memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how zookeeper is different from file-based sync. I also need guidance on how to manage my PC's memory. How many GB should I allocate for the agents process of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>> (2) See whether you still get failures after that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must clean up before restarting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 'ES (lowercase) synapses' being the elasticsearch output connection. Moreover, the job uses Tika to extract metadata and a file system as the repository connection. During the job, I don't extract the content of the documents. I was wondering if the issue comes from elasticsearch.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job when there's an error that looks like it might go away on retry but does not. It can be either on the repository side or on the output side.
>>>>>>>>>>>>>>>>>> If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale in France. I'm actually using your recent version, ManifoldCF 2.8. I'm working on an internal search engine; for this reason, I'm using manifoldcf to index documents on Windows shares. I encountered a serious problem while crawling 35K documents. Most of the time, when manifoldcf starts crawling a big document (19 MB, for example), it ends the job with the following error: repeated service interruptions - failure processing document: software caused connection abort: socket write error. Can you give me some tips on how to solve this problem, please?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Othman BELHAJ
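A note on the regex questions raised in the thread ('|' alternation, a /i flag, and the *(SON)* pattern): assuming the Paths-tab filters are evaluated as standard Java regular expressions (an assumption worth verifying against the connector's end-user documentation), there is no trailing /i modifier as in JavaScript or Perl. Case-insensitive matching comes from the inline (?i) flag or Pattern.CASE_INSENSITIVE, and '|' does work for alternation inside a group. A minimal sketch, with made-up paths:

```java
import java.util.regex.Pattern;

// Sketch: case-insensitive matching with Java regexes.
public class PathFilterSketch {
    public static void main(String[] args) {
        // (?i) turns on case-insensitive matching; '|' is alternation.
        Pattern filter = Pattern.compile("(?i).*(son|sound).*");

        System.out.println(filter.matcher("\\\\server\\share\\SON\\report.docx").matches());  // true
        System.out.println(filter.matcher("\\\\server\\share\\Sound\\notes.txt").matches());  // true
        System.out.println(filter.matcher("\\\\server\\share\\misc\\other.txt").matches());   // false

        // Equivalent, using the flag constant instead of the inline (?i):
        Pattern same = Pattern.compile(".*(son|sound).*", Pattern.CASE_INSENSITIVE);
        System.out.println(same.matcher("SON").matches());  // true
    }
}
```

This would explain the *(SON)* symptom: Java regex matching is case-sensitive by default, so an all-caps pattern silently misses lowercase names unless (?i) is prepended.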

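On the memory question in the thread ("Go"/"Mo" are the French abbreviations for GB/MB): as stated above, ManifoldCF streams document content rather than holding whole files in memory, so the defaults are more than enough for a 35K-file crawl. If you still want to raise the agents-process heap, the options.env files shipped with the multiprocess examples hold JVM options, one per line; a sketch with illustrative values only (check the exact file names and defaults in your distribution):

```
-Xms512m
-Xmx4096m
```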