It's not related at all to elasticsearch.

Karl

On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <[email protected]> wrote:

> Could it be a problem with the elasticsearch version? I'm actually using
> 2.1.0, which is pretty old for this new version of ManifoldCF.
>
> Othman.
>
> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki <[email protected]> wrote:
>
>> I moved back both the jars you mentioned and a different error is showing.
>> You will find the stack trace attached.
>>
>> Thanks,
>> Othman
>>
>> On Thu, 31 Aug 2017 at 17:09, Karl Wright <[email protected]> wrote:
>>
>>> I've looked at the dependencies; you should not have moved
>>> poi-3.15.jar. Please move that back, and commons-collections4-4.1.jar too.
>>>
>>> You *will* need to move curvesapi-1.04.jar though.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <[email protected]>
>>> wrote:
>>>
>>>> If you include poi.jar, then all dependencies of poi.jar must also be
>>>> included. This would mean that curvesapi-1.04.jar and
>>>> commons-collections4-4.1.jar should also be included.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> I added the two jars that you mentioned and another one:
>>>>> poi-3.15.jar. Unfortunately, there is another error showing. This time,
>>>>> it concerns Excel files. You will find attached the stack trace.
>>>>>
>>>>> Othman.
>>>>>
>>>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Othman,
>>>>>>
>>>>>> Yes, this shows that the jar we moved calls back into another jar,
>>>>>> which will also need to be moved. *That* jar has yet another dependency
>>>>>> too.
>>>>>>
>>>>>> The list of jars is thus extended to include:
>>>>>>
>>>>>> poi-ooxml-3.15.jar
>>>>>> dom4j-1.6.1.jar
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> You will find attached the stack trace. My apologies for the bad
>>>>>>> quality of the image; I'm doing my best to send you the stack trace, as
>>>>>>> I don't have the right to send documents outside the company.
>>>>>>>
>>>>>>> Thank you for your time,
>>>>>>>
>>>>>>> Othman
>>>>>>>
>>>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Oh, actually it didn't solve the problem. I looked into the log
>>>>>>>>> file and saw the following error:
>>>>>>>>>
>>>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>>>>
>>>>>>>>> Maybe another jar is missing?
>>>>>>>>>
>>>>>>>>> Othman.
>>>>>>>>>
>>>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have tried what you told me to do and, as you expected, the
>>>>>>>>>> crawling resumed. How about the regular expressions? How can I write
>>>>>>>>>> complex regular expressions in the job's Paths tab?
>>>>>>>>>>
>>>>>>>>>> Thank you very much for your help.
>>>>>>>>>>
>>>>>>>>>> Othman.
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>>>>
>>>>>>>>>>> Othman.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Oh, and you also may need to edit your options.env files to
>>>>>>>>>>>> include them in the classpath for startup.
>>>>>>>>>>>>
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you are amenable, there is another workaround you could
>>>>>>>>>>>>> try. Specifically:
>>>>>>>>>>>>>
>>>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>>>> (2) Move the following two files from connector-common-lib to
>>>>>>>>>>>>> lib:
>>>>>>>>>>>>>
>>>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>>>
>>>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>>>>>> transformer rather than the embedded Tika Extractor. I'm still
>>>>>>>>>>>>>> looking into why the jar is not being found.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my
>>>>>>>>>>>>>>> job got stuck on that specific file.
>>>>>>>>>>>>>>> The job status is still Running. You can see it in the
>>>>>>>>>>>>>>> attached file. For your information, the job started yesterday.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm actually using the binary version. For security
>>>>>>>>>>>>>>>>> reasons, I can't send any files from my computer. I have
>>>>>>>>>>>>>>>>> copied the stack trace and scanned it with my cellphone. I
>>>>>>>>>>>>>>>>> hope it will be helpful.
>>>>>>>>>>>>>>>>> Meanwhile, I have read the documentation about how to
>>>>>>>>>>>>>>>>> restrict the crawling and I don't think the '|' works in the
>>>>>>>>>>>>>>>>> specified. For instance, I would like to restrict the
>>>>>>>>>>>>>>>>> crawling to the documents that contain the word 'sound'. I
>>>>>>>>>>>>>>>>> proceed as follows: *(SON)* . The document is in capital
>>>>>>>>>>>>>>>>> letters and I noticed that it wasn't taken into
>>>>>>>>>>>>>>>>> consideration.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The way you restrict documents with the windows share
>>>>>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab in
>>>>>>>>>>>>>>>>>> jobs that crawl windows shares. There is end-user
>>>>>>>>>>>>>>>>>> documentation both online and distributed with all binary
>>>>>>>>>>>>>>>>>> distributions that describes how to do this.
>>>>>>>>>>>>>>>>>> Have you found it?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you for your response. I will start using
>>>>>>>>>>>>>>>>>>> zookeeper and I will let you know if it works. I have
>>>>>>>>>>>>>>>>>>> another question to ask. Actually, I need to apply some
>>>>>>>>>>>>>>>>>>> filters while crawling. I don't want to crawl some files
>>>>>>>>>>>>>>>>>>> and some folders. Could you give me an example of how to
>>>>>>>>>>>>>>>>>>> use the regex? Does the regex allow using /i to ignore
>>>>>>>>>>>>>>>>>>> case?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often have
>>>>>>>>>>>>>>>>>>>> problems with getting file permissions right, and they do
>>>>>>>>>>>>>>>>>>>> not understand how to shut processes down cleanly, and
>>>>>>>>>>>>>>>>>>>> zookeeper is resilient against that. I highly recommend
>>>>>>>>>>>>>>>>>>>> using zookeeper sync.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory
>>>>>>>>>>>>>>>>>>>> so you do not need huge amounts of memory. The default
>>>>>>>>>>>>>>>>>>>> values are more than enough for 35,000 files, which is a
>>>>>>>>>>>>>>>>>>>> pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how
>>>>>>>>>>>>>>>>>>>>> zookeeper is different from file-based sync. I also need
>>>>>>>>>>>>>>>>>>>>> guidance on how to manage my PC's memory. How many GB
>>>>>>>>>>>>>>>>>>>>> should I allocate for the start-agent of ManifoldCF? Is
>>>>>>>>>>>>>>>>>>>>> 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and that's
>>>>>>>>>>>>>>>>>>>>>> interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based sync.
>>>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thank you for your quick response. I have
>>>>>>>>>>>>>>>>>>>>>>> looked into the ManifoldCF log file and extracted the
>>>>>>>>>>>>>>>>>>>>>>> following warnings:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.
>>>>>>>>>>>>>>>>>>>>>>> 8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>>>>>>>>>>>>>> (Lowercase) Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling. You
>>>>>>>>>>>>>>>>>>>>>>> must cleanup before restarting.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ES (lowercase) synapses is the elasticsearch
>>>>>>>>>>>>>>>>>>>>>>> output connection. Moreover, the job uses Tika to
>>>>>>>>>>>>>>>>>>>>>>> extract metadata and a file system as a repository
>>>>>>>>>>>>>>>>>>>>>>> connection. During the job, I don't extract the
>>>>>>>>>>>>>>>>>>>>>>> content of the documents. I was wondering if the issue
>>>>>>>>>>>>>>>>>>>>>>> comes from elasticsearch?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that
>>>>>>>>>>>>>>>>>>>>>>>> looks like it might go away on retry, but does not.
>>>>>>>>>>>>>>>>>>>>>>>> It can be either on the repository side or on the
>>>>>>>>>>>>>>>>>>>>>>>> output side. If you look at the Simple History in the
>>>>>>>>>>>>>>>>>>>>>>>> UI, or at the manifoldcf.log file, you should be able
>>>>>>>>>>>>>>>>>>>>>>>> to get a better sense of what went wrong. Without
>>>>>>>>>>>>>>>>>>>>>>>> further information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from
>>>>>>>>>>>>>>>>>>>>>>>>> société générale in France. I'm actually using your
>>>>>>>>>>>>>>>>>>>>>>>>> recent version of ManifoldCF, 2.8. I'm working on an
>>>>>>>>>>>>>>>>>>>>>>>>> internal search engine. For this reason, I'm using
>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF to index documents on windows shares. I
>>>>>>>>>>>>>>>>>>>>>>>>> encountered a serious problem while crawling 35K
>>>>>>>>>>>>>>>>>>>>>>>>> documents. Most of the time, when ManifoldCF starts
>>>>>>>>>>>>>>>>>>>>>>>>> crawling a large document (19 MB, for example), it
>>>>>>>>>>>>>>>>>>>>>>>>> ends the job with the following error:
>>>>>>>>>>>>>>>>>>>>>>>>> repeated service
>>>>>>>>>>>>>>>>>>>>>>>>> interruptions - failure processing document :
>>>>>>>>>>>>>>>>>>>>>>>>> software caused connection
>>>>>>>>>>>>>>>>>>>>>>>>> abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
