Yes, I added it to the options.env.win file. Should it be the one in the multiprocess-zk-example directory or the one in multiprocess-file-example?
On Thu, 31 Aug 2017 at 17:30, Karl Wright <[email protected]> wrote:

> It's not related at all to elasticsearch.
> Karl
>
>
> On Thu, Aug 31, 2017 at 11:26 AM, Beelz Ryuzaki <[email protected]>
> wrote:
>
>> Could it be a problem with elasticsearch's version? I'm actually using
>> 2.1.0, which is pretty old for this new version of ManifoldCF.
>>
>> Othman.
>>
>> On Thu, 31 Aug 2017 at 17:23, Beelz Ryuzaki <[email protected]> wrote:
>>
>>> I moved back both the jars you mentioned and a different error is
>>> showing. You will find the stack trace attached.
>>>
>>> Thanks,
>>> Othman
>>>
>>> On Thu, 31 Aug 2017 at 17:09, Karl Wright <[email protected]> wrote:
>>>
>>>> I've looked at the dependencies; you should not have moved
>>>> poi-3.15.jar. Please move that back, and commons-collections4-4.1.jar too.
>>>>
>>>> You *will* need to move curvesapi-1.04.jar though.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Thu, Aug 31, 2017 at 11:04 AM, Karl Wright <[email protected]>
>>>> wrote:
>>>>
>>>>> If you include poi.jar, then all dependencies of poi.jar must also be
>>>>> included. This would mean that curvesapi-1.04.jar and
>>>>> commons-collections4-4.1.jar should also be included.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Aug 31, 2017 at 10:23 AM, Beelz Ryuzaki <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> I added the two jars that you mentioned and another one:
>>>>>> poi-3.15.jar. Unfortunately, there is another error showing. This time,
>>>>>> it concerns excel files. You will find the stack trace attached.
>>>>>>
>>>>>> Othman.
>>>>>>
>>>>>> On Thu, 31 Aug 2017 at 15:32, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Othman,
>>>>>>>
>>>>>>> Yes, this shows that the jar we moved calls back into another jar,
>>>>>>> which will also need to be moved. *That* jar has yet another dependency
>>>>>>> too.
>>>>>>>
>>>>>>> The list of jars is thus extended to include:
>>>>>>>
>>>>>>> poi-ooxml-3.15.jar
>>>>>>> dom4j-1.6.1.jar
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 31, 2017 at 9:25 AM, Beelz Ryuzaki <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You will find the stack trace attached. My apologies for the bad
>>>>>>>> quality of the image. I'm doing my best to send you the stack trace, as
>>>>>>>> I don't have the right to send documents outside the company.
>>>>>>>>
>>>>>>>> Thank you for your time,
>>>>>>>>
>>>>>>>> Othman
>>>>>>>>
>>>>>>>> On Thu, 31 Aug 2017 at 15:16, Karl Wright <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Once again, I need a stack trace to diagnose what the problem is.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 31, 2017 at 9:14 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Oh, actually it didn't solve the problem. I looked into the log
>>>>>>>>>> file and saw the following error:
>>>>>>>>>>
>>>>>>>>>> Error tossed : org/apache/poi/POIXMLTypeLoader
>>>>>>>>>> java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader.
>>>>>>>>>>
>>>>>>>>>> Maybe another jar is missing?
>>>>>>>>>>
>>>>>>>>>> Othman.
>>>>>>>>>>
>>>>>>>>>> On Thu, 31 Aug 2017 at 15:01, Beelz Ryuzaki <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have tried what you told me to do and, as you expected, the
>>>>>>>>>>> crawling resumed. How about the regular expressions? How can I make
>>>>>>>>>>> complex regular expressions in the job's paths tab?
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much for your help.
>>>>>>>>>>>
>>>>>>>>>>> Othman.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:47, Beelz Ryuzaki <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok, I will try it right away and let you know if it works.
>>>>>>>>>>>>
>>>>>>>>>>>> Othman.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 31 Aug 2017 at 14:15, Karl Wright <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Oh, and you also may need to edit your options.env files to
>>>>>>>>>>>>> include them in the classpath for startup.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:53 AM, Karl Wright <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you are amenable, there is another workaround you could
>>>>>>>>>>>>>> try. Specifically:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (1) Shut down all MCF processes.
>>>>>>>>>>>>>> (2) Move the following two files from connector-common-lib to
>>>>>>>>>>>>>> lib:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> xmlbeans-2.6.0.jar
>>>>>>>>>>>>>> poi-ooxml-schemas-3.15.jar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (3) Restart everything and see if your crawl resumes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please let me know what happens.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:33 AM, Karl Wright <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I created a ticket for this: CONNECTORS-1450.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One simple workaround is to use the external Tika server
>>>>>>>>>>>>>>> transformer rather than the embedded Tika Extractor. I'm still
>>>>>>>>>>>>>>> looking into why the jar is not being found.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 7:08 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, I'm actually using the latest binary version, and my
>>>>>>>>>>>>>>>> job got stuck on that specific file.
>>>>>>>>>>>>>>>> The job status is still Running. You can see it in the
>>>>>>>>>>>>>>>> attached file. For your information, the job started yesterday.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 13:04, Karl Wright <[email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It looks like a dependency of Apache POI is missing.
>>>>>>>>>>>>>>>>> I think we will need a ticket to address this, if you are
>>>>>>>>>>>>>>>>> indeed using the binary distribution.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm actually using the binary version. For security
>>>>>>>>>>>>>>>>>> reasons, I can't send any files from my computer. I have
>>>>>>>>>>>>>>>>>> copied the stack trace and scanned it with my cellphone. I
>>>>>>>>>>>>>>>>>> hope it will be helpful.
>>>>>>>>>>>>>>>>>> Meanwhile, I have read the documentation about how to
>>>>>>>>>>>>>>>>>> restrict the crawling, and I don't think the '|' works in the
>>>>>>>>>>>>>>>>>> specified. For instance, I would like to restrict the crawling
>>>>>>>>>>>>>>>>>> to the documents that contain the word 'sound'. I proceed as
>>>>>>>>>>>>>>>>>> follows: *(SON)* . The document is in capital letters, and I
>>>>>>>>>>>>>>>>>> noticed that it wasn't taken into consideration.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The way you restrict documents with the windows share
>>>>>>>>>>>>>>>>>>> connector is by specifying information on the "Paths" tab
>>>>>>>>>>>>>>>>>>> in jobs that crawl windows shares.
>>>>>>>>>>>>>>>>>>> There is end-user documentation, both online and
>>>>>>>>>>>>>>>>>>> distributed with all binary distributions, that describes how
>>>>>>>>>>>>>>>>>>> to do this. Have you found it?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello Karl,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you for your response. I will start using
>>>>>>>>>>>>>>>>>>>> zookeeper and I will let you know if it works. I have
>>>>>>>>>>>>>>>>>>>> another question to ask. Actually, I need to apply some
>>>>>>>>>>>>>>>>>>>> filters while crawling. I don't want to crawl some files and
>>>>>>>>>>>>>>>>>>>> some folders. Could you give me an example of how to use the
>>>>>>>>>>>>>>>>>>>> regex? Does the regex allow the use of /i to ignore case?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Othman
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Beelz,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> File-based sync is deprecated because people often
>>>>>>>>>>>>>>>>>>>>> have problems with getting file permissions right, and
>>>>>>>>>>>>>>>>>>>>> they do not understand how to shut processes down cleanly,
>>>>>>>>>>>>>>>>>>>>> and zookeeper is resilient against that. I highly
>>>>>>>>>>>>>>>>>>>>> recommend using zookeeper sync.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ManifoldCF is engineered to not put files into memory
>>>>>>>>>>>>>>>>>>>>> so you do not need huge amounts of memory.
>>>>>>>>>>>>>>>>>>>>> The default values are more than enough for 35,000
>>>>>>>>>>>>>>>>>>>>> files, which is a pretty small job for ManifoldCF.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm actually not using zookeeper. I want to know how
>>>>>>>>>>>>>>>>>>>>>> zookeeper is different from file-based sync. I also need
>>>>>>>>>>>>>>>>>>>>>> guidance on how to manage my PC's memory. How many GB
>>>>>>>>>>>>>>>>>>>>>> should I allocate for the start-agent of ManifoldCF? Is
>>>>>>>>>>>>>>>>>>>>>> 4 GB enough to crawl 35K files?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Your disk is not writable for some reason, and
>>>>>>>>>>>>>>>>>>>>>>> that's interfering with ManifoldCF 2.8 locking.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I would suggest two things:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> (1) Use Zookeeper for sync instead of file-based
>>>>>>>>>>>>>>>>>>>>>>> sync.
>>>>>>>>>>>>>>>>>>>>>>> (2) Have a look if you still get failures after that.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Mr Karl,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thank you Mr Karl for your quick response.
>>>>>>>>>>>>>>>>>>>>>>>> I have looked into the ManifoldCF log file and
>>>>>>>>>>>>>>>>>>>>>>>> extracted the following warnings:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - Attempt to set file lock
>>>>>>>>>>>>>>>>>>>>>>>> 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch
>>>>>>>>>>>>>>>>>>>>>>>> area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES
>>>>>>>>>>>>>>>>>>>>>>>> (Lowercase)
>>>>>>>>>>>>>>>>>>>>>>>> Synapses.lock' failed : Access is denied.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - Couldn't write to lock file; disk may be full.
>>>>>>>>>>>>>>>>>>>>>>>> Shutting down process; locks may be left dangling. You
>>>>>>>>>>>>>>>>>>>>>>>> must cleanup before restarting.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ES (lowercase) synapses is the elasticsearch output
>>>>>>>>>>>>>>>>>>>>>>>> connection. Moreover, the job uses Tika to extract
>>>>>>>>>>>>>>>>>>>>>>>> metadata and a file system as a repository connection.
>>>>>>>>>>>>>>>>>>>>>>>> During the job, I don't extract the content of the
>>>>>>>>>>>>>>>>>>>>>>>> documents. I was wondering if the issue comes from
>>>>>>>>>>>>>>>>>>>>>>>> elasticsearch?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Othman.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Othman,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF aborts a job if there's an error that
>>>>>>>>>>>>>>>>>>>>>>>>> looks like it might go away on retry, but does not.
>>>>>>>>>>>>>>>>>>>>>>>>> It can be either on the repository side or on the
>>>>>>>>>>>>>>>>>>>>>>>>> output side.
>>>>>>>>>>>>>>>>>>>>>>>>> If you look at the Simple History in the UI, or at
>>>>>>>>>>>>>>>>>>>>>>>>> the manifoldcf.log file, you should be able to get a
>>>>>>>>>>>>>>>>>>>>>>>>> better sense of what went wrong. Without further
>>>>>>>>>>>>>>>>>>>>>>>>> information, I can't say any more.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I'm Othman Belhaj, a software engineer from
>>>>>>>>>>>>>>>>>>>>>>>>>> Société Générale in France. I'm currently using your
>>>>>>>>>>>>>>>>>>>>>>>>>> recent version, ManifoldCF 2.8. I'm working on an
>>>>>>>>>>>>>>>>>>>>>>>>>> internal search engine. For this reason, I'm using
>>>>>>>>>>>>>>>>>>>>>>>>>> ManifoldCF to index documents on windows shares. I
>>>>>>>>>>>>>>>>>>>>>>>>>> encountered a serious problem while crawling 35K
>>>>>>>>>>>>>>>>>>>>>>>>>> documents. Most of the time, when ManifoldCF starts
>>>>>>>>>>>>>>>>>>>>>>>>>> crawling a large document (19 MB, for example), it
>>>>>>>>>>>>>>>>>>>>>>>>>> ends the job with the following error: repeated
>>>>>>>>>>>>>>>>>>>>>>>>>> service interruptions - failure processing document :
>>>>>>>>>>>>>>>>>>>>>>>>>> software caused connection abort: socket write error.
>>>>>>>>>>>>>>>>>>>>>>>>>> Can you give me some tips on how to solve this
>>>>>>>>>>>>>>>>>>>>>>>>>> problem, please?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I use PostgreSQL 9.3.x and elasticsearch 2.1.0.
>>>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your response.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Othman BELHAJ
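
For anyone repeating the jar-move workaround from this thread, the file shuffling can be scripted. Below is a minimal Python sketch, not an official ManifoldCF tool. The directory names connector-common-lib and lib and the jar names come from the messages above; the install root, the function name move_jars and the dry_run flag are assumptions made for illustration. The list reflects where the thread left off (poi-3.15.jar and commons-collections4-4.1.jar stay where they were), and the thread does not confirm that this list resolves the problem.

```python
import shutil
import tempfile
from pathlib import Path

# Jar names taken from the thread. poi-3.15.jar and
# commons-collections4-4.1.jar are deliberately absent, since the
# thread says to move those back.
JARS_TO_MOVE = [
    "xmlbeans-2.6.0.jar",
    "poi-ooxml-schemas-3.15.jar",
    "poi-ooxml-3.15.jar",
    "dom4j-1.6.1.jar",
    "curvesapi-1.04.jar",
]


def move_jars(mcf_root, jar_names=JARS_TO_MOVE, dry_run=True):
    """Move jars from connector-common-lib to lib under mcf_root.

    mcf_root is a hypothetical install directory supplied by the caller.
    Returns (moved, missing): names that were (or, in a dry run, would be)
    moved, and names not found in connector-common-lib.
    """
    src_dir = Path(mcf_root) / "connector-common-lib"
    dst_dir = Path(mcf_root) / "lib"
    moved, missing = [], []
    for name in jar_names:
        src = src_dir / name
        if not src.is_file():
            missing.append(name)
            continue
        if not dry_run:
            dst_dir.mkdir(exist_ok=True)
            shutil.move(str(src), str(dst_dir / name))
        moved.append(name)
    return moved, missing


if __name__ == "__main__":
    # Demonstration against a throwaway directory tree, not a real install.
    with tempfile.TemporaryDirectory() as root:
        common = Path(root) / "connector-common-lib"
        common.mkdir()
        (common / "xmlbeans-2.6.0.jar").touch()
        moved, missing = move_jars(root, dry_run=False)
        print("moved:", moved)
        print("not found:", missing)
```

As in step (1) of the workaround, all MCF processes should be shut down before moving anything. The script does not touch the options.env files, which the thread says may also need editing so that the moved jars are on the startup classpath.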
