It looks like a dependency of Apache POI is missing. I think we will need a ticket to address this, if you are indeed using the binary distribution.
Thanks!
Karl

On Thu, Aug 31, 2017 at 6:57 AM, Beelz Ryuzaki <[email protected]> wrote:

I'm actually using the binary version. For security reasons, I can't send any files from my computer. I have copied the stack trace and scanned it with my cellphone; I hope it will be helpful. Meanwhile, I have read the documentation about how to restrict the crawling, and I don't think the '|' works as specified. For instance, I would like to restrict the crawl to documents that contain the word 'son' ('sound'). I proceed as follows: *(SON)*. The document name is in capital letters, and I noticed that the filter didn't take it into consideration.

Thanks,
Othman

On Thu, 31 Aug 2017 at 12:40, Karl Wright <[email protected]> wrote:

Hi Othman,

The way you restrict documents with the Windows Share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?

Karl

On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:

Hello Karl,

Thank you for your response. I will start using ZooKeeper and will let you know if it works. I have another question to ask: I need to apply some filters while crawling, since I don't want to crawl certain files and folders. Could you give me an example of how to use the regex? Does the regex allow /i to ignore case?

Thanks,
Othman

On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:

Hi Beelz,

File-based sync is deprecated because people often have problems getting file permissions right, and they do not understand how to shut processes down cleanly; ZooKeeper is resilient against that. I highly recommend using ZooKeeper sync.
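[On the /i question above: ManifoldCF is Java-based, so assuming the Paths-tab regexes are evaluated with Java's `java.util.regex`, there is no Perl-style trailing `/i`; case-insensitivity is expressed with the inline `(?i)` flag or the `Pattern.CASE_INSENSITIVE` compile flag instead. A minimal sketch — the file name below is hypothetical:]

```java
import java.util.regex.Pattern;

public class CaseInsensitiveMatch {
    public static void main(String[] args) {
        // Inline flag form: (?i) makes the rest of the pattern case-insensitive.
        Pattern inline = Pattern.compile("(?i)son");
        // Equivalent programmatic form using a compile flag.
        Pattern flagged = Pattern.compile("son", Pattern.CASE_INSENSITIVE);

        String fileName = "SONG_REPORT.docx"; // hypothetical all-caps document name

        System.out.println(inline.matcher(fileName).find());  // true
        System.out.println(flagged.matcher(fileName).find()); // true
        // Without the flag, 'son' does not match the upper-case name:
        System.out.println(Pattern.compile("son").matcher(fileName).find()); // false
    }
}
```

[So a specification like `(?i)(SON)` would match regardless of case, where a bare `(SON)` would not.]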
ManifoldCF is engineered to not put files into memory, so you do not need huge amounts of memory. The default values are more than enough for 35,000 files, which is a pretty small job for ManifoldCF.

Thanks,
Karl

On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:

I'm actually not using ZooKeeper. I want to know how ZooKeeper differs from file-based sync. I also need guidance on how to manage my PC's memory: how many GB should I allocate for the start-agents process of ManifoldCF? Is 4 GB enough to crawl 35K files?

Othman.

On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:

Your disk is not writable for some reason, and that's interfering with ManifoldCF 2.8 locking.

I would suggest two things:

(1) Use ZooKeeper for sync instead of file-based sync.
(2) See whether you still get failures after that.

Thanks,
Karl

On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:

Hi Mr Karl,

Thank you for your quick response. I have looked into the ManifoldCF log file and extracted the following warnings:

- Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.

- Couldn't write to lock file; disk may be full. Shutting down process; locks may be left dangling. You must cleanup before restarting.

'ES (lowercase) synapses' is the Elasticsearch output connection. Moreover, the job uses Tika to extract metadata and a file system as a repository connection.
During the job, I don't extract the content of the documents. I was wondering if the issue comes from Elasticsearch?

Othman.

On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:

Hi Othman,

ManifoldCF aborts a job if there's an error that looks like it might go away on retry, but does not. It can be either on the repository side or on the output side. If you look at the Simple History in the UI, or at the manifoldcf.log file, you should be able to get a better sense of what went wrong. Without further information, I can't say any more.

Thanks,
Karl

On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:

Hello,

I'm Othman Belhaj, a software engineer at Société Générale in France. I'm currently using your recent version, ManifoldCF 2.8, to build an internal search engine; for this reason, I'm using ManifoldCF to index documents on Windows shares. I encountered a serious problem while crawling 35K documents. Most of the time, when ManifoldCF starts crawling a large document (19 MB, for example), it ends the job with the following error: repeated service interruptions - failure processing document: software caused connection abort: socket write error. Can you give me some tips on how to solve this problem, please?

I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
I'm looking forward to your response.

Best regards,

Othman BELHAJ
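[Regarding the "Access is denied" lock warnings earlier in the thread: one quick way to confirm Karl's diagnosis that the synch area is not writable, before restarting the agents, is to probe the directory with a throwaway file. A minimal sketch — the path below is hypothetical; substitute the synch directory configured in your properties.xml:]

```java
import java.io.File;
import java.io.IOException;

public class SynchAreaProbe {
    public static void main(String[] args) {
        // Hypothetical location; use the synch directory configured for
        // your multiprocess-file-example instance.
        File synchArea = new File("D:\\apache_manifoldcf-2.8\\multiprocess-file-example\\synch area");
        File probe = new File(synchArea, "probe.lock");
        try {
            if (probe.createNewFile()) {
                System.out.println("Synch area is writable.");
            } else {
                System.out.println("Probe file already exists; stale locks may be present.");
            }
        } catch (IOException e) {
            // An "Access is denied" message here matches the warning
            // seen in manifoldcf.log, pointing at permissions rather
            // than at Elasticsearch.
            System.out.println("Cannot write to synch area: " + e.getMessage());
        } finally {
            probe.delete();
        }
    }
}
```

[If the probe fails, fixing the directory's ACLs (or moving to ZooKeeper sync, as recommended above) is the path forward; the error is local, not on the Elasticsearch side.]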
