Hi Othman,

The way you restrict documents with the Windows share connector is by specifying information on the "Paths" tab in jobs that crawl Windows shares. There is end-user documentation, both online and distributed with all binary distributions, that describes how to do this. Have you found it?
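(A note on the regex question quoted below: ManifoldCF is a Java application, so its regular expressions follow java.util.regex semantics. Java has no Perl-style /i suffix; case-insensitive matching uses the inline (?i) flag or Pattern.CASE_INSENSITIVE. A minimal sketch; the .tmp exclusion pattern and the paths are made-up examples, not connector defaults:)

```java
import java.util.regex.Pattern;

public class RegexFilterDemo {
    public static void main(String[] args) {
        // Java has no /i modifier; the inline (?i) flag makes the whole
        // pattern case-insensitive.
        Pattern exclude = Pattern.compile("(?i).*\\.tmp$");

        // Hypothetical share paths, purely for illustration.
        System.out.println(exclude.matcher("/share/docs/report.TMP").matches()); // true
        System.out.println(exclude.matcher("/share/docs/report.pdf").matches()); // false

        // Equivalent form, using the flags argument instead of (?i):
        Pattern excludeFlag = Pattern.compile(".*\\.tmp$", Pattern.CASE_INSENSITIVE);
        System.out.println(excludeFlag.matcher("backup.Tmp").matches()); // true
    }
}
```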
Karl

On Thu, Aug 31, 2017 at 5:25 AM, Beelz Ryuzaki <[email protected]> wrote:

> Hello Karl,
>
> Thank you for your response. I will start using ZooKeeper and will let
> you know if it works. I have another question: I need to apply some
> filters while crawling, because I don't want to crawl certain files and
> folders. Could you give me an example of how to use the regex? Does the
> regex allow /i to ignore case?
>
> Thanks,
> Othman
>
> On Wed, 30 Aug 2017 at 19:53, Karl Wright <[email protected]> wrote:
>
>> Hi Beelz,
>>
>> File-based sync is deprecated because people often have problems
>> getting file permissions right and do not understand how to shut
>> processes down cleanly; ZooKeeper is resilient against that. I highly
>> recommend using ZooKeeper sync.
>>
>> ManifoldCF is engineered not to load files into memory, so you do not
>> need huge amounts of memory. The default values are more than enough
>> for 35,000 files, which is a pretty small job for ManifoldCF.
>>
>> Thanks,
>> Karl
>>
>> On Wed, Aug 30, 2017 at 11:58 AM, Beelz Ryuzaki <[email protected]> wrote:
>>
>>> I'm actually not using ZooKeeper. I want to know how ZooKeeper is
>>> different from file-based sync. I also need guidance on how to manage
>>> my PC's memory. How much memory should I allocate for the start-agents
>>> process of ManifoldCF? Is 4 GB enough to crawl 35K files?
>>>
>>> Othman.
>>>
>>> On Wed, 30 Aug 2017 at 16:11, Karl Wright <[email protected]> wrote:
>>>
>>>> Your disk is not writable for some reason, and that's interfering
>>>> with ManifoldCF 2.8 locking.
>>>>
>>>> I would suggest two things:
>>>>
>>>> (1) Use ZooKeeper for sync instead of file-based sync.
>>>> (2) See whether you still get failures after that.
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>> On Wed, Aug 30, 2017 at 9:37 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>
>>>>> Hi Mr Karl,
>>>>>
>>>>> Thank you for your quick response.
>>>>> I have looked into the ManifoldCF log file and extracted the
>>>>> following warnings:
>>>>>
>>>>> - Attempt to set file lock 'D:\xxxx\apache_manifoldcf-2.8\multiprocess-file-example\.\.\synch area\569\352\lock-_POOLTARGET_OUTPUTCONNECTORPOOL_ES (Lowercase) Synapses.lock' failed: Access is denied.
>>>>>
>>>>> - Couldn't write to lock file; disk may be full. Shutting down
>>>>> process; locks may be left dangling. You must cleanup before
>>>>> restarting.
>>>>>
>>>>> "ES (Lowercase) Synapses" is the Elasticsearch output connection.
>>>>> Moreover, the job uses Tika to extract metadata and a file system
>>>>> repository connection. During the job, I don't extract the content of
>>>>> the documents. I was wondering if the issue comes from Elasticsearch?
>>>>>
>>>>> Othman.
>>>>>
>>>>> On Wed, 30 Aug 2017 at 14:08, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Othman,
>>>>>>
>>>>>> ManifoldCF aborts a job if there's an error that looks like it
>>>>>> might go away on retry, but does not. It can be either on the
>>>>>> repository side or on the output side. If you look at the Simple
>>>>>> History in the UI, or at the manifoldcf.log file, you should be able
>>>>>> to get a better sense of what went wrong. Without further
>>>>>> information, I can't say any more.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Aug 30, 2017 at 5:33 AM, Beelz Ryuzaki <[email protected]> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm Othman Belhaj, a software engineer at Société Générale in
>>>>>>> France. I'm currently using your recent version, ManifoldCF 2.8.
>>>>>>> I'm working on an internal search engine, and for this reason I'm
>>>>>>> using ManifoldCF to index documents on Windows shares. I
>>>>>>> encountered a serious problem while crawling 35K documents.
>>>>>>> Most of the time, when ManifoldCF starts crawling a large document
>>>>>>> (19 MB, for example), it ends the job with the following error:
>>>>>>> "repeated service interruptions - failure processing document:
>>>>>>> software caused connection abort: socket write error".
>>>>>>> Can you give me some tips on how to solve this problem, please?
>>>>>>>
>>>>>>> I use PostgreSQL 9.3.x and Elasticsearch 2.1.0.
>>>>>>> I'm looking forward to your response.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Othman BELHAJ
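(A side note on the memory discussion earlier in the thread: Karl's point that ManifoldCF does not load whole files into memory is the standard streaming pattern, where document bytes pass through a small fixed buffer so heap use stays constant regardless of document size. A generic Java sketch of that idea follows; it is not ManifoldCF's actual code, and the class name and buffer size are made up for illustration:)

```java
import java.io.*;

public class StreamCopy {
    // Copy an input stream to an output through a small fixed buffer.
    // Memory use stays constant no matter how large the document is.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Pretend this byte array is a large crawled document.
        byte[] data = new byte[1_000_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), sink);
        System.out.println(copied); // prints 1000000
    }
}
```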
