Re: Fw: Questions about the behaviour of a custom Manifoldcf repository connector

2017-01-13 Thread Karl Wright
Hi Vigi,

For a description of the internals of ManifoldCF, you will want to read
ManifoldCF In Action, available for free here:

https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs

To model difficulty talking to a repository, what you want to do is throw
a ServiceInterruption while seeding, instead of returning an empty list of
documents.  This will prevent the job from running at all.
ServiceInterruption exceptions can be thrown with various signals, e.g.
"retry" a certain number of times, "skip" the current document when
retries fail, "fail" the job when retries fail, etc.
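
As a rough illustration of throwing during seeding, here is a minimal Java
sketch. It is not taken from any real connector. To keep it compilable
without ManifoldCF on the classpath it declares a local stand-in
ServiceInterruption class; the constructor arguments (message, cause, retry
time, fail time, retry count, abort-on-fail flag) follow my understanding of
the real org.apache.manifoldcf.agents.interfaces.ServiceInterruption and
should be checked against the Javadoc for your version. The Repository
interface, the seed() helper and the two interval constants are hypothetical
names made up for the example.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class SeedingSketch {

    /**
     * Local stand-in so the sketch compiles on its own.
     * ASSUMPTION: the fields mirror the arguments of ManifoldCF's real
     * ServiceInterruption constructor; verify against the Javadoc.
     */
    static class ServiceInterruption extends Exception {
        final long retryTime;      // when to try again (ms since epoch)
        final long failTime;       // when to give up (ms since epoch)
        final int failRetryCount;  // max retries, -1 meaning "no count limit"
        final boolean abortOnFail; // true: fail the job; false: skip and continue

        ServiceInterruption(String message, Throwable cause, long retryTime,
                            long failTime, int failRetryCount, boolean abortOnFail) {
            super(message, cause);
            this.retryTime = retryTime;
            this.failTime = failTime;
            this.failRetryCount = failRetryCount;
            this.abortOnFail = abortOnFail;
        }
    }

    /** Hypothetical client for the internal document system. */
    interface Repository {
        List<String> listDocumentIds() throws IOException;
    }

    // Hypothetical policy: retry every 5 minutes, give up after 3 hours.
    static final long RETRY_INTERVAL_MS = 5 * 60 * 1000L;
    static final long GIVE_UP_AFTER_MS = 3 * 60 * 60 * 1000L;

    /** Returns the seed identifiers, or signals that the repository is unreachable. */
    static List<String> seed(Repository repository, long nowMs) throws ServiceInterruption {
        try {
            return repository.listDocumentIds();
        } catch (IOException e) {
            // Do not swallow the error and return an empty list: an empty seed
            // list is indistinguishable from a repository with no documents.
            throw new ServiceInterruption(
                "Repository unavailable: " + e.getMessage(), e,
                nowMs + RETRY_INTERVAL_MS,  // retry time
                nowMs + GIVE_UP_AFTER_MS,   // fail time
                -1,                         // no limit on the retry count
                true);                      // fail the job once retries run out
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        try {
            System.out.println(seed(() -> Arrays.asList("doc-1", "doc-2"), now));
            seed(() -> { throw new IOException("connection refused"); }, now);
        } catch (ServiceInterruption e) {
            System.out.println("Interrupted: " + e.getMessage()
                + ", retry in " + (e.retryTime - now) + " ms");
        }
    }
}
```

In a real connector the same pattern would sit inside the seeding method,
with the stand-in class replaced by the ManifoldCF one.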

As for why your connector doesn't complete a job run within a window, I
cannot tell you.  I would suggest trying to set up the file system
connector with the same configuration and see what happens with that.  You
should be able to figure out what the difference is between the two readily
enough if one succeeds and the other does not.

Thanks,
Karl


On Fri, Jan 13, 2017 at 3:58 AM, Virgiliu R  wrote:

> Hello,
>
>
> I wrote a custom ManifoldCF repository connector for an internal document
> system and it has some strange behaviours which I am not able to explain.
>
>
> 1. When I schedule the job to run on a specific day and at a specific
> time, the job runs, but after the shutdown it decides that it is still
> within the run window and restarts. This happens repeatedly; in the end
> the job runs 15 times or more. I checked the job history and there is no
> 'job end' event, but I can see all the 'job start' events that took place
> after the schedule window start time. Invoking the job manually works
> fine, i.e. it runs only once. Also, because I set a maximum run time of
> 300 minutes, the job ends up in a waiting state after the interval
> expires.
>
>
> Below you can find some of the logs of this particular job.
>
>
>  (Finisher thread) - Marked job 1470044524072 for shutdown
>  INFO 2017-01-13 03:53:46,848 (Job notification thread) - Found job 1470044524072 in need of notification
>  INFO 2017-01-13 03:53:51,349 (Job start thread) - Job '1470044524072' is within run window at 1484276031338 ms. (which starts at 148425840 ms. and goes for 1800 ms.)
>  INFO 2017-01-13 03:53:51,356 (Job start thread) - Signalled for job start for job 1470044524072
>  INFO 2017-01-13 03:53:55,479 (Startup thread) - Marked job 1470044524072 for startup
>
> Why does it have this behaviour and how can I correct it?
>
> 2. In the second scenario I had indexed some documents and I wanted to
> simulate our internal repository being unavailable. In the current
> implementation, if there are any errors while seeding the documents, I do
> not throw an exception but instead provide an empty list of documents to
> be seeded. What happens next is that ManifoldCF processes the already
> indexed documents, and in this case the connector throws
> ServiceInterruption exceptions, which after 3 unsuccessful retries make
> the job stop. However, the clean-up thread of ManifoldCF decides that all
> the documents need to be deleted from the index. I would like to
> keep/update the documents, not delete them; that is why I chose a
> connector model of ADD_CHANGE. There is only one place where I
> specifically invoke activities.deleteDocument, but this happens only when
> our document repository is available and the document cannot be found.
> This scenario is exceptional and will almost never happen in practice
> because the document repository never deletes the files.
>
> Why does the ManifoldCF clean-up thread mark the documents for deletion
> when my connector does not do so on purpose?
>
> Thank you,
> Vigi
>
>


Fw: Questions about the behaviour of a custom Manifoldcf repository connector

2017-01-13 Thread Virgiliu R
Hello,


I wrote a custom ManifoldCF repository connector for an internal document
system and it has some strange behaviours which I am not able to explain.


1. When I schedule the job to run on a specific day and at a specific time,
the job runs, but after the shutdown it decides that it is still within the
run window and restarts. This happens repeatedly; in the end the job runs 15
times or more. I checked the job history and there is no 'job end' event, but
I can see all the 'job start' events that took place after the schedule
window start time. Invoking the job manually works fine, i.e. it runs only
once. Also, because I set a maximum run time of 300 minutes, the job ends up
in a waiting state after the interval expires.


Below you can find some of the logs of this particular job.


 (Finisher thread) - Marked job 1470044524072 for shutdown
 INFO 2017-01-13 03:53:46,848 (Job notification thread) - Found job 1470044524072 in need of notification
 INFO 2017-01-13 03:53:51,349 (Job start thread) - Job '1470044524072' is within run window at 1484276031338 ms. (which starts at 148425840 ms. and goes for 1800 ms.)
 INFO 2017-01-13 03:53:51,356 (Job start thread) - Signalled for job start for job 1470044524072
 INFO 2017-01-13 03:53:55,479 (Startup thread) - Marked job 1470044524072 for startup

Why does it have this behaviour and how can I correct it?

2. In the second scenario I had indexed some documents and I wanted to
simulate our internal repository being unavailable. In the current
implementation, if there are any errors while seeding the documents, I do not
throw an exception but instead provide an empty list of documents to be
seeded. What happens next is that ManifoldCF processes the already indexed
documents, and in this case the connector throws ServiceInterruption
exceptions, which after 3 unsuccessful retries make the job stop. However,
the clean-up thread of ManifoldCF decides that all the documents need to be
deleted from the index. I would like to keep/update the documents, not delete
them; that is why I chose a connector model of ADD_CHANGE. There is only one
place where I specifically invoke activities.deleteDocument, but this happens
only when our document repository is available and the document cannot be
found. This scenario is exceptional and will almost never happen in practice
because the document repository never deletes the files.

Why does the ManifoldCF clean-up thread mark the documents for deletion when
my connector does not do so on purpose?

Thank you,
Vigi