Re: Documents blocked sometimes without errors

Karl Wright Mon, 04 Jun 2018 09:36:39 -0700

Hi Maxence,

The docpriority values for these stuck documents show that they are "null":


  public static final double noDocPriorityValue = 1e9;
  public static final Double nullDocPriority = new Double(noDocPriorityValue + 1.0);

The document status is "G", which is STATUS_PENDINGPURGATORY, so the
documents are awaiting being queued, which they will never be with a
docpriority set to nullDocPriority.
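
To make the condition concrete, here is a minimal sketch of the check described above. It is not code from ManifoldCF: the class StuckDocumentCheck and the method isStuck are hypothetical names chosen for illustration, and the sketch assumes the status and docpriority columns have already been read from the job queue table. Only the two constant values and the "G" status code come from the text above.

```java
// Hypothetical helper, not part of ManifoldCF.
final class StuckDocumentCheck {
    // Constant values copied from the snippet quoted above.
    static final double NO_DOC_PRIORITY_VALUE = 1e9;
    static final double NULL_DOC_PRIORITY = NO_DOC_PRIORITY_VALUE + 1.0;

    // Status code "G" is STATUS_PENDINGPURGATORY, as described above.
    static final String STATUS_PENDINGPURGATORY = "G";

    // Returns true when a row is waiting to be queued but carries the
    // "null" priority sentinel, so it would never be picked up.
    // Assumption: a Java null (SQL NULL) is not treated as the sentinel.
    static boolean isStuck(String status, Double docPriority) {
        return STATUS_PENDINGPURGATORY.equals(status)
            && docPriority != null
            && docPriority.doubleValue() == NULL_DOC_PRIORITY;
    }

    public static void main(String[] args) {
        System.out.println(isStuck("G", 1.000000001E9)); // prints "true"
        System.out.println(isStuck("G", 42.5));          // prints "false"
    }
}
```

The exact comparison is safe here only because 1e9 + 1.0 is exactly representable as a double; a tolerance would be needed if the stored value could be rounded by the database.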

It isn't supposed to be possible for a document to wind up in this state.
Documents that are pending are always supposed to set a document priority.
I will need to review the code to see how this could happen.

It is also possible that you're seeing a database bug.  I presume that you
are running PostgreSQL?

Karl


On Mon, Jun 4, 2018 at 8:43 AM msaunier <[email protected]> wrote:

> Thanks for your answers.
>
>
>
> I have attached a screenshot of the interface and the CSV result to this email.
>
>
>
> Thanks,
>
> Maxence
>
>
>
>
>
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Monday, June 4, 2018 11:36
> *To:* [email protected]
> *Subject:* Re: Documents blocked sometimes without errors
>
>
>
> Oh, and it should be unnecessary to pause/resume jobs when you bring down
> ManifoldCF for database maintenance.  Stop the agents service, and start it
> again, and you should pick up exactly where you left off.
>
>
>
> Karl
>
>
>
>
>
> On Mon, Jun 4, 2018 at 5:33 AM Karl Wright <[email protected]> wrote:
>
> Hi Maxence,
>
>
>
> Pausing and restarting a job causes the docpriority field of all of its
> documents to be recalculated.  It should not be necessary to do this in
> order to have the job complete, though.
>
>
>
> All documents that are queued have their docpriority set at the time they
> are added to the queue, but the docpriority they are given depends on how
> many documents in the same document bin have already been given
> docpriority values.  This is done to make sure documents from all bins are
> given an equal chance of being crawled.  But since documents are given a
> docpriority when queued, there may well have been plenty of other documents
> "in front" of them that are already queued and must be processed before
> they have any chance of getting crawled.  So it is possible that documents
> from one job may appear to block documents from another -- but this will
> eventually correct itself and those documents will be crawled.
>
> If you see *no* activity at all, however, then I wonder if somehow
> documents have been queued with a null docpriority.  You can test this by
> looking at the Document Status report and verifying that there is no reason
> the documents should not be crawlable, and then looking in the database to
> see what they have for their docpriority field.  Please let me know what
> you find.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Mon, Jun 4, 2018 at 4:20 AM msaunier <[email protected]> wrote:
>
> Hello Karl,
>
>
>
> Sometimes jobs are blocked on many documents and I don’t know why, because
> there are no errors. To unblock them, I pause and resume the job, and it
> works again. This does not always happen, and it is never the same documents.
>
>
>
> We have a script that runs at 8:55 PM, and it is possibly the cause of this
> problem. We created this script to avoid errors, because the SCO servers are
> rebooted at 9:00 PM and ManifoldCF reports an error if those servers are stopped.
>
>
>
> Script explanation:
>
>
>
> 1.       Pause the current job at 8:55 PM
>
> 2.       Stop ManifoldCF and wait for it to shut down
>
> 3.       Run VACUUM FULL on Postgres
>
> 4.       Run REINDEX on Postgres
>
> 5.       (Wait until 9:05 PM)
>
> 6.       Start ManifoldCF
>
> 7.       Wait for ManifoldCF to start
>
> 8.       Resume the job
>
>
>
> Do you have any idea how to resolve this problem? Is the REINDEX or the
> VACUUM FULL the cause?
>
>
>
> Thanks,
>
> Maxence
>
>
>
>
