Hi Maxence,

The docpriority values for these stuck documents show that they are "null":

    public static final double noDocPriorityValue = 1e9;
    public static final Double nullDocPriority = new Double(noDocPriorityValue + 1.0);

The document status is "G", which is STATUS_PENDINGPURGATORY, so the
documents are waiting to be queued, which will never happen while their
docpriority is set to nullDocPriority.

It isn't supposed to be possible for a document to wind up in this state.
Documents that are pending are always supposed to have a document priority
set. I will need to review the code to see how this could happen.

It is also possible that you're seeing a database bug. I presume that you
are running PostgreSQL?

Karl

On Mon, Jun 4, 2018 at 8:43 AM msaunier <[email protected]> wrote:

> Thanks for your answers.
>
> I have attached a screenshot of the interface and the CSV result to this
> email.
>
> Thanks,
> Maxence
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Monday, June 4, 2018 11:36
> *To:* [email protected]
> *Subject:* Re: Documents blocked sometimes without errors
>
> Oh, and it should be unnecessary to pause/resume jobs when you bring down
> ManifoldCF for database maintenance. Stop the agents service, and start it
> again, and you should pick up exactly where you left off.
>
> Karl
>
> On Mon, Jun 4, 2018 at 5:33 AM Karl Wright <[email protected]> wrote:
>
> Hi Maxence,
>
> Pausing and restarting a job causes all of its documents to have their
> docpriority field recalculated. It should not be necessary to do this
> in order to have the job complete, though.
>
> All documents that are queued have their docpriority set at the time they
> are added to the queue, but the docpriority they are given depends on how
> many documents in the same document bin have already been given
> docpriority values. This is done to make sure documents from all bins are
> given an equal chance of being crawled.
> But since documents are given a docpriority when queued, there may well
> have been plenty of other documents "in front" of them that are already
> queued and must be processed before they have any chance of getting
> crawled. So it is possible that documents from one job may appear to
> block documents from another, but this will eventually correct itself
> and those documents will be crawled.
>
> If you see *no* activity at all, however, then I wonder if somehow
> documents have been queued with a null docpriority. You can test this by
> looking at the Document Status report and verifying that there is no reason
> the documents should not be crawlable, and then looking in the database to
> see what they have for their docpriority field. Please let me know what
> you find.
>
> Thanks,
> Karl
>
> On Mon, Jun 4, 2018 at 4:20 AM msaunier <[email protected]> wrote:
>
> Hello Karl,
>
> Sometimes jobs are blocked by many documents, and I don’t know why because
> there are no errors. To unblock them, I pause and resume the job and it
> works again. This does not always happen, and it is never the same
> documents.
>
> We have a script that runs at 8:55 PM, and it is possibly the cause of
> this problem. We created this script to avoid errors, because the SCO
> servers are rebooted at 9:00 PM and ManifoldCF reports an error if those
> servers are stopped.
>
> Script explanation:
>
> 1. Call PAUSED for the current job at 8:55 PM
> 2. Stop ManifoldCF and wait
> 3. VACUUM FULL on Postgres
> 4. REINDEX on Postgres
> 5. (Wait until 9:05 PM)
> 6. Start ManifoldCF
> 7. Wait for ManifoldCF
> 8. Resume the job
>
> Do you have any idea how to resolve this problem? Is the REINDEX or the
> VACUUM FULL the cause?
>
> Thanks,
> Maxence
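
The stuck state described at the top of this thread (status "G" combined
with the nullDocPriority sentinel) can be expressed as a small check. What
follows is only a sketch in Java. The two constants are the ones quoted
above. The class name, the method names, the status-code constant, and the
choice to treat a database NULL like the sentinel are assumptions made for
illustration, not ManifoldCF's actual code.

```java
// Sketch only: the two priority constants are copied from the message above.
// Everything else (class, methods, status constant) is hypothetical.
final class DocPriorityCheck {
    // Sentinel values quoted in the thread.
    static final double noDocPriorityValue = 1e9;
    static final Double nullDocPriority = Double.valueOf(noDocPriorityValue + 1.0);

    // Status code the thread identifies as STATUS_PENDINGPURGATORY.
    static final String PENDING_PURGATORY_CODE = "G";

    /** True when a docpriority read from the database carries the "null" sentinel. */
    static boolean hasNullDocPriority(Double docPriority) {
        // Assumption: a database NULL is treated the same as the sentinel.
        if (docPriority == null) {
            return true;
        }
        return Double.compare(docPriority, nullDocPriority) == 0;
    }

    /** True when a row matches the stuck state: pending, but with no real priority. */
    static boolean isStuck(String status, Double docPriority) {
        return PENDING_PURGATORY_CODE.equals(status) && hasNullDocPriority(docPriority);
    }

    public static void main(String[] args) {
        // A pending document holding the sentinel value is reported as stuck.
        System.out.println(isStuck("G", 1000000001.0));
        // A pending document with an ordinary priority is not.
        System.out.println(isStuck("G", 0.5));
    }
}
```

Rows exported from the Document Status report or read from the database
could be passed through a check like isStuck to list the affected documents.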
