Hi,

It looks as though the problem is on the Solr indexing side rather than the 
ManifoldCF crawling side.

What, if anything, do the Solr logs say about the problem?
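
If the logs are large, a small script can pull out just the error entries. The following is a minimal Python sketch, not a definitive tool. It assumes error entries contain the level word SEVERE (java.util.logging style) or ERROR (log4j style), and that stack-trace lines are indented "at ..." lines, "Caused by:" lines, or lines starting with an exception class name. The function name error_blocks is made up for illustration.

```python
import re
from typing import Iterable, List

# Assumption: error entries carry one of these level words.
LEVEL = re.compile(r"\b(SEVERE|ERROR)\b")
# Assumption: lines that continue a Java stack trace look like one of these.
TRACE = re.compile(
    r"^\s+at\s"                        # "    at org.foo.Bar.baz(Bar.java:42)"
    r"|^Caused by:"                    # chained exception
    r"|^\s+\.\.\. \d+ more"            # "    ... 12 more"
    r"|^[\w.$]+(Exception|Error)\b"    # "java.lang.OutOfMemoryError: ..."
)

def error_blocks(lines: Iterable[str]) -> List[List[str]]:
    """Group each SEVERE/ERROR line with the stack-trace lines that follow it."""
    blocks: List[List[str]] = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if LEVEL.search(line):
            # Start a new block at every error-level line.
            current = [line]
            blocks.append(current)
        elif current is not None and TRACE.search(line):
            # Still inside the stack trace of the current error.
            current.append(line)
        else:
            # Any other line ends the current block.
            current = None
    return blocks
```

To use it, open whichever file Tomcat writes Solr's output to (often catalina.out in Tomcat's logs directory, though that depends on the installation) and pass the file object to error_blocks. The entries logged around the time of the 500 response are the ones to look at first.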

Adrian

From: Ronny Heylen [mailto:[email protected]]
Sent: 29 October 2013 10:52
To: [email protected]
Subject: Error in Manifoldcf, what's the first step?

Hi,

We are running Solr 4.4 and ManifoldCF 1.3.

We are indexing a shared Windows network drive, filtering on *.doc*, *.xls*, 
*.pdf ... with about 650,000 files to index, giving a Solr index 35 GB in size.

The result is great, except that the ManifoldCF job crashes before the end.

Note that:
- ignoreTikaException is true in solrconfig.xml (otherwise the ManifoldCF job 
stops very early).
- Tomcat has been given 24 GB of memory (it uses 15 GB)
- there are 8 cores

Message in http://localhost:8080/mcf-crawler-ui/showjobstatus.jsp is:
Error: Repeated service interruptions - failure processing document: Server at 
http://localhost:8080/solr/collection1 returned non ok status:500, 
message:Internal Server Error

Then, instead of indexing the full drive in one job, we defined one job 
for each subfolder.
Almost all "subfolder" jobs end successfully; only 2 or 3 give the 
same message, and 2 or 3 others give a different message:

Error: Repeated service interruptions - failure processing document: Read timed out

If we try to go further (defining one job for each subfolder of a subfolder in 
error), the same happens: success for almost all subfolders except 1 or 2.
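
The manual narrowing described above (one job per subfolder, then one per sub-subfolder) is a form of bisection. As a rough illustration only, here is a minimal Python sketch of that idea. The names find_failing and batch_ok are hypothetical; batch_ok stands for whatever check reports that a batch of paths was indexed without error. The sketch assumes failures are deterministic and caused by individual documents, which may not hold for the "Read timed out" case.

```python
from typing import Callable, List, Sequence

def find_failing(
    paths: Sequence[str],
    batch_ok: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return the paths that fail on their own, by halving each failing batch."""
    # An empty batch, or a batch that indexes cleanly, contains no culprits.
    if not paths or batch_ok(paths):
        return []
    # A failing batch of one document is a culprit.
    if len(paths) == 1:
        return [paths[0]]
    # Otherwise split the failing batch and examine each half separately.
    mid = len(paths) // 2
    return find_failing(paths[:mid], batch_ok) + find_failing(paths[mid:], batch_ok)
```

With a handful of bad files among many, this needs far fewer indexing attempts than trying every file individually, and the resulting short list of documents can then be checked against the Solr logs.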
What is the first step toward solving this problem?
Thanks.