Hi All,

We are trying to crawl and index PowerPoint, MS Word, and Excel documents
that are linked from an .html seed URL, i.e. the seed page has
*ppt, msword, and excel* files as attachments.

Example: http://abc.com/solr-tika.html

I made the changes below to test pdf/ppt crawling. I went through the
existing parse-plugins.xml for reference, added the ppt, word, and excel
related entries to the same file, and tried a crawl.

Tika-parse ref: https://wiki.apache.org/nutch/Features
Mime type ref:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types

*Change 1:*
New entries added to parse-plugins.xml

<mimeType name="application/vnd.ms-powerpoint">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/vnd.openxmlformats-officedocument.presentationml.presentation">
  <plugin id="parse-tika" />
</mimeType>
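
For the Word and Excel types, the analogous entries would presumably follow
the same pattern. This is only a sketch: the MIME type names are the ones
listed under Change 2, and mapping them to parse-tika is an assumption
carried over from the PowerPoint entries above.

```xml
<!-- Sketch only: Word and Excel types mapped to parse-tika,
     assuming the same pattern as the PowerPoint entries -->
<mimeType name="application/msword">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/vnd.openxmlformats-officedocument.wordprocessingml.document">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/vnd.ms-excel">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
  <plugin id="parse-tika" />
</mimeType>
```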
                
*Change 2:*
Allowed/enabled the MIME types via mimetype-filter.txt

# allow only documents with a text/html mimetype
application/pdf
application/vnd.ms-powerpoint
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.ms-excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

*Change 3:*

Added the entry below to nutch-site.xml
Ref:
https://grokbase.com/t/nutch/user/09b5e59k3s/can-nutch-crawl-xls-and-xlsx-file

<property>
  <name>mime.types.file</name>
  <value>tika-mimetypes.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information. Overrides the default
  Tika config if specified.
  </description>
</property>

After making the above changes I ran a crawl, and it fails with the log
below. Could someone please review this and guide me on the next steps?


2018-09-10 18:27:54,977 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-09-10 18:27:55,162 INFO  util.MimeUtil - Using custom mime.types.file: tika-mimetypes.xml
2018-09-10 18:27:55,164 ERROR util.MimeUtil - Can't load mime.types.file : tika-mimetypes.xml using Tika's default
2018-09-10 18:27:56,553 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: content dest: content
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: title dest: title
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: host dest: host
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: segment dest: segment
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: boost dest: boost
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: digest dest: digest
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2018-09-10 18:27:56,739 INFO  solr.SolrIndexWriter - Indexing 1/1 documents
2018-09-10 18:27:56,739 INFO  solr.SolrIndexWriter - Deleting 0 documents
2018-09-10 18:27:57,107 INFO  solr.SolrIndexWriter - Indexing 1/1 documents
2018-09-10 18:27:57,107 INFO  solr.SolrIndexWriter - Deleting 0 documents
2018-09-10 18:27:57,128 WARN  mapred.LocalJobRunner - job_local1216759318_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body>
HTTP ERROR 404

<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
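
One thing that stands out in the trace: the 404 is for /solr/update, and the
URL http://127.0.0.1:8983/solr contains no core or collection name. If that
is the cause, the Solr URL in nutch-site.xml may need to point at a core. A
minimal sketch, which I have not verified, assuming the indexer reads the
solr.server.url property and that a core already exists in Solr; the core
name "nutch" below is hypothetical:

```xml
<!-- Sketch only: "nutch" is a placeholder core name, not taken from my setup -->
<property>
  <name>solr.server.url</name>
  <value>http://127.0.0.1:8983/solr/nutch</value>
</property>
```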

Thanks,
Amarnath Polu



--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
