Hi All,
We are trying to crawl and index PowerPoint, Word, and Excel documents that
are linked from a seed URL. The seed URL is an .html page that has the
*ppt, msword, excel* files as attachments.
Example: http://abc.com/solr-tika.html
I made the changes below to test pdf/ppt crawling. I went through the
existing parse-plugins.xml for reference, added ppt, Word, and Excel entries
to the same file, and then tried a crawl.
Tika-parse ref: https://wiki.apache.org/nutch/Features
Mime type ref:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types
*Change 1:*
New fields added in parse-plugins.xml
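For illustration, entries of this kind in parse-plugins.xml generally take the shape sketched below. This is a sketch under assumptions, not necessarily the exact entries used here: it follows the mimeType/plugin layout of the stock file, routes each type to parse-tika, and assumes the parse-tika alias is already defined in the file's aliases section.

```xml
<!-- Sketch only: assumed entries, placed alongside the existing
     mimeType elements inside <parse-plugins>. -->
<mimeType name="application/vnd.ms-powerpoint">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/vnd.openxmlformats-officedocument.presentationml.presentation">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/msword">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/vnd.openxmlformats-officedocument.wordprocessingml.document">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/vnd.ms-excel">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
  <plugin id="parse-tika" />
</mimeType>
```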
*Change 2:*
Allowed the MIME types below via mimetype-filter.txt
# allow only documents with a text/html mimetype
application/pdf
application/vnd.ms-powerpoint
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.ms-excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
*Change 3:*
Added the entry below to nutch-site.xml
Ref:
https://grokbase.com/t/nutch/user/09b5e59k3s/can-nutch-crawl-xls-and-xlsx-file
<property>
  <name>mime.types.file</name>
  <value>tika-mimetypes.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information. Overrides the default
  Tika config if specified.
  </description>
</property>
After making the above changes I ran a crawl, and it fails with the output
below. Could someone please review this and guide me on the next steps?
2018-09-10 18:27:54,977 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-09-10 18:27:55,162 INFO util.MimeUtil - Using custom mime.types.file: tika-mimetypes.xml
*2018-09-10 18:27:55,164 ERROR util.MimeUtil - Can't load mime.types.file : tika-mimetypes.xml using Tika's default*
2018-09-10 18:27:56,553 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: content dest: content
2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: title dest: title
2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: host dest: host
2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: segment dest: segment
2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: boost dest: boost
2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: digest dest: digest
2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-09-10 18:27:56,739 INFO solr.SolrIndexWriter - Indexing 1/1 documents
2018-09-10 18:27:56,739 INFO solr.SolrIndexWriter - Deleting 0 documents
2018-09-10 18:27:57,107 INFO solr.SolrIndexWriter - Indexing 1/1 documents
2018-09-10 18:27:57,107 INFO solr.SolrIndexWriter - Deleting 0 documents
*2018-09-10 18:27:57,128 WARN mapred.LocalJobRunner - job_local1216759318_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
Error 404 Not Found
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.*
Thanks,
Amarnath Polu
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html