Hi,
Crawling and indexing Office documents should work out of the box without any
configuration changes: the parse-tika plugin is enabled by default in recent
Nutch versions. The only recommended change is to increase the content limit:
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
Office documents tend to be larger than 64 kB and usually fail to parse
if truncated.
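As a rough sketch, the override could be placed inside the <configuration>
element of conf/nutch-site.xml like this. The 10 MB value is only an
illustrative assumption, not a recommended figure; going by the property
description, a negative value such as -1 should disable truncation entirely.

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- 10485760 bytes = 10 MB; illustrative value, choose one that fits your documents -->
  <value>10485760</value>
</property>
```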
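The Solr URL also seems to be wrong: the 404 for /solr/update in your log
shows that the request went to http://127.0.0.1:8983/solr without a core.
The URL must include the name of the "core", e.g.,
http://localhost:8983/solr/nutch
See https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search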
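A minimal sketch of the corrected setting, assuming your Nutch 1.x release
still reads the Solr URL from the solr.server.url property in
conf/nutch-site.xml (please verify this for your version, since newer releases
configure the index writer in a separate file) and assuming the core is named
"nutch" as in the example URL:

```xml
<!-- conf/nutch-site.xml: Solr URL including the core name -->
<property>
  <name>solr.server.url</name>
  <!-- "nutch" is the assumed core name; replace it with the name of your core -->
  <value>http://localhost:8983/solr/nutch</value>
</property>
```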
Best,
Sebastian
On 09/10/2018 04:32 PM, polu.amar wrote:
> Hi All,
>
> We are trying to crawl and index PowerPoint, Word, and Excel documents that
> are linked from an .html seed URL, i.e., the seed page has
> *ppt, msword, excel* files as attachments.
>
> ex: http://abc.com/solr-tika.html
>
> I made the changes below to test pdf/ppt crawling. I went through the
> existing parse-plugins.xml for reference, added ppt, word, and excel related
> entries to the same file, and tried a crawl.
>
> Tika-parse ref: https://wiki.apache.org/nutch/Features
> Mime type ref:
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types
>
> *Change 1:*
> New fields added in parse-plugins.xml
>
> *
>
>
>
> name="application/vnd.openxmlformats-officedocument.presentationml.presentation">
>
> *
>
> /Change 2:/
> Allowed/enabled mime type via mimetype-filter.txt
>
> # allow only documents with a text/html mimetype
> application/pdf
> application/vnd.ms-powerpoint
> application/vnd.openxmlformats-officedocument.presentationml.presentation
> application/msword
> application/vnd.openxmlformats-officedocument.wordprocessingml.document
> application/vnd.ms-excel
> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>
> /Change 3:/
>
> Added below entry in nutch-site.xml
> Ref:
> https://grokbase.com/t/nutch/user/09b5e59k3s/can-nutch-crawl-xls-and-xlsx-file
>
>
> <property>
>   <name>mime.types.file</name>
>   <value>tika-mimetypes.xml</value>
>   <description>Name of file in CLASSPATH containing filename extension and
>   magic sequence to mime types mapping information. Overrides the default
>   Tika config if specified.
>   </description>
> </property>
>
>
>
> After making the above changes I ran a crawl, which fails with the errors
> below. Could someone kindly review and guide me on the next steps?
>
>
> 2018-09-10 18:27:54,977 INFO anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2018-09-10 18:27:55,162 INFO util.MimeUtil - Using custom mime.types.file:
> tika-mimetypes.xml
> *2018-09-10 18:27:55,164 ERROR util.MimeUtil - Can't load mime.types.file :
> tika-mimetypes.xml using Tika's default*
> 2018-09-10 18:27:56,553 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: content dest:
> content
> 2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: title dest:
> title
> 2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: host dest:
> host
> 2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: segment dest:
> segment
> 2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: boost dest:
> boost
> 2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: digest dest:
> digest
> 2018-09-10 18:27:56,719 INFO solr.SolrMappingReader - source: tstamp dest:
> tstamp
> 2018-09-10 18:27:56,739 INFO solr.SolrIndexWriter - Indexing 1/1 documents
> 2018-09-10 18:27:56,739 INFO solr.SolrIndexWriter - Deleting 0 documents
> 2018-09-10 18:27:57,107 INFO solr.SolrIndexWriter - Indexing 1/1 documents
> 2018-09-10 18:27:57,107 INFO solr.SolrIndexWriter - Deleting 0 documents
> *2018-09-10 18:27:57,128 WARN mapred.LocalJobRunner -
> job_local1216759318_0001
> java.lang.Exception:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at http://127.0.0.1:8983/solr: Expected mime type
> application/octet-stream but got text/html.
>
>
> Error 404 Not Found
>
>
> HTTP ERROR 404
>
> Problem accessing /solr/update. Reason:
> Not Found
>
>
>
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at http://127.0.0.1:8983/solr: Expected mime type
> application/octet-stream but got text/html. *
>
> Thanks,
> Amarnath Polu