
Re: crawl and index ppt, msword, excel (.xls, .xlsx) in Apache Nutch 1.14

2018-09-11 Thread polu.amar

Hi Sebastian,

Thanks for the update. With the default settings, Nutch is still not
crawling or indexing Microsoft Office documents (ppt, Word, Excel, etc.).

We have already set the *http.content.limit* property to unlimited *(-1)*.

Do we need to change anything on the development side for Office documents
(the page is being developed on AEM 6.3), or is a change needed on the Solr
side?

Note: I passed the Solr URL properly as part of the crawl script (it seems
it was missing from the ticket):

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/tikaparsecollection -s urls/ crawl/ -1

Solr collection name: tikaparsecollection
seed.txt: http://abc.com/solr-tika.html

Kindly advise how we can handle this case in Nutch crawling.


Thanks,
Amarnath Polu



--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html

crawl and index ppt, msword, excel (.xls, .xlsx) in Apache Nutch 1.14

2018-09-10 Thread polu.amar

Hi All,

We are trying to crawl and index ppt, msword, and excel MIME-type documents
that are linked from a seed URL. The seed URL is an .html page that has
*ppt, msword, excel* files as attachments.

ex: http://abc.com/solr-tika.html

I made the changes below to test pdf/ppt crawling. I went through the
existing parse-plugins.xml for reference and added ppt, word, and excel
related entries to the same file.

Tika-parse ref: https://wiki.apache.org/nutch/Features
Mime type ref:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types

*Change 1:*
New fields added in parse-plugins.xml

[parse-plugins.xml entries missing from the archived message]

*Change 2:*
Enabled the following MIME types in mimetype-filter.txt:

# allow only documents with a text/html mimetype
application/pdf
application/vnd.ms-powerpoint
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.ms-excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
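
For reference, the entries listed above correspond to the usual Office and PDF file extensions. The Python sketch below only illustrates an allow-list lookup over those same types; it is not how the Nutch filter is implemented, and the names `OFFICE_MIME_TYPES`, `ALLOWED`, and `is_allowed` are made up for the example.

```python
# Extension -> MIME type, using only the types listed in mimetype-filter.txt above.
OFFICE_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Hypothetical allow-list built from the values above.
ALLOWED = set(OFFICE_MIME_TYPES.values())


def is_allowed(mime_type):
    """True if the MIME type (ignoring parameters such as charset) is on the list."""
    return mime_type.split(";")[0].strip().lower() in ALLOWED


if __name__ == "__main__":
    for candidate in ("application/msword", "text/html"):
        print(candidate, "allowed" if is_allowed(candidate) else "blocked")
```

Note that text/html is not on this list, even though the seed URL itself is an HTML page; whether that matters depends on how the filter is applied.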

*Change 3:*

Added the entry below to nutch-site.xml.
Ref:
https://grokbase.com/t/nutch/user/09b5e59k3s/can-nutch-crawl-xls-and-xlsx-file


<property>
  <name>mime.types.file</name>
  <value>tika-mimetypes.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information. Overrides the default
  Tika config if specified.
  </description>
</property>
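
The mime.types.file entry above names a file that has to be found on the CLASSPATH. A minimal Python sketch for checking that is shown here, assuming the relevant directories are known (for a local install this might be the conf/ directory, which is an assumption and not stated in this thread). The helper names `property_value` and `find_on_path` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def property_value(config_xml, name):
    """Return the <value> of the named <property> in a Hadoop-style config, or None."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def find_on_path(filename, directories):
    """Return the first directory that contains filename, or None if it is missing."""
    for directory in directories:
        if (Path(directory) / filename).is_file():
            return Path(directory)
    return None


if __name__ == "__main__":
    # "conf" is an assumed location; adjust to the directories actually on the classpath.
    config = Path("conf/nutch-site.xml")
    if config.is_file():
        wanted = property_value(config.read_text(), "mime.types.file")
        print("configured file:", wanted)
        if wanted:
            print("found in:", find_on_path(wanted, ["conf"]))
```

If the lookup returns None, the file named in the property is not in any of the directories that were checked.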


After making the above changes I ran the crawl, and it fails with the errors
below. Could someone kindly review and guide me on the next steps?


2018-09-10 18:27:54,977 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-09-10 18:27:55,162 INFO  util.MimeUtil - Using custom mime.types.file: tika-mimetypes.xml
*2018-09-10 18:27:55,164 ERROR util.MimeUtil - Can't load mime.types.file : tika-mimetypes.xml using Tika's default*
2018-09-10 18:27:56,553 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: content dest: content
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: title dest: title
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: host dest: host
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: segment dest: segment
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: boost dest: boost
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: digest dest: digest
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2018-09-10 18:27:56,739 INFO  solr.SolrIndexWriter - Indexing 1/1 documents
2018-09-10 18:27:56,739 INFO  solr.SolrIndexWriter - Deleting 0 documents
2018-09-10 18:27:57,107 INFO  solr.SolrIndexWriter - Indexing 1/1 documents
2018-09-10 18:27:57,107 INFO  solr.SolrIndexWriter - Deleting 0 documents
*2018-09-10 18:27:57,128 WARN  mapred.LocalJobRunner - job_local1216759318_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.

Error 404 Not Found

HTTP ERROR 404

Problem accessing /solr/update. Reason:
Not Found

at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.*
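
A note on the 404 in the log: the request went to /solr/update under http://127.0.0.1:8983/solr, that is, with no core or collection name in the path. One possible cause (an assumption, not confirmed by the log) is that the Solr URL used for this run did not include the collection. The Python sketch below only illustrates that check; the helper names `collection_from_url` and `update_url` are hypothetical and are not part of Nutch or SolrJ.

```python
from urllib.parse import urlparse


def collection_from_url(solr_url):
    """Return the core/collection segment that follows /solr, or None if absent."""
    segments = [s for s in urlparse(solr_url).path.split("/") if s]
    # Expect a path like /solr/<collection>; anything shorter has no collection.
    if len(segments) >= 2 and segments[0] == "solr":
        return segments[1]
    return None


def update_url(solr_url):
    """Build the update endpoint that documents would be posted to."""
    return solr_url.rstrip("/") + "/update"


if __name__ == "__main__":
    for url in (
        "http://127.0.0.1:8983/solr",
        "http://localhost:8983/solr/tikaparsecollection",
    ):
        print(url, "->", update_url(url), "| collection:", collection_from_url(url))
```

For the URL in the log this yields an update path of /solr/update and no collection, which matches the path reported in the 404.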

Thanks,
Amarnath Polu




Re: redirect bin/crawl log output to some other file

2018-09-10 Thread polu.amar

Hi Lewis,

Thanks for your valuable time. Yes, option one is fine for the time being.

Thanks,
Amarnath Polu



