Re: crawl and index ppt,msword,excel(xls,.xlsx) in apache nutch 1.14

2018-09-10 Thread polu.amar
Hi Sebastian, thanks for the update. With the default settings, Nutch is not crawling or indexing Microsoft Office documents (ppt, Word, Excel, etc.). We have already set the *http.content.limit* property to unlimited (*-1*). Do we need to make any changes in development (AEM 6.3 is technol

Re: crawl and index ppt,msword,excel(xls,.xlsx) in apache nutch 1.14

2018-09-10 Thread Sebastian Nagel
Hi, crawling and indexing Office documents should work out of the box without any configuration changes; the parse-tika plugin is enabled by default in recent Nutch versions. The only recommended change is to increase the content limit: http.content.limit 65536 The length limit for downlo
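
For reference, a minimal sketch of the override discussed in this thread. It assumes the property is placed in conf/nutch-site.xml, the usual file for local overrides of nutch-default.xml; the file location is an assumption and is not stated in the messages. The value -1 matches the "unlimited" setting reported in the follow-up, and whether it alone resolves the problem is not established here.

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- The default quoted in the thread is 65536 bytes;
         a negative value means downloaded content is not truncated. -->
    <value>-1</value>
  </property>
</configuration>
```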

crawl and index ppt,msword,excel(xls,.xlsx) in apache nutch 1.14

2018-09-10 Thread polu.amar
Hi all, we are trying to crawl and index ppt, msword, and excel MIME-type documents reachable from a seed URL that is an .html page; that is, a seed URL that has *ppt,msword,ppt* files as attachments. Example: http://abc.com/solr-tika.html I have added the changes below to check pdf/ppt crawling. I have gone through th