Markus,

nutch-site.xml with:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Needed to parse metatags out of HTML.</description>
</property>



Throws the same errors. 
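
To double-check which plugins this plugin.includes value selects, here is a minimal Python sketch. It assumes Nutch applies the regex to the whole plugin id (a full match, not a substring search); the list of candidate ids is illustrative only and is not taken from my install.

```python
import re

# plugin.includes value copied from the <value> element above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)"
    "|index-(basic|anchor|metadata)|scoring-opic"
    "|urlnormalizer-(pass|regex|basic)"
)

# Illustrative plugin ids (assumption): a real check would list the
# directory names under the Nutch plugins folder instead.
CANDIDATE_IDS = [
    "protocol-http",
    "protocol-httpclient",
    "parse-html",
    "parse-metatags",
    "index-basic",
    "index-metadata",
    "index-more",
]


def included_plugins(pattern, plugin_ids):
    """Return the ids that match the pattern as a whole string."""
    compiled = re.compile(pattern)
    return [pid for pid in plugin_ids if compiled.fullmatch(pid)]


if __name__ == "__main__":
    for pid in included_plugins(PLUGIN_INCLUDES, CANDIDATE_IDS):
        print(pid)
```

Under that assumption the value selects protocol-httpclient but not protocol-http, so a protocol-* plugin is included. The sketch says nothing about which URL schemes that plugin actually registers.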




Kris 
----- Original Message -----

From: "Markus Jelsma" <[email protected]> 
To: [email protected] 
Sent: Tuesday, September 6, 2016 6:24:00 PM 
Subject: RE: indexing metatags with Nutch 1.12 

Hm, this is odd. You have protocol-http configured and it should work just like 
that. Change it to protocol-httpclient to confirm a problem. 
Protocol-httpclient supported https for a much longer time than protocol-http. 

If it works with httpclient, there is some weird problem never noticed before. 
M. 



-----Original message----- 
> From:Kris Musshorn <[email protected]> 
> Sent: Tuesday 6th September 2016 23:26 
> To: [email protected] 
> Subject: RE: indexing metatags with Nutch 1.12 
> 
> Markus,
> 
> Here is the nutch-site.xml in place when it throws errors that I posted 
> previously 
> 
> -----Original Message----- 
> From: Markus Jelsma [mailto:[email protected]] 
> Sent: Tuesday, September 6, 2016 3:02 PM 
> To: [email protected] 
> Subject: RE: indexing metatags with Nutch 1.12 
> 
> Well, so we did add https to protocol-http's plugin.xml. Does your 
> plugin.includes actually contain a protocol-* plugin? 
> 
> 
> 
> 
> -----Original message----- 
> > From:KRIS MUSSHORN <[email protected]> 
> > Sent: Tuesday 6th September 2016 20:39 
> > To: [email protected] 
> > Subject: Re: indexing metatags with Nutch 1.12 
> > 
> > Markus, 
> > I'm not sure how to answer your question. 
> > here are 2 xml files for your consideration. 
> > 
> > Kris 
> > 
> > ----------- 
> > From: "Markus Jelsma" <[email protected]> 
> > To: [email protected] 
> > Sent: Tuesday, September 6, 2016 2:30:39 PM 
> > Subject: RE: indexing metatags with Nutch 1.12 
> > 
> > Well, this is certainly not an indexing metatags problem. You need to use 
> > protocol-httpclient for https, or configure protocol-http's plugin.xml to 
> > support https. That's identical to protocol-httpclient's plugin.xml. 
> > 
> > On the other hand, when we added support for https to protocol-http, did we 
> > forget to add it to the plugin.xml? 
> > 
> > 
> > 
> > 
> > 
> > -----Original message----- 
> > > From:KRIS MUSSHORN <[email protected]> 
> > > Sent: Tuesday 6th September 2016 19:29 
> > > To: [email protected] 
> > > Subject: indexing metatags with Nutch 1.12 
> > > 
> > > https://wiki.apache.org/nutch/IndexMetatags 
> > > 
> > > As soon as I switch to nutch-site_v2, Nutch throws protocol missing errors
> > > during the crawl.
> > > 
> > > 2016-09-06 12:23:53,102 INFO fetcher.Fetcher - -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=442, fetchQueues.getQueueCount=1
> > > 2016-09-06 12:23:53,576 INFO fetcher.FetcherThread - fetching https://snip/inside/events/events_summary/documents/Harford_Co_Sheriff_Special_Brief.pdf (queue crawl delay=500ms)
> > > 2016-09-06 12:23:53,576 INFO fetcher.FetcherThread - fetch of https://snip/inside/events/events_summary/documents/Harford_Co_Sheriff_Special_Brief.pdf failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
> > > at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:84)
> > > at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:257)
> > > 
> > > How can I fix this?
> > > 
> > > Kris 
> > > 
> > 
> 
