RE: indexing metatags with Nutch 1.12

Kris Musshorn Tue, 06 Sep 2016 14:26:16 -0700

Markus,

Here is the nutch-site.xml that was in place when Nutch threw the errors I 
posted previously.


-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Tuesday, September 6, 2016 3:02 PM
To: [email protected]
Subject: RE: indexing metatags with Nutch 1.12

Well, so we did add https to protocol-http's plugin.xml. Does your 
plugin.includes actually contain a protocol-* plugin?


 
 
-----Original message-----
> From:KRIS MUSSHORN <[email protected]>
> Sent: Tuesday 6th September 2016 20:39
> To: [email protected]
> Subject: Re: indexing metatags with Nutch 1.12
> 
> Markus, 
> I'm not sure how to answer your question.
> Here are two XML files for your consideration.
> 
> Kris
> 
> ----------- 
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Tuesday, September 6, 2016 2:30:39 PM
> Subject: RE: indexing metatags with Nutch 1.12
> 
> Well, this is certainly not an indexing metatags problem. You need to use 
> protocol-httpclient for https, or configure protocol-http's plugin.xml to 
> support https, the same way protocol-httpclient's plugin.xml does.
> 
> On the other hand, when we added support for https to protocol-http, did we 
> forget to add it to the plugin.xml?
> 
> 
> 
>  
>  
> -----Original message-----
> > From:KRIS MUSSHORN <[email protected]>
> > Sent: Tuesday 6th September 2016 19:29
> > To: [email protected]
> > Subject: indexing metatags with Nutch 1.12
> > 
> > https://wiki.apache.org/nutch/IndexMetatags
> > 
> > As soon as I switch to nutch-site_v2, Nutch throws protocol-missing errors 
> > during the crawl.
> > 
> > 2016-09-06 12:23:53,102 INFO  fetcher.Fetcher - -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=442, fetchQueues.getQueueCount=1
> > 2016-09-06 12:23:53,576 INFO  fetcher.FetcherThread - fetching https://snip/inside/events/events_summary/documents/Harford_Co_Sheriff_Special_Brief.pdf (queue crawl delay=500ms)
> > 2016-09-06 12:23:53,576 INFO  fetcher.FetcherThread - fetch of https://snip/inside/events/events_summary/documents/Harford_Co_Sheriff_Special_Brief.pdf failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
> >     at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:84)
> >     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:257)
> > 
> > How can I fix this?
> > 
> > Kris
> > 
>

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- start items to parse metatags ref https://wiki.apache.org/nutch/IndexMetatags -->	
	<property>
		<name>
			plugin.includes
		</name>
		<value>
			protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
		</value>
		<description>
			item needed to parse metatags out of html.
		</description>
	</property>
	
	<!-- use only if plugin enabled -->
	<property>
		<name>
			metatags.names
		</name>
		<value>
			description,keywords
		</value>
		<description>
			The metatagindexer uses the output of the parsing above to create two fields.
			Note the value is multivalued.
		</description>
	</property>
	<!-- configure the index-metadata plugin -->
	<property>
		<name>
			index.parse.md
		</name>
		<value>
			metatag.description,metatag.keywords
		</value>
	</property>
	<property>
		<name>
			index.metadata
		</name>
		<value>
			description,keywords
		</value>
	</property>
			
		
	
<!-- end items to parse metatags -->	

<!-- start metatag boosts -->	
	
	<property>
		<name>
			query.basic.description.boost
		</name>
		<value>
			2.0
		</value>
	</property>

	<property>
		<name>
			query.basic.keywords.boost
		</name>
		<value>
			2.0
		</value>
	</property>
			
<!-- end metatag boosts -->	
		
	<property>
		<name>
			http.agent.name
		</name>
		<value>
			ARLInside_spider
		</value>
		<description>
			The name of my spider instance
		</description>
	</property>

	<property>
		<name>
			db.max.outlinks.per.page
		</name>
		<value>
			-1
		</value>
		<description>
			allow unlimited outlinks
		</description>
	</property>
	
	<property>
		<name>
			http.content.limit
		</name>
		<value>
			32765
		</value>
		<description>
			The length limit for downloaded HTTP content, in bytes.
			If this value is non-negative, content longer than it will be truncated.
			32766 is the max limit for SOLR 6.1.0, so this is set to 1 byte less.
		</description>
	</property>

	<property>
		<name>
			fetcher.server.delay
		</name>
		<value>
			0.5
		</value>
		<description> 
			the time between fetch calls in seconds
		</description>
	</property>

</configuration>
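
For reference, the first option Markus mentions in the thread (using protocol-httpclient for https) could look roughly like the fragment below. This is only a sketch, not a confirmed fix: it assumes the protocol-httpclient plugin is present in the Nutch 1.12 plugins directory, and it simply swaps protocol-http for protocol-httpclient in the plugin.includes value from the file above, leaving every other plugin unchanged. The value is kept on a single line, since plugin.includes is treated as a regular expression and stray whitespace could end up in the pattern.

```xml
<!-- Sketch only: assumes protocol-httpclient is installed alongside the other plugins. -->
<property>
	<name>plugin.includes</name>
	<!-- Same list as above, with protocol-http replaced by protocol-httpclient. -->
	<value>protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
	<description>
		Plugins to load; protocol-httpclient is used here to fetch https URLs.
	</description>
</property>
```

This property would replace the existing plugin.includes entry inside the configuration element rather than being added next to it.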
