You need my latest patch: 17/Jun/13 16:34. This is for trunk (1.8) but also works on 1.7 and 1.6.
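For reference, here is a rough sketch of how the two options described in this message might look as properties in nutch-site.xml. The property names and values are the ones quoted in this message, and they only take effect once the patch is applied (assumed here to be the NUTCH-961 one discussed in this thread). The surrounding configuration/property layout is the usual Hadoop-style format Nutch uses for its config files. Pick one extractor value, not both.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Name taken from this message; assumed to switch on Boilerpipe
       extraction in the patched Tika parser -->
  <property>
    <name>tika.use_boilerpipe</name>
    <value>true</value>
  </property>
  <!-- ArticleExtractor for article-style pages, or CanolaExtractor for
       pages with many blocks (per this message) -->
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```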
Set the following options in your nutch-site.xml:

tika.use_boilerpipe=true
tika.boilerpipe.extractor=ArticleExtractor or CanolaExtractor

ArticleExtractor works best for, well, article-style pages. CanolaExtractor will do a better job at extracting pages with many blocks, such as a forum topic page. There is no mechanism to decide which extractor to use per page, so that is a problem in large crawls: you will get some or a lot of noise on many page types.

When patched, you will get the extracted text in Solr's content field.

Cheers

-----Original message-----
> From:Reyes, Mark <[email protected]>
> Sent: Thursday 14th November 2013 21:10
> To: [email protected]
> Subject: Re: Preserve HTML that is being crawled from Nutch?
>
> RE: https://issues.apache.org/jira/browse/NUTCH-961
>
> Are there usage instructions on how to do this?
>
> The JIRA ticket shows several attachments. Is there a specific attachment
> to download?
>
> Please keep in mind that I am running my Solr 4.5 instance and Nutch 1.7
> crawl ‘almost’ as described from their respective tutorials. All I am
> seeing from my zipped downloads from apache are .jar files.
>
> Thanks again,
> Mark
>
>
>
> On 11/14/13, 2:45 AM, "Markus Jelsma" <[email protected]> wrote:
>
> >By default title is indexed in the title field and using the headings
> >plugin the h1 and h2 etc are indexed as h1..h2 as well, optionally as
> >multi valued. Also by default is that all text is indexed into a content
> >field, including title and headings. You can try the NUTCH-961 issue for
> >actual content extraction or use something else.
> >
> >-----Original message-----
> >> From:Reyes, Mark <[email protected]>
> >> Sent: Wednesday 13th November 2013 17:50
> >> To: [email protected]
> >> Subject: Re: Preserve HTML that is being crawled from Nutch?
> >>
> >> 1. If my HTML page is:
> >>
> >> <html>
> >> <head>
> >> <title>The web site</title>
> >> </head>
> >>
> >> <body>
> >> <div>
> >> <h1>This is the web page of the web site.</h1>
> >> <h2>This is a subcategory title for the web page.</h2>
> >> <p>This is the copy for the web page.</p>
> >> </div>
> >> </body>
> >> </html>
> >>
> >> 2. Currently, the JSON that prints out from Solr is:
> >>
> >> {
> >> content: "This is the web page of the web site. This is a subcategory title for the web page. This is the copy for the web page."
> >> }
> >>
> >> 3. It's preferred to be more specific such as:
> >>
> >> {
> >> contentTitle: "<h1>This is the web page of the web site.</h1>",
> >> contentSubTitle: "<h2>This is a subcategory title for the web page.</h2>",
> >> contentBody: "<p>This is the copy for the web page.</p>"
> >> }
> >>
> >> 4. Optionally, the JSON could perhaps be like this:
> >>
> >> {
> >> contentTitle: "This is the web page of the web site.",
> >> contentSubTitle: "This is a subcategory title for the web page.",
> >> contentBody: "This is the copy for the web page."
> >> }
> >>
> >>
> >> I'm guessing this has to occur from augmenting the schema? On a
> >> side-note, this may go back to the inquiry I had earlier about the XPath.
> >>
> >>
> >> Thanks again,
> >> Mark
> >>
> >>
> >>
> >> On 11/13/13, 1:48 AM, "Markus Jelsma" <[email protected]> wrote:
> >>
> >> >I am not sure what you mean. The raw content, including the HTML, is
> >> >stored on disk by default. Each segment has a content directory
> >> >containing just that. But i don't know what you mean by markup as
> >> >indexed?
> >> >
> >> >-----Original message-----
> >> >> From:Reyes, Mark <[email protected]>
> >> >> Sent: Wednesday 13th November 2013 2:57
> >> >> To: [email protected]
> >> >> Subject: Preserve HTML that is being crawled from Nutch?
> >> >>
> >> >> Is there a way to preserve the HTML that is being crawled from Nutch 1.7?
> >> >>
> >> >> Specifically, instead of normalizing the information that is crawled
> >> >> into a long string value then assigning that to the ‘content’ key (if
> >> >> viewing in JSON), I’d like to see the markup itself as indexed.
> >> >>
> >> >> Thanks,
> >> >> Mark
> >> >>
> >> >>
> >> >> IMPORTANT NOTICE: This e-mail message is intended to be received only
> >> >> by persons entitled to receive the confidential information it may
> >> >> contain. E-mail messages sent from Bridgepoint Education may contain
> >> >> information that is confidential and may be legally privileged. Please
> >> >> do not read, copy, forward or store this message unless you are an
> >> >> intended recipient of it. If you received this transmission in error,
> >> >> please notify the sender by reply e-mail and delete the message and
> >> >> any attachments.

