That is the latest patch I referred to. Download it, then get yourself a copy of the 1.7 sources or do an svn export of trunk. Put the patch in the root folder of the sources and apply it with:

    patch -p0 < file.patch
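A minimal shell sketch of the patch step above, under stated assumptions: the helper names `run` and `apply_patch` and the `DRY_RUN` switch are made up for illustration and are not part of Nutch, and `file.patch` stands for whichever attachment you downloaded. It calls `patch -p0 -i file.patch`, which reads the same input as `patch -p0 < file.patch`.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of Nutch.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# apply_patch SRC_DIR PATCH_NAME
#   SRC_DIR:    root folder of the Nutch sources (1.7 sources or trunk export)
#   PATCH_NAME: name of the patch file already placed in SRC_DIR
apply_patch() {
    (
        # Work from the source root; the subshell leaves the caller's cwd alone.
        cd "$1" || exit 1
        # -p0 uses the paths in the patch unchanged, relative to this folder.
        run patch -p0 -i "$2"
    )
}

# Example (placeholder path): apply_patch path/to/nutch-sources file.patch
```

With DRY_RUN=1 the function only prints the command it would run, which is a cheap way to check the folder and file name before touching the sources.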
Build with $ ant and you've got yourself an extracting Nutch in runtime/local/. For more info on patching etc., check the wiki or the many other mailing list topics.

Cheers

-----Original message-----
> From:Reyes, Mark <[email protected]>
> Sent: Thursday 14th November 2013 22:28
> To: [email protected]
> Subject: Re: Preserve HTML that is being crawled from Nutch?
>
> How could I download the latest patch?
>
> I've enabled nutch-site.xml with,
> <property>
>   <name>tika.use_boilerpipe</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>ArticleExtractor</value>
> </property>
>
> On 11/14/13, 12:26 PM, "Markus Jelsma" <[email protected]> wrote:
>
> >You need my latest patch: 17/Jun/13 16:34. This is for trunk (1.8) but
> >also works on 1.7 and 1.6.
> >
> >Set the following options in your nutch-site:
> >tika.use_boilerpipe=true
> >tika.boilerpipe.extractor=ArticleExtractor or CanolaExtractor
> >
> >ArticleExtractor works best for, well, article style pages. The Canola
> >will do a better job at extracting pages with many blocks, such as a
> >forum topic page. There is no mechanism to decide which extractor to use
> >so that is a problem in large crawls. You will get some or a lot of noise
> >on many page types. When patched you will get the extracted text in
> >Solr's content field.
> >
> >cheers
> >
> >-----Original message-----
> >> From:Reyes, Mark <[email protected]>
> >> Sent: Thursday 14th November 2013 21:10
> >> To: [email protected]
> >> Subject: Re: Preserve HTML that is being crawled from Nutch?
> >>
> >> RE: https://issues.apache.org/jira/browse/NUTCH-961
> >>
> >> Are there usage instructions on how to do this?
> >>
> >> The JIRA ticket shows several attachments. Is there a specific
> >> attachment to download?
> >>
> >> Please keep in mind that I am running my Solr 4.5 instance and Nutch 1.7
> >> crawl ‘almost’ as described from their respective tutorials. All I am
> >> seeing from my zipped downloads from apache are .jar files.
> >>
> >> Thanks again,
> >> Mark
> >>
> >> On 11/14/13, 2:45 AM, "Markus Jelsma" <[email protected]> wrote:
> >>
> >> >By default title is indexed in the title field and using the headings
> >> >plugin the h1 and h2 etc are indexed as h1..h2 as well, optionally as
> >> >multi valued. Also by default is that all text is indexed into a
> >> >content field, including title and headings. You can try the NUTCH-961
> >> >issue for actual content extraction or use something else.
> >> >
> >> >-----Original message-----
> >> >> From:Reyes, Mark <[email protected]>
> >> >> Sent: Wednesday 13th November 2013 17:50
> >> >> To: [email protected]
> >> >> Subject: Re: Preserve HTML that is being crawled from Nutch?
> >> >>
> >> >> 1. If my HTML page is:
> >> >>
> >> >> <html>
> >> >>   <head>
> >> >>     <title>The web site</title>
> >> >>   </head>
> >> >>
> >> >>   <body>
> >> >>     <div>
> >> >>       <h1>This is the web page of the web site.</h1>
> >> >>       <h2>This is a subcategory title for the web page.</h2>
> >> >>       <p>This is the copy for the web page.</p>
> >> >>     </div>
> >> >>   </body>
> >> >> </html>
> >> >>
> >> >> 2. Currently, the JSON that prints out from Solr is:
> >> >>
> >> >> {
> >> >>   content: "This is the web page of the web site. This is a subcategory title for the web page. This is the copy for the web page."
> >> >> }
> >> >>
> >> >> 3. It's preferred to be more specific such as:
> >> >>
> >> >> {
> >> >>   contentTitle: "<h1>This is the web page of the web site.</h1>",
> >> >>   contentSubTitle: "<h2>This is a subcategory title for the web page.</h2>",
> >> >>   contentBody: "<p>This is the copy for the web page.</p>"
> >> >> }
> >> >>
> >> >> 4. Optionally, the JSON could perhaps be like this:
> >> >>
> >> >> {
> >> >>   contentTitle: "This is the web page of the web site.",
> >> >>   contentSubTitle: "This is a subcategory title for the web page.",
> >> >>   contentBody: "This is the copy for the web page."
> >> >> }
> >> >>
> >> >> I'm guessing this has to occur from augmenting the schema? On a
> >> >> side-note, this may go back to the inquiry I had earlier about the
> >> >> XPath.
> >> >>
> >> >> Thanks again,
> >> >> Mark
> >> >>
> >> >> On 11/13/13, 1:48 AM, "Markus Jelsma" <[email protected]> wrote:
> >> >>
> >> >> >I am not sure what you mean. The raw content, including the HTML, is
> >> >> >stored on disk by default. Each segment has a content directory
> >> >> >containing just that. But I don't know what you mean by markup as
> >> >> >indexed?
> >> >> >
> >> >> >-----Original message-----
> >> >> >> From:Reyes, Mark <[email protected]>
> >> >> >> Sent: Wednesday 13th November 2013 2:57
> >> >> >> To: [email protected]
> >> >> >> Subject: Preserve HTML that is being crawled from Nutch?
> >> >> >>
> >> >> >> Is there a way to preserve the HTML that is being crawled from
> >> >> >> Nutch 1.7?
> >> >> >>
> >> >> >> Specifically, instead of normalizing the information that is
> >> >> >> crawled into a long string value then assigning that to the
> >> >> >> 'content' key (if viewing in JSON), I'd like to see the markup
> >> >> >> itself as indexed.
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Mark
> >> >> >>
> >> >> >> IMPORTANT NOTICE: This e-mail message is intended to be received
> >> >> >> only by persons entitled to receive the confidential information
> >> >> >> it may contain. E-mail messages sent from Bridgepoint Education
> >> >> >> may contain information that is confidential and may be legally
> >> >> >> privileged. Please do not read, copy, forward or store this
> >> >> >> message unless you are an intended recipient of it. If you
> >> >> >> received this transmission in error, please notify the sender by
> >> >> >> reply e-mail and delete the message and any attachments.

