You need my latest patch, the attachment dated 17/Jun/13 16:34. It is made for
trunk (1.8) but also works on 1.7 and 1.6.

Set the following options in your nutch-site.xml:
tika.use_boilerpipe=true
tika.boilerpipe.extractor=ArticleExtractor (or CanolaExtractor)
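
In nutch-site.xml that comes down to something like this (just a sketch; pick
whichever extractor value you want):

<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>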

ArticleExtractor works best for article-style pages. CanolaExtractor does a 
better job on pages with many text blocks, such as a forum topic page. There is 
no mechanism to decide per page which extractor to use, so that is a problem in 
large crawls: you will get some, or a lot of, noise on many page types. Once 
patched, you will get the extracted text in Solr's content field.
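
To see what ends up there you can query Solr, for example with something like
this (assuming the default collection1 core and a schema that stores the
content field):

curl 'http://localhost:8983/solr/collection1/select?q=*:*&fl=url,title,content&wt=json&indent=true'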

cheers
 
 
-----Original message-----
> From:Reyes, Mark <[email protected]>
> Sent: Thursday 14th November 2013 21:10
> To: [email protected]
> Subject: Re: Preserve HTML that is being crawled from Nutch?
> 
> RE: https://issues.apache.org/jira/browse/NUTCH-961
> 
> Are there usage instructions on how to do this?
> 
> The JIRA ticket shows several attachments. Is there a specific attachment
> to download? 
> 
> Please keep in mind that I am running my Solr 4.5 instance and Nutch 1.7
> crawl ‘almost’ as described in their respective tutorials. All I see in the
> zipped downloads from Apache are .jar files.
> 
> Thanks again,
> Mark
> 
> 
> 
> On 11/14/13, 2:45 AM, "Markus Jelsma" <[email protected]> wrote:
> 
> >By default the title is indexed in the title field, and with the headings
> >plugin the h1, h2, etc. headings are indexed into fields of the same name
> >as well, optionally multi-valued. Also by default, all text, including title
> >and headings, is indexed into a content field. You can try the NUTCH-961
> >issue for actual content extraction or use something else.
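> >The headings plugin has to be in plugin.includes and, if I remember the
> >property names correctly, is configured roughly like this in nutch-site.xml:
> >
> ><property>
> >  <name>headings</name>
> >  <value>h1,h2</value>
> ></property>
> ><property>
> >  <name>headings.multivalued</name>
> >  <value>true</value>
> ></property>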
> > 
> >-----Original message-----
> >> From:Reyes, Mark <[email protected]>
> >> Sent: Wednesday 13th November 2013 17:50
> >> To: [email protected]
> >> Subject: Re: Preserve HTML that is being crawled from Nutch?
> >> 
> >> 1. If my HTML page is:
> >> 
> >> <html>
> >>    <head>
> >>            <title>The web site</title>
> >>    </head>
> >>    
> >>    <body>
> >>            <div>           
> >>                    <h1>This is the web page of the web site.</h1>
> >>                    <h2>This is a subcategory title for the web page.</h2>
> >>                    <p>This is the copy for the web page.</p>
> >>            </div>
> >>    </body>
> >> </html>
> >> 
> >> 2. Currently, the JSON that prints out from Solr is:
> >> 
> >> {
> >>    content: "This is the web page of the web site. This is a subcategory
> >> title for the web page. This is the copy for the web page."
> >> }
> >> 
> >> 3. It's preferred to be more specific such as:
> >> 
> >> {
> >>    contentTitle: "<h1>This is the web page of the web site.</h1>",
> >>    contentSubTitle: "<h2>This is a subcategory title for the web page.</h2>",
> >>    contentBody: "<p>This is the copy for the web page.</p>"
> >> }
> >> 
> >> 4. Optionally, the JSON could perhaps be like this:
> >> 
> >> {
> >>    contentTitle: "This is the web page of the web site.",
> >>    contentSubTitle: "This is a subcategory title for the web page.",
> >>    contentBody: "This is the copy for the web page."
> >> }
> >> 
> >> 
> >> I'm guessing this has to happen by augmenting the schema? On a side note,
> >> this may go back to the inquiry I had earlier about XPath.
> >> 
> >> 
> >> Thanks again,
> >> Mark
> >> 
> >> 
> >> 
> >> On 11/13/13, 1:48 AM, "Markus Jelsma" <[email protected]>
> >>wrote:
> >> 
> >> >I am not sure what you mean. The raw content, including the HTML, is
> >> >stored on disk by default; each segment has a content directory containing
> >> >just that. But I don't know what you mean by the markup as indexed?
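> >> >If you want to look at it, you can dump a segment with the segment reader,
> >> >something like this (the segment path is just an example):
> >> >
> >> >bin/nutch readseg -dump crawl/segments/20131113000000 dumpdir \
> >> >  -nofetch -nogenerate -noparse -noparsedata -noparsetext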
> >> > 
> >> >-----Original message-----
> >> >> From:Reyes, Mark <[email protected]>
> >> >> Sent: Wednesday 13th November 2013 2:57
> >> >> To: [email protected]
> >> >> Subject: Preserve HTML that is being crawled from Nutch?
> >> >> 
> >> >> Is there a way to preserve the HTML that is being crawled from Nutch
> >> >>1.7?
> >> >> 
> >> >> Specifically, instead of normalizing the information that is crawled
> >> >> into a long string value then assigning that to the 'content' key (if
> >> >> viewing in JSON), I'd like to see the markup itself as indexed.
> >> >> 
> >> >> Thanks,
> >> >> Mark
> >> >> 
> >> 
> 
