RE: Preserve HTML that is being crawled from Nutch?

Markus Jelsma Thu, 14 Nov 2013 14:10:20 -0800

That is the latest patch I referred to. Download it and get yourself a copy of 
the 1.7 sources, or do an svn export of trunk. Put the patch in the root folder 
of the sources and apply it with: patch -p0 < file.patch


Build with $ ant and you have yourself an extracting Nutch in runtime/local/
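
If you would rather script the steps above, here is a minimal Python sketch. It assumes the patch was saved as NUTCH-961.patch in the root of the sources (a hypothetical file name), that patch and ant are on your PATH, and it uses patch's -i option in place of the shell redirect. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(patch_file):
    """Return the patch and build commands, in the order they should run."""
    return [
        # Same effect as: patch -p0 < file.patch
        ["patch", "-p0", "-i", str(patch_file)],
        # Default ant target; the build ends up in runtime/local/
        ["ant"],
    ]


def patch_and_build(source_root, patch_file="NUTCH-961.patch"):
    """Run each command from the root folder of the Nutch sources."""
    root = Path(source_root)
    for cmd in build_commands(patch_file):
        # check=True stops at the first step that fails
        subprocess.run(cmd, cwd=root, check=True)
    return root / "runtime" / "local"
```

Calling patch_and_build("/path/to/nutch-sources") would return the runtime/local/ path once both steps succeed.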

For more info on patching etc, check the wiki or the many other mailing list 
topics.

Cheers

 
 
-----Original message-----
> From:Reyes, Mark <[email protected]>
> Sent: Thursday 14th November 2013 22:28
> To: [email protected]
> Subject: Re: Preserve HTML that is being crawled from Nutch?
> 
> How could I download the latest patch?
> 
> I've enabled it in nutch-site.xml with:
> <property>
>   <name>tika.use_boilerpipe</name>
>   <value>true</value>
> </property>
> 
> <property>
>   <name>tika.boilerpipe.extractor</name>
>   <value>ArticleExtractor</value>
> </property>
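
A quick way to confirm that both properties are really in the file is to read it back. This is a minimal Python sketch, assuming the usual configuration/property layout of nutch-site.xml. read_properties is a made-up helper name and not part of Nutch, and SAMPLE stands in for the real file contents.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property> name to its value."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # Strip stray whitespace left over from indentation
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Stand-in for the contents of nutch-site.xml (assumed layout)
SAMPLE = """
<configuration>
  <property>
    <name>tika.use_boilerpipe</name>
    <value>true</value>
  </property>
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
"""

props = read_properties(SAMPLE)
```

With the real file you would pass its text instead of SAMPLE and check that both tika.* keys are present.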
> 
> 
> 
> On 11/14/13, 12:26 PM, "Markus Jelsma" <[email protected]> wrote:
> 
> >You need my latest patch: 17/Jun/13 16:34. This is for trunk (1.8) but
> >also works on 1.7 and 1.6.
> >
> >Set the following options in your nutch-site:
> >tika.use_boilerpipe=true
> >tika.boilerpipe.extractor=ArticleExtractor or CanolaExtractor
> >
> >ArticleExtractor works best for, well, article-style pages. CanolaExtractor
> >will do a better job at extracting pages with many blocks, such as a
> >forum topic page. There is no mechanism to decide which extractor to use
> >so that is a problem in large crawls. You will get some or a lot of noise
> >on many page types. When patched you will get the extracted text in
> >Solr's content field.
> >
> >cheers
> > 
> > 
> >-----Original message-----
> >> From:Reyes, Mark <[email protected]>
> >> Sent: Thursday 14th November 2013 21:10
> >> To: [email protected]
> >> Subject: Re: Preserve HTML that is being crawled from Nutch?
> >> 
> >> RE: https://issues.apache.org/jira/browse/NUTCH-961
> >> 
> >> Are there usage instructions on how to do this?
> >> 
> >> The JIRA ticket shows several attachments. Is there a specific attachment
> >> to download? 
> >> 
> >> Please keep in mind that I am running my Solr 4.5 instance and Nutch 1.7
> >> crawl ‘almost’ as described from their respective tutorials. All I am
> >> seeing from my zipped downloads from apache are .jar files.
> >> 
> >> Thanks again,
> >> Mark
> >> 
> >> 
> >> 
> >> On 11/14/13, 2:45 AM, "Markus Jelsma" <[email protected]> wrote:
> >> 
> >> >By default the title is indexed in the title field and, using the headings
> >> >plugin, h1, h2 etc. are indexed as h1, h2 etc. as well, optionally as
> >> >multi-valued. Also by default, all text is indexed into a content
> >> >field, including title and headings. You can try the NUTCH-961 issue for
> >> >actual content extraction or use something else.
> >> > 
> >> >-----Original message-----
> >> >> From:Reyes, Mark <[email protected]>
> >> >> Sent: Wednesday 13th November 2013 17:50
> >> >> To: [email protected]
> >> >> Subject: Re: Preserve HTML that is being crawled from Nutch?
> >> >> 
> >> >> 1. If my HTML page is:
> >> >> 
> >> >> <html>
> >> >>         <head>
> >> >>                 <title>The web site</title>
> >> >>         </head>
> >> >>         
> >> >>         <body>
> >> >>                 <div>           
> >> >>                         <h1>This is the web page of the web site.</h1>
> >> >>                         <h2>This is a subcategory title for the web page.</h2>
> >> >>                         <p>This is the copy for the web page.</p>
> >> >>                 </div>
> >> >>         </body>
> >> >> </html>
> >> >> 
> >> >> 2. Currently, the JSON that prints out from Solr is:
> >> >> 
> >> >> {
> >> >>         content: "This is the web page of the web site. This is a subcategory
> >> >> title for the web page. This is the copy for the web page."
> >> >> }
> >> >> 
> >> >> 3. It's preferred to be more specific such as:
> >> >> 
> >> >> {
> >> >>         contentTitle: "<h1>This is the web page of the web site.</h1>",
> >> >>         contentSubTitle: "<h2>This is a subcategory title for the web page.</h2>",
> >> >>         contentBody: "<p>This is the copy for the web page.</p>"
> >> >> }
> >> >> 
> >> >> 4. Optionally, the JSON could perhaps be like this:
> >> >> 
> >> >> {
> >> >>         contentTitle: "This is the web page of the web site.",
> >> >>         contentSubTitle: "This is a subcategory title for the web page.",
> >> >>         contentBody: "This is the copy for the web page."
> >> >> }
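
For illustration only, the split asked for in items 3 and 4 can be sketched outside of Nutch and Solr. This is a minimal Python sketch using the standard library's html.parser. The field names come from the JSON above, the tag-to-field mapping is an assumption, and FieldExtractor and extract_fields are made-up names. It is not what the NUTCH-961 patch or the headings plugin does internally.

```python
from html.parser import HTMLParser

# Field names follow the JSON above; which tag feeds which field is an assumption.
FIELD_FOR_TAG = {"h1": "contentTitle", "h2": "contentSubTitle", "p": "contentBody"}


class FieldExtractor(HTMLParser):
    """Collect the text of h1, h2 and p elements into named fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag in FIELD_FOR_TAG:
            self._current = FIELD_FOR_TAG[tag]
            # One list entry per element, so repeated tags become multi-valued
            self.fields.setdefault(self._current, []).append("")

    def handle_endtag(self, tag):
        if tag in FIELD_FOR_TAG and FIELD_FOR_TAG[tag] == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current][-1] += data


def extract_fields(html):
    """Return a dict of field name to list of whitespace-normalized texts."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        name: [" ".join(text.split()) for text in texts]
        for name, texts in parser.fields.items()
    }
```

Each field holds a list because a page can contain several h2 or p elements. Text outside the mapped tags, such as the title, is ignored by this sketch.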
> >> >> 
> >> >> 
> >> >> I'm guessing this has to occur from augmenting the schema? On a side note,
> >> >> this may go back to the inquiry I had earlier about the XPath.
> >> >> 
> >> >> 
> >> >> Thanks again,
> >> >> Mark
> >> >> 
> >> >> 
> >> >> 
> >> >> On 11/13/13, 1:48 AM, "Markus Jelsma" <[email protected]> wrote:
> >> >> 
> >> >> >I am not sure what you mean. The raw content, including the HTML, is
> >> >> >stored on disk by default. Each segment has a content directory
> >> >> >containing just that. But I don't know what you mean by markup as indexed?
> >> >> > 
> >> >> >-----Original message-----
> >> >> >> From:Reyes, Mark <[email protected]>
> >> >> >> Sent: Wednesday 13th November 2013 2:57
> >> >> >> To: [email protected]
> >> >> >> Subject: Preserve HTML that is being crawled from Nutch?
> >> >> >> 
> >> >> >> Is there a way to preserve the HTML that is being crawled from Nutch 1.7?
> >> >> >> 
> >> >> >> Specifically, instead of normalizing the information that is crawled
> >> >> >> into a long string value then assigning that to the 'content' key (if
> >> >> >> viewing in JSON), I'd like to see the markup itself as indexed.
> >> >> >> 
> >> >> >> Thanks,
> >> >> >> Mark
> >> >> >> 
> >> >> >> 
> >> >> >> IMPORTANT NOTICE: This e-mail message is intended to be received
> >>only
> >> >> >>by persons entitled to receive the confidential information it may
> >> >> >>contain. E-mail messages sent from Bridgepoint Education may
> >>contain
> >> >> >>information that is confidential and may be legally privileged.
> >>Please
> >> >> >>do not read, copy, forward or store this message unless you are an
> >> >> >>intended recipient of it. If you received this transmission in
> >>error,
> >> >> >>please notify the sender by reply e-mail and delete the message and
> >> >>any
> >> >> >>attachments.
