Re: Preserve HTML that is being crawled from Nutch?

Reyes, Mark Thu, 14 Nov 2013 12:11:07 -0800

RE: https://issues.apache.org/jira/browse/NUTCH-961


Are there usage instructions on how to do this?

The JIRA ticket shows several attachments. Is there a specific attachment
to download? 

Please keep in mind that I am running my Solr 4.5 instance and Nutch 1.7
crawl ‘almost’ as described from their respective tutorials. All I am
seeing from my zipped downloads from Apache are .jar files.

Thanks again,
Mark
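
For illustration, the per-element fields requested in the quoted message below (items 3 and 4) can be sketched as a standalone post-processing step. This is not the NUTCH-961 patch and not a Nutch or Solr API; it assumes the raw HTML of a page is already in hand, and the names `FIELD_FOR_TAG`, `FieldExtractor`, and `extract_fields` are hypothetical. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical mapping from tag name to the field names used in this thread.
FIELD_FOR_TAG = {"h1": "contentTitle", "h2": "contentSubTitle", "p": "contentBody"}


class FieldExtractor(HTMLParser):
    """Collect the text of h1, h2 and p elements into separate fields."""

    def __init__(self, keep_markup=False):
        super().__init__()
        self.keep_markup = keep_markup
        self.fields = {}
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only when not already inside a mapped element.
        if self._current is None and tag in FIELD_FOR_TAG:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if self.keep_markup:
                # Re-wrap in the outer tag; nested markup is not preserved.
                text = f"<{tag}>{text}</{tag}>"
            # Assumption: only the first occurrence of each tag is kept.
            self.fields.setdefault(FIELD_FOR_TAG[tag], text)
            self._current = None


def extract_fields(html, keep_markup=False):
    parser = FieldExtractor(keep_markup=keep_markup)
    parser.feed(html)
    parser.close()
    return parser.fields


# The sample page from the quoted message.
PAGE = """
<html>
  <head><title>The web site</title></head>
  <body>
    <div>
      <h1>This is the web page of the web site.</h1>
      <h2>This is a subcategory title for the web page.</h2>
      <p>This is the copy for the web page.</p>
    </div>
  </body>
</html>
"""

plain = extract_fields(PAGE)
marked_up = extract_fields(PAGE, keep_markup=True)
```

Here `plain` corresponds to the item 4 layout and `marked_up` to the item 3 layout. Getting such fields into Solr would still need matching fields in the schema, as guessed in the quoted message.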



On 11/14/13, 2:45 AM, "Markus Jelsma" <[email protected]> wrote:

>By default the title is indexed in the title field, and with the headings
>plugin the h1, h2, etc. are indexed as h1..h2 as well, optionally as
>multi-valued. Also by default, all text is indexed into a content
>field, including title and headings. You can try the NUTCH-961 issue for
>actual content extraction or use something else.
> 
>-----Original message-----
>> From:Reyes, Mark <[email protected]>
>> Sent: Wednesday 13th November 2013 17:50
>> To: [email protected]
>> Subject: Re: Preserve HTML that is being crawled from Nutch?
>> 
>> 1. If my HTML page is:
>> 
>> <html>
>>      <head>
>>              <title>The web site</title>
>>      </head>
>>      
>>      <body>
>>              <div>           
>>                      <h1>This is the web page of the web site.</h1>
>>                      <h2>This is a subcategory title for the web page.</h2>
>>                      <p>This is the copy for the web page.</p>
>>              </div>
>>      </body>
>> </html>
>> 
>> 2. Currently, the JSON that prints out from Solr is:
>> 
>> {
>>      content: "This is the web page of the web site. This is a subcategory
>> title for the web page. This is the copy for the web page."
>> }
>> 
>> 3. It's preferred to be more specific such as:
>> 
>> {
>>      contentTitle: "<h1>This is the web page of the web site.</h1>",
>>      contentSubTitle: "<h2>This is a subcategory title for the web page.</h2>",
>>      contentBody: "<p>This is the copy for the web page.</p>"
>> }
>> 
>> 4. Optionally, the JSON could perhaps be like this:
>> 
>> {
>>      contentTitle: "This is the web page of the web site.",
>>      contentSubTitle: "This is a subcategory title for the web page.",
>>      contentBody: "This is the copy for the web page."
>> }
>> 
>> 
>> I'm guessing this has to occur from augmenting the schema? On a side note,
>> this may go back to the inquiry I had earlier about the XPath.
>> 
>> 
>> Thanks again,
>> Mark
>> 
>> 
>> 
>> On 11/13/13, 1:48 AM, "Markus Jelsma" <[email protected]>
>>wrote:
>> 
>> >I am not sure what you mean. The raw content, including the HTML, is
>> >stored on disk by default. Each segment has a content directory
>> >containing just that. But I don't know what you mean by markup as
>> >indexed.
>> > 
>> >-----Original message-----
>> >> From:Reyes, Mark <[email protected]>
>> >> Sent: Wednesday 13th November 2013 2:57
>> >> To: [email protected]
>> >> Subject: Preserve HTML that is being crawled from Nutch?
>> >> 
>> >> Is there a way to preserve the HTML that is being crawled from Nutch
>> >>1.7?
>> >> 
>> >> Specifically, instead of normalizing the information that is crawled
>> >>into a long string value then assigning that to the 'content' key (if
>> >>viewing in JSON), I'd like to see the markup itself as indexed.
>> >> 
>> >> Thanks,
>> >> Mark
>> >> 
>> >> 
>> >> IMPORTANT NOTICE: This e-mail message is intended to be received only
>> >>by persons entitled to receive the confidential information it may
>> >>contain. E-mail messages sent from Bridgepoint Education may contain
>> >>information that is confidential and may be legally privileged. Please
>> >>do not read, copy, forward or store this message unless you are an
>> >>intended recipient of it. If you received this transmission in error,
>> >>please notify the sender by reply e-mail and delete the message and
>>any
>> >>attachments.
>> 
>> 


