Re: Preserve HTML that is being crawled from Nutch?

Reyes, Mark Wed, 13 Nov 2013 08:50:25 -0800

1. If my HTML page is:

<html>
        <head>
                <title>The web site</title>
        </head>
        
        <body>
                <div>           
                        <h1>This is the web page of the web site.</h1>
                        <h2>This is a subcategory title for the web page.</h2>
                        <p>This is the copy for the web page.</p>
                </div>
        </body>
</html>

2. Currently, the JSON that prints out from Solr is:

{
        content: "This is the web page of the web site. This is a subcategory
title for the web page. This is the copy for the web page."
}

3. I'd prefer something more specific, such as:

{
        contentTitle: "<h1>This is the web page of the web site.</h1>",
        contentSubTitle: "<h2>This is a subcategory title for the web
page.</h2>",
        contentBody: "<p>This is the copy for the web page.</p>"
}

4. Optionally, the JSON could perhaps be like this:

{
        contentTitle: "This is the web page of the web site.",
        contentSubTitle: "This is a subcategory title for the web page.",
        contentBody: "This is the copy for the web page."
}

I'm guessing this has to occur from augmenting the schema? On a side-note,
this may go back to the inquiry I had earlier about the XPath.
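
To make the mapping in items 3 and 4 concrete, here is a minimal sketch in
Python. It is not Nutch or Solr code and says nothing about where such a step
would plug into the crawl; it only shows the tag-to-field split on the sample
page. The names FIELD_BY_TAG, FieldExtractor, extract_fields, and keep_markup
are made up for this illustration.

```python
import json
from html.parser import HTMLParser

# Assumed mapping from HTML tag to the field names used in items 3 and 4.
FIELD_BY_TAG = {"h1": "contentTitle", "h2": "contentSubTitle", "p": "contentBody"}


class FieldExtractor(HTMLParser):
    """Collects the text of the first <h1>, <h2>, and <p> into separate fields.

    Nested inline tags (e.g. <em>) are dropped; only their text is kept.
    """

    def __init__(self, keep_markup=False):
        super().__init__()
        self.keep_markup = keep_markup  # True -> item 3 style, False -> item 4 style
        self.fields = {}
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        field = FIELD_BY_TAG.get(tag)
        # Start capturing only for mapped tags whose field is still empty.
        if self._current is None and field is not None and field not in self.fields:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace so wrapped source lines become one string.
            text = " ".join("".join(self._buffer).split())
            if self.keep_markup:
                text = f"<{tag}>{text}</{tag}>"
            self.fields[FIELD_BY_TAG[tag]] = text
            self._current = None


def extract_fields(html, keep_markup=False):
    """Return a dict of field name -> extracted value for one HTML page."""
    parser = FieldExtractor(keep_markup=keep_markup)
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    page = """<html><head><title>The web site</title></head><body><div>
    <h1>This is the web page of the web site.</h1>
    <h2>This is a subcategory title for the web page.</h2>
    <p>This is the copy for the web page.</p>
    </div></body></html>"""
    print(json.dumps(extract_fields(page, keep_markup=True), indent=2))
    print(json.dumps(extract_fields(page), indent=2))
```

With keep_markup=True the values keep their wrapping tags (item 3); with the
default they are plain text (item 4). Only the first occurrence of each tag is
kept, which is a simplification for this sample page.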

Thanks again,
Mark

On 11/13/13, 1:48 AM, "Markus Jelsma" <[email protected]> wrote:

>I am not sure what you mean. The raw content, including the HTML, is
>stored on disk by default. Each segment has a content directory
>containing just that. But I don't know what you mean by markup as indexed?
> 
>-----Original message-----
>> From:Reyes, Mark <[email protected]>
>> Sent: Wednesday 13th November 2013 2:57
>> To: [email protected]
>> Subject: Preserve HTML that is being crawled from Nutch?
>> 
>> Is there a way to preserve the HTML that is being crawled from Nutch
>>1.7?
>> 
>> Specifically, instead of normalizing the information that is crawled
>>into a long string value then assigning that to the 'content' key (if
>>viewing in JSON), I'd like to see the markup itself as indexed.
>> 
>> Thanks,
>> Mark
>> 
>> 
>> IMPORTANT NOTICE: This e-mail message is intended to be received only
>>by persons entitled to receive the confidential information it may
>>contain. E-mail messages sent from Bridgepoint Education may contain
>>information that is confidential and may be legally privileged. Please
>>do not read, copy, forward or store this message unless you are an
>>intended recipient of it. If you received this transmission in error,
>>please notify the sender by reply e-mail and delete the message and any
>>attachments.

