Question on crawling specific content for one page being deep-linked…

- On Nutch 1.7, I am crawling a single page that is deep-linked through URL fragments such as:
http://www.mywebsite.com/1761.htm#catalog1762
http://www.mywebsite.com/1761.htm#catalog1986
http://www.mywebsite.com/1761.htm#catalog1987

- Currently, the entire page is parsed for each URL, and Solr returns the same 
JSON for all three, such as:
'content' : 'Everything at the header. Stuff about catalog 1762. Stuff about 
catalog 1986. Stuff about catalog 1987. Everything at the footer.'
'content' : 'Everything at the header. Stuff about catalog 1762. Stuff about 
catalog 1986. Stuff about catalog 1987. Everything at the footer.'
'content' : 'Everything at the header. Stuff about catalog 1762. Stuff about 
catalog 1986. Stuff about catalog 1987. Everything at the footer.'

- Instead, I want the content returned for each URL to be limited to the 
section its fragment points to. The HTML at those anchors is the following:

<a id="catalog1762"></a>
<h2 class="catalog-section-headline">Catalog 1762</h2>
<span class="catalog-section-text">
Stuff about catalog 1762.
</span>

<a id="catalog1986"></a>
<h2 class="catalog-section-headline">Catalog 1986</h2>
<span class="catalog-section-text">
Stuff about catalog 1986.
</span>

<a id="catalog1987"></a>
<h2 class="catalog-section-headline">Catalog 1987</h2>
<span class="catalog-section-text">
Stuff about catalog 1987.
</span>

What would you recommend so that the JSON I get back from my Solr instance 
contains only the specific h2 and span content for each fragment instead?
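
To make the desired result concrete, here is a minimal sketch in plain Python (standard library only). It is not Nutch or Solr code, and the names AnchorSectionParser, content_for, WANTED and PAGE are hypothetical, made up for illustration. It assumes each section starts at an <a id="..."> marker and that only elements with the two catalog-section classes should be kept:

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

# Assumption: only these two classes carry the text I want indexed.
WANTED = {"catalog-section-headline", "catalog-section-text"}


class AnchorSectionParser(HTMLParser):
    """Group wanted text under the most recent <a id="..."> marker."""

    def __init__(self):
        super().__init__()
        self.sections = {}     # anchor id -> list of text chunks
        self.current = None    # anchor id of the section being read
        self.capturing = None  # tag name of the wanted element we are inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("id"):
            # A new anchor starts a new section.
            self.current = attrs["id"]
            self.sections[self.current] = []
        elif self.current and self.capturing is None:
            classes = set((attrs.get("class") or "").split())
            if WANTED & classes:
                self.capturing = tag

    def handle_endtag(self, tag):
        # Simplification: nested elements with the same tag name
        # would end the capture early.
        if tag == self.capturing:
            self.capturing = None

    def handle_data(self, data):
        text = data.strip()
        if self.capturing and text:
            self.sections[self.current].append(text)


def content_for(url, html):
    """Return only the text of the section named by the URL fragment."""
    fragment = urldefrag(url).fragment
    parser = AnchorSectionParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.sections.get(fragment, []))


# Sample page modeled on the HTML above, with a header and footer added.
PAGE = """
<p>Everything at the header.</p>
<a id="catalog1762"></a>
<h2 class="catalog-section-headline">Catalog 1762</h2>
<span class="catalog-section-text">
Stuff about catalog 1762.
</span>
<a id="catalog1986"></a>
<h2 class="catalog-section-headline">Catalog 1986</h2>
<span class="catalog-section-text">
Stuff about catalog 1986.
</span>
<p>Everything at the footer.</p>
"""

print(content_for("http://www.mywebsite.com/1761.htm#catalog1986", PAGE))
```

In other words, for the #catalog1986 URL I would like the 'content' field to hold only the headline and text of that one section, with the header, footer, and the other catalog sections left out.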

Thank you,
Mark

