Hi all,

I'd appreciate some guidance, as I can't find much useful information on the 
web about this. I have set up a Nutch and Solr service that is crawling a 
client's site. They have a lot of pages that are accessed with urls like this:

http://[domain]/resources/consultationonstatutoryguidancefordisabilityinequalityact.aspx

The crawler is finding these urls with no problem and pulling their contents 
into the Solr index.

However, many of the pages at these urls also contain links to attachments, 
using .axd extensions. For example, this page:

http://[domain]/resources/anatozguidetolitigationfundingoptions.aspx

has this link in the body:

<p>
        12 May 2011<br />
        Download 
        <span id="internal-source-marker_0.1622281443260543">
                <a href="/medialibrary.axd?id=414405745" target="_self">
                        An A to Z Guide to Litigation Funding Options 
                </a>
        </span>(PDF, 401 KB)<br />
        <span id="internal-source-marker_0.1622281443260543">
                Julian Chamberlayne, Stewarts Law and 
        </span>
        David Hartley, Abbey Legal Protection<br />From the ELA Annual Conference 2011
</p>

The problem I'm finding is that the crawler is apparently not visiting or 
indexing the content of these urls. The document at the far end of the link, 
which has this url:

http://[domain]/medialibrary.axd?id=414405745

is actually a pdf. I am using the tika plugin, which I thought would allow for 
indexing pdfs.

Anyway, I'd be very grateful for some guidance about how to get Nutch to follow 
these links.
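
One thing I have not ruled out is url filtering. The sketch below is only my guess at how a first-match regex url filter would treat the two kinds of url, assuming the configuration still contains a rule like the "-[?*!@=]" line from the sample regex-urlfilter.txt followed by a catch-all "+." rule. The rule list, the passes_filter helper and the example.com host are all assumptions for illustration; this is not Nutch's own code or my actual configuration.

```python
import re

# Assumed rules in file order: (accept?, pattern). Not my real configuration.
ASSUMED_RULES = [
    (False, re.compile(r"[?*!@=]")),  # like "-[?*!@=]": reject probable queries
    (True, re.compile(r".")),         # like "+.": accept anything else
]


def passes_filter(url, rules=ASSUMED_RULES):
    """The first rule whose pattern is found in the url decides the outcome.

    A url that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# example.com is a hypothetical stand-in for the elided [domain].
page = "http://example.com/resources/anatozguidetolitigationfundingoptions.aspx"
attachment = "http://example.com/medialibrary.axd?id=414405745"

print(passes_filter(page))
print(passes_filter(attachment))
```

If a rule like that is in effect, the .aspx pages would pass while the .axd links would be dropped because of the "?" and "=" in the query string, which would match what I am seeing.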

Thanks,


Ian.