Hi Ian,

What fetching depth are you using?
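
The reason I ask: the attachment link sits one hop beyond the .aspx page, so it would only be fetched in the round after the page itself. Below is a rough sketch of that effect, not Nutch code. It assumes each unit of depth corresponds to one fetch round, and the link graph (a seed "/" linking straight to the resource page) is made up for illustration.

```python
# Toy simulation of a depth-limited crawl (illustrative only, not Nutch code).
# Hypothetical link graph: seed page -> resource page -> attachment.
LINKS = {
    "/": ["/resources/anatozguidetolitigationfundingoptions.aspx"],
    "/resources/anatozguidetolitigationfundingoptions.aspx": [
        "/medialibrary.axd?id=414405745"
    ],
    "/medialibrary.axd?id=414405745": [],
}


def crawl(seed, depth):
    """Return the URLs fetched in `depth` rounds; round 1 fetches only the seed."""
    fetched = []
    seen = {seed}
    frontier = [seed]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            fetched.append(url)
            # Links discovered in this round are only fetched in the next one.
            for link in LINKS.get(url, []):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched


shallow = crawl("/", 2)  # reaches the .aspx page but not the attachment
deeper = crawl("/", 3)   # one more round picks up the .axd link
```

If your real seed list is further from these pages than in the sketch, the depth needed to reach the attachments goes up accordingly.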

Lewis

On Mon, Jan 23, 2012 at 7:46 AM, Ian Piper <[email protected]> wrote:

> Hi all,
>
> I'd appreciate some guidance... can't seem to find much useful stuff on
> the web on this. I have set up a Nutch and Solr service that is crawling a
> client's site. They have a lot of pages that are accessed with urls like
> this:
>
> http://[domain]/resources/consultationonstatutoryguidancefordisabilityinequalityact.aspx
>
> The crawler is finding these urls with no problem and pulling their
> contents into the Solr index.
>
> However, many of the pages at these urls also contain links to
> attachments, using .axd extensions. For example, this page:
>
> http://[domain]/resources/anatozguidetolitigationfundingoptions.aspx
>
> has this link in the body:
>
> <p>
>        12 May 2011<br />
>        Download
>        <span id="internal-source-marker_0.1622281443260543">
>                <a href="/medialibrary.axd?id=414405745" target="_self">
>                        An A to Z Guide to Litigation Funding Options
>                </a>
>        </span>(PDF, 401 KB)<br />
>        <span id="internal-source-marker_0.1622281443260543">
>                Julian Chamberlayne, Stewarts Law and
>        </span>
>        David Hartley, Abbey Legal Protection<br />From the ELA Annual
> Conference 2011
> </p>
>
> The problem I'm finding is that the crawler is apparently not visiting or
> indexing the content of these urls. The document at the far end of the
> link, which has this url:
>
> http://[domain]/medialibrary.axd?id=414405745
>
> is actually a pdf. I am using the tika plugin, which I thought would allow
> for indexing pdfs.
>
> Anyway, I'd be very grateful for some guidance about how to get Nutch to
> follow these links.
>
> Thanks,
>
>
> Ian.


-- 
*Lewis*
