Hi Ian,

What fetching depth are you using?

Lewis

On Mon, Jan 23, 2012 at 7:46 AM, Ian Piper <[email protected]> wrote:

> Hi all,
>
> I'd appreciate some guidance... can't seem to find much useful stuff on
> the web on this. I have set up a Nutch and Solr service that is crawling a
> client's site. They have a lot of pages that are accessed with urls like
> this:
>
> http://[domain]/resources/consultationonstatutoryguidancefordisabilityinequalityact.aspx
>
> The crawler is finding these urls with no problem and pulling their
> contents into the Solr index.
>
> However, many of the pages at these urls also contain links to
> attachments, using .axd extensions. For example, this page:
>
> http://[domain]/resources/anatozguidetolitigationfundingoptions.aspx
>
> has this link in the body:
>
> <p>
> 12 May 2011<br />
> Download
> <span id="internal-source-marker_0.1622281443260543">
> <a href="/medialibrary.axd?id=414405745" target="_self">
> An A to Z Guide to Litigation Funding Options
> </a>
> </span>(PDF, 401 KB)<br />
> <span id="internal-source-marker_0.1622281443260543">
> Julian Chamberlayne, Stewarts Law and
> </span>
> David Hartley, Abbey Legal Protection<br />From the ELA Annual
> Conference 2011
> </p>
>
> The problem I'm finding is that the crawler is not apparently visiting or
> indexing the content of these urls. The document at the far end of the link
> has this url
>
> http://[domain]/medialibrary.axd?id=414405745
>
> is actually a pdf. I am using the tika plugin which I thought would allow
> for indexing pdfs.
>
> Anyway, I'd be very grateful for some guidance about how to get Nutch to
> follow these links.
>
> Thanks,
>
> Ian.
> --

--
*Lewis*
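
Besides depth, one thing that may be worth checking is the URL filter. Nothing in this thread confirms it is the cause. A stock conf/regex-urlfilter.txt usually carries a rule like -[?*!@=] that rejects URLs that look like queries, and the attachment link above contains "?id=". The sketch below only simulates first-match-wins filtering in Python; the rule list is an assumed copy of typical defaults rather than Ian's actual file, accepts() is a hypothetical helper rather than a Nutch API, and example.invalid stands in for the elided [domain].

```python
import re

# Hypothetical rule set modelled on a typical stock regex-urlfilter.txt.
# The real file on a given install may differ.
RULES = [
    ("-", r"^(file|ftp|mailto):"),  # skip non-http schemes
    ("-", r"[?*!@=]"),              # skip URLs that look like queries
    ("+", r"."),                    # accept anything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule.

    Assumes first-match-wins semantics; a URL matching no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


# example.invalid is a stand-in for the elided [domain] in the original mail.
page = "http://example.invalid/resources/anatozguidetolitigationfundingoptions.aspx"
attachment = "http://example.invalid/medialibrary.axd?id=414405745"

print(accepts(page))        # True: no query characters, falls through to '+.'
print(accepts(attachment))  # False: '?' and '=' hit the query-skipping rule

# Dropping '?' and '=' from the character class lets the attachment through.
relaxed = [("-", r"^(file|ftp|mailto):"), ("-", r"[*!@]"), ("+", r".")]
print(accepts(attachment, relaxed))  # True
```

If the real filter file does contain such a rule, relaxing it (or adding a + rule for medialibrary.axd ahead of it) would be the thing to try before looking at the tika side.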

