Hi all, I'd appreciate some guidance, as I can't find much useful information on the web about this. I have set up a Nutch and Solr service that is crawling a client's site. They have a lot of pages that are accessed with URLs like this:
http://[domain]/resources/consultationonstatutoryguidancefordisabilityinequalityact.aspx

The crawler finds these URLs without any problem and pulls their contents into the Solr index. However, many of the pages at these URLs also contain links to attachments, which are served through URLs with an .axd extension. For example, this page:

http://[domain]/resources/anatozguidetolitigationfundingoptions.aspx

has this link in the body:

<p>
  12 May 2011<br />
  Download
  <span id="internal-source-marker_0.1622281443260543">
    <a href="/medialibrary.axd?id=414405745" target="_self"> An A to Z Guide to Litigation Funding Options </a>
  </span>(PDF, 401 KB)<br />
  <span id="internal-source-marker_0.1622281443260543"> Julian Chamberlayne, Stewarts Law and </span>
  David Hartley, Abbey Legal Protection<br />
  From the ELA Annual Conference 2011
</p>

The problem is that the crawler does not appear to visit or index the content behind these links. The document at the far end of the link, http://[domain]/medialibrary.axd?id=414405745, is actually a PDF. I am using the Tika plugin, which I thought would allow PDFs to be indexed.

I'd be very grateful for some guidance on how to get Nutch to follow these links.

Thanks,
Ian.

