At the suggestion of Markus I have moved from the Solr list to the Nutch
list...

Basically I am trying to get Nutch to crawl particular websites so that Solr
can consume the data Nutch provides and make the documents linked from those
sites searchable. The site has several links to PDF documents, and I want to
index them with Solr so that both the anchors and their associated URLs end up
in the Solr index. So far I have only been able to capture some of the anchors
and none of the URLs they link to. The thread I started on the Solr list is
below.

@Markus: When you say that the problem may be with url filters, what can I
do about that? How do I dump the linkdb to inspect it for URLs? I appreciate
all the help you've offered thus far!
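On the linkdb question, a minimal sketch assuming a Nutch 1.x layout with the crawl data under a `crawl/` directory (substitute your own paths): the `readlinkdb` tool can dump the linkdb to plain text for inspection.

```shell
# Dump the linkdb to a readable text directory
bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump

# The dump lands in part files; search them for the PDF URLs
grep -i '\.pdf' linkdb-dump/part-00000
```

If the PDF URLs are absent from the dump, they were filtered out before ever reaching the linkdb, which points back at the URL filters.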

Hi, are you getting PDFs at all? It sounds like a problem with URL filters;
those also work on the linkdb. You should also try dumping the linkdb and
inspecting it for URLs.

By the way, I noticed this is on the Solr list; it's best to open a new
discussion on the Nutch user mailing list.
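On the URL-filter point: Nutch's default `conf/regex-urlfilter.txt` skips URLs by file suffix, and if `pdf` appears in that exclusion list the PDF links never make it into the crawldb or linkdb. A sketch of the relevant lines (the exact default list varies by Nutch version, so check your own copy):

```
# skip image and other binary suffixes -- make sure 'pdf' is NOT in this list
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|exe|gz|mov|MOV|bmp|BMP)$

# accept anything else
+.
```

Parsing the fetched PDFs also requires the Tika parser plugin (`parse-tika`) to be enabled in `plugin.includes` in `conf/nutch-site.xml`.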

Cheers

Teague James <[email protected]> schreef:What I'm getting is just the
anchor text. In cases where there are multiple anchors I am getting a comma
separated list of anchor text - which is fine. However, I am not getting all
of the anchors that are on the page, nor am I getting any of the URLs. The
anchors I am getting back never include anchors that lead to documents -
which is the primary objective. So on a page that looks something like:

Article 1 text blah blah blah [Read more] Article 2 text blah blah blah
[Read more] Download the [PDF]

Where each [Read more] links to a page where the rest of the article is
stored and [PDF] links to a PDF document (these are relative links). What I
get back in the anchor field is "[Read more]","[Read more]"

I am not getting the "[PDF]" anchor and I am not getting any of the URLs
that those anchors point to - like "/Article 1", "/Article 2", and
"/documents/Article 1.pdf"

How can I get these URLs?
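One thing worth checking, assuming Nutch 1.x: anchors only reach Solr if the `index-anchor` indexing plugin is enabled (add it to `plugin.includes` in `conf/nutch-site.xml`) and the linkdb is passed to the indexing job. A sketch, with example paths and an example local Solr URL:

```shell
# Index the crawl into Solr, passing the linkdb so inlink
# anchors are available to the index-anchor plugin
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb \
    -linkdb crawl/linkdb crawl/segments/*
```

Without the `-linkdb` argument the indexer has no inlink data at all, which would explain missing anchors and URLs even when the linkdb itself is populated.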
