Re: Unable to fetch content

Julien Nioche Thu, 17 Jul 2014 14:14:41 -0700

Hi


On 17 July 2014 22:04, Vijay Chakilam <[email protected]> wrote:

> Thanks for your reply Julien. I am not doing any indexing and I don’t have
> a solr url. Looks like crawl script requires me to specify a solr url. How
> do I run crawl script without specifying a solr url?


Just comment out the commands related to SOLR in the script and pass it a
dummy value for the SOLR URL parameter.

> Also, I want to crawl just the webpage I specify: a depth of 1.
> I don’t want to fetch any outlinks.


That can be done by setting db.update.additions.allowed to false in
nutch-site.xml. No new URLs will then be added to the crawldb.
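
As a minimal sketch, the override could look like the fragment below. The
property name is the one mentioned above; the <configuration> wrapper is the
usual layout of nutch-site.xml, and the fragment assumes it is merged into
whatever properties the file already contains.

```xml
<configuration>
  <!-- Assumption: merged into your existing nutch-site.xml -->
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
    <description>Do not add newly discovered URLs to the crawldb;
    only URLs that are already present get updated.</description>
  </property>
</configuration>
```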


> How does number of rounds relate to depth? Are they same?


No. They would be the same only if there were no redirections and if you
were putting all unfetched URLs in the segments. If there are more unfetched
URLs in the crawldb than you are putting in the segments, then you'll
definitely need several iterations.


> If so, what value should I specify for number of rounds to fetch just the
> page I specify and also take care of the redirects. Are http.redirect.max
> and number of rounds related?
>

Set http.redirect.max to a value > 0 so that the redirection gets tried
within the same fetch step (i.e. the same round).
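
For illustration, a sketch of that setting in nutch-site.xml follows. The
value 5 is simply the one quoted earlier in this thread; any value > 0 should
have the same effect of following the redirect inside the fetch step. As
before, the fragment is assumed to be merged into your existing file.

```xml
<configuration>
  <!-- Assumption: merged into your existing nutch-site.xml -->
  <property>
    <name>http.redirect.max</name>
    <!-- 5 is the value from the original message; any value > 0 works -->
    <value>5</value>
    <description>Maximum number of redirects the fetcher follows
    immediately when fetching a page.</description>
  </property>
</configuration>
```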

HTH

Julien



>
> Thanks,
> Vijay
>
> On Jul 17, 2014, at 4:42 PM, Julien Nioche <[email protected]>
> wrote:
>
> > Hi,
> >
> > The crawl command is deprecated, use the crawl script instead and give it a
> > number of rounds > 1 so that it has a chance to fetch the redirection
> >
> > J.
> >
> >
> > On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I am trying to crawl the page at:
> >> http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >>
> >> Here’s the parse checker output.
> >>
> >> runtime/local/bin/nutch parsechecker -dumpText
> >> http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >> Fetch failed with protocol status: temp_moved(13), lastModified=0:
> >> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> >>
> >> Looks like a redirection and I did a parsechecker again for
> >> "https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F"
> >>
> >> The fetching and parsing was successful this time. I have set
> >> http.redirect.max at 5 and tried to crawl using nutch crawl:
> >>
> >> bin/nutch crawl testurl -depth 1
> >>
> >> and did a readseg on the above crawl. Here’s the readseg dump:
> >>
> >> Recno:: 0
> >> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >>
> >> CrawlDatum::
> >> Version: 7
> >> Status: 1 (db_unfetched)
> >> Fetch time: Wed Jul 16 00:43:28 EDT 2014
> >> Modified time: Wed Dec 31 19:00:00 EST 1969
> >> Retries since fetch: 0
> >> Retry interval: 2592000 seconds (30 days)
> >> Score: 1.0
> >> Signature: null
> >> Metadata: _ngt_: 1405485810821
> >>
> >> Content::
> >> Version: -1
> >> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> >> contentType: text/plain
> >> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
> >> nutch.crawl.score=1.0 Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> >> _fst_=35 nutch.segment.name=20140716004332 Connection=close
> >> Content-Type=text/plain Server=III 150 MIME-version=1.0
> >> Content:
> >>
> >> CrawlDatum::
> >> Version: 7
> >> Status: 35 (fetch_redir_temp)
> >> Fetch time: Wed Jul 16 00:43:38 EDT 2014
> >> Modified time: Wed Dec 31 19:00:00 EST 1969
> >> Retries since fetch: 0
> >> Retry interval: 2592000 seconds (30 days)
> >> Score: 1.0
> >> Signature: null
> >> Metadata: _ngt_: 1405485810821Content-Type: text/plain_pst_:
> >> temp_moved(13), lastModified=0:
> >> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> >>
> >> Not sure why nutch didn’t fetch any content or parse any data or text when
> >> crawling the page! Did I miss setting some property? I am sure I have
> >> increased the redirect to 5. Using parsechecker, I was able to get the data
> >> and text parsed in two steps, so I think max redirect of 5 should be
> >> sufficient. Want to understand why parse checker works and crawl doesn’t.
> >>
> >> Thanks,
> >> Vijay
> >
> >
> >
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
