Hi,

The crawl command is deprecated. Use the crawl script instead and give it a
number of rounds > 1 so that it has a chance to fetch the redirection.
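
To illustrate why a single round stops at the redirect, here is a minimal
toy model of a round-based crawl. It is not Nutch code: the URLs, status
names and the crawl() function are all hypothetical, and it assumes the
fetcher records a redirect target for a later round instead of following it
immediately.

```python
# Toy model of a round-based crawl (generate -> fetch -> update).
# NOT Nutch code; every name and URL below is a hypothetical stand-in.

# Assumed web: one seed URL that redirects, and the page it redirects to.
REDIRECTS = {"http://example.org/start": "https://example.org/validate"}
PAGES = {"https://example.org/validate": "login page text"}


def crawl(seeds, rounds):
    """Run `rounds` crawl rounds and return (db, content)."""
    db = {url: "unfetched" for url in seeds}
    content = {}
    for _ in range(rounds):
        # generate: select everything still unfetched
        batch = [url for url, status in db.items() if status == "unfetched"]
        for url in batch:
            if url in REDIRECTS:
                # fetch: only the redirect is recorded, no content yet
                db[url] = "redir_temp"
                # update: queue the target so a later round can fetch it
                db.setdefault(REDIRECTS[url], "unfetched")
            elif url in PAGES:
                db[url] = "fetched"
                content[url] = PAGES[url]
            else:
                db[url] = "gone"
    return db, content


if __name__ == "__main__":
    for n in (1, 2):
        db, content = crawl(["http://example.org/start"], rounds=n)
        print(n, "round(s):", db, content)
```

In this model, one round leaves the seed in a redirect status with no
content, much like the readseg dump quoted below; the redirect target is
only fetched in the second round.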

J.


On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:

> Hi,
>
> I am trying to crawl the page at: "
> http://0-search.proquest.com.alpha2.latrobe.edu.au/"
>
> Here’s the parse checker output.
>
> runtime/local/bin/nutch parsechecker -dumpText
> http://0-search.proquest.com.alpha2.latrobe.edu.au/
> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> Fetch failed with protocol status: temp_moved(13), lastModified=0:
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>
> Looks like a redirection and I did a parsechecker again for "
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> "
>
> Fetching and parsing were successful this time. I have set
> http.redirect.max to 5 and tried to crawl using nutch crawl:
>
> bin/nutch crawl testurl -depth 1
>
> and did a readseg on the above crawl. Here’s the readseg dump:
>
> Recno:: 0
> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jul 16 00:43:28 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1405485810821
>
> Content::
> Version: -1
> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> contentType: text/plain
> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
> nutch.crawl.score=1.0 Location=
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
> _fst_=35 nutch.segment.name=20140716004332 Connection=close
> Content-Type=text/plain Server=III 150 MIME-version=1.0
> Content:
>
> CrawlDatum::
> Version: 7
> Status: 35 (fetch_redir_temp)
> Fetch time: Wed Jul 16 00:43:38 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: _ngt_: 1405485810821Content-Type: text/plain_pst_:
> temp_moved(13), lastModified=0:
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>
> I am not sure why Nutch didn’t fetch any content or parse any data or text
> when crawling the page. Did I miss setting some property? I am sure I have
> increased http.redirect.max to 5. Using parsechecker, I was able to get the
> data and text parsed in two steps, so I think a maximum of 5 redirects should
> be sufficient. I want to understand why parsechecker works and crawl doesn’t.
>
> Thanks,
> Vijay




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
