Thanks for your reply, Julien. I am not doing any indexing and I don’t have a 
Solr URL, but it looks like the crawl script requires me to specify one. How do I 
run the crawl script without specifying a Solr URL? Also, I want to crawl just the 
web page I specify, i.e. a depth of 1; I don’t want to fetch any outlinks. How does 
the number of rounds relate to depth? Are they the same? If so, what value should I 
specify for the number of rounds to fetch just the page I specify while still 
following the redirects? Are http.redirect.max and the number of rounds related?

Thanks,
Vijay

On Jul 17, 2014, at 4:42 PM, Julien Nioche <[email protected]> 
wrote:

> Hi,
> 
> The crawl command is deprecated; use the crawl script instead and give it a
> number of rounds > 1 so that it has a chance to fetch the redirection.
> 
> J.
> 
> 
> On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:
> 
>> Hi,
>> 
>> I am trying to crawl the page at:
>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> 
>> Here’s the parsechecker output:
>> 
>> runtime/local/bin/nutch parsechecker -dumpText
>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> Fetch failed with protocol status: temp_moved(13), lastModified=0:
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> 
>> This looks like a redirection, so I ran parsechecker again on:
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> 
>> Fetching and parsing were successful this time. I have set
>> http.redirect.max to 5 and tried to crawl using nutch crawl:
>> 
>> bin/nutch crawl testurl -depth 1
>> 
>> and then ran readseg on the above crawl. Here’s the readseg dump:
>> 
>> Recno:: 0
>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> 
>> CrawlDatum::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Wed Jul 16 00:43:28 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1405485810821
>> 
>> Content::
>> Version: -1
>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> contentType: text/plain
>> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
>> nutch.crawl.score=1.0 Location=
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> _fst_=35 nutch.segment.name=20140716004332 Connection=close
>> Content-Type=text/plain Server=III 150 MIME-version=1.0
>> Content:
>> 
>> CrawlDatum::
>> Version: 7
>> Status: 35 (fetch_redir_temp)
>> Fetch time: Wed Jul 16 00:43:38 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_: 1405485810821Content-Type: text/plain_pst_:
>> temp_moved(13), lastModified=0:
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> 
>> I am not sure why nutch didn’t fetch any content or parse any data or text
>> when crawling the page. Did I miss setting some property? I am sure I have
>> increased http.redirect.max to 5. Using parsechecker, I was able to get the
>> data and text parsed in two steps, so I think a redirect limit of 5 should be
>> sufficient. I want to understand why parsechecker works and crawl doesn’t.
>> 
>> Thanks,
>> Vijay
> 
> 
> 
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
