Any help would be great. I even tried doing it step by step: I first injected the URL, generated the segment, then fetched and parsed it. The readseg dump is the same; it doesn't have any content, data or text. The thing I am not able to understand is that some pages that have redirects are fetched, but some others are not. For example: http://rust.wikia.com/. This URL has a redirect to http://rust.wikia.com/wiki/Rust_Wiki. The difference I see is in the "status": for http://rust.wikia.com/ the fetch status is "moved(12)", whereas for http://0-search.proquest.com.alpha2.latrobe.edu.au/ it is "temp_moved(13)".
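For context: Nutch's moved(12) corresponds to an HTTP 301 (permanent) redirect and temp_moved(13) to an HTTP 302 (temporary) redirect, so the difference comes from the server's response, not from anything Nutch decides. A quick way to see which status a server sends is to look at the first response line with curl. Below is a minimal sketch, assuming curl and python3 are available; the local server and port 8765 are illustrative stand-ins for the two real sites:

```shell
# Tiny local server: /perm answers with 301, anything else with 302,
# standing in for the two sites discussed above.
python3 - <<'EOF' &
from http.server import BaseHTTPRequestHandler, HTTPServer

class Redirector(BaseHTTPRequestHandler):
    def do_HEAD(self):
        # 301 = "moved(12)" in Nutch terms, 302 = "temp_moved(13)"
        self.send_response(301 if self.path == '/perm' else 302)
        self.send_header('Location', 'http://example.com/target')
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the test output quiet

HTTPServer(('127.0.0.1', 8765), Redirector).serve_forever()
EOF
SERVER_PID=$!
sleep 1

# curl -I sends a HEAD request and prints only the response headers;
# the first line carries the status code.
perm_status=$(curl -sI http://127.0.0.1:8765/perm | head -n 1)
temp_status=$(curl -sI http://127.0.0.1:8765/other | head -n 1)
echo "$perm_status"   # 301 -> Nutch reports moved(12)
echo "$temp_status"   # 302 -> Nutch reports temp_moved(13)

kill "$SERVER_PID"
```

Running `curl -sI` against the real URLs should show the same split: a 301 from rust.wikia.com and a 302 from the proquest proxy.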
Below are the parsechecker outputs for both of the URLs:

runtime/local/bin/nutch parsechecker -dumpText http://rust.wikia.com/
fetching: http://rust.wikia.com/
Fetch failed with protocol status: moved(12), lastModified=0: http://rust.wikia.com/wiki/Rust_Wiki

runtime/local/bin/nutch parsechecker -dumpText http://0-search.proquest.com.alpha2.latrobe.edu.au/
fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
Fetch failed with protocol status: temp_moved(13), lastModified=0: https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F

Please help me understand the difference between moved(12) and temp_moved(13), and help me solve the problem so that I can crawl such pages.

Thanks,
Vijay

On Jul 17, 2014, at 6:32 PM, Vijay Chakilam <[email protected]> wrote:

> Thanks for your answers Julien. I tried to use the crawl script, but I am
> having the same problem. I have set redirect.max to 5 and the number of rounds
> to 1. (I have also tried 2 rounds, but I guess that doesn't help since I have
> already specified redirect.max to be 5, so it should follow any redirects
> even with 1 round, right?)
> Here's the new readseg dump:
>
> Recno:: 0
> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Jul 17 18:20:54 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata:
>  _ngt_=1405635658639
>
> Content::
> Version: -1
> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> contentType: text/plain
> metadata: Date=Thu, 17 Jul 2014 22:21:08 GMT Vary=User-Agent
>  nutch.crawl.score=1.0
>  Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>  _fst_=35 nutch.segment.name=20140717182101 Connection=close
>  Content-Type=text/plain Server=III 150 MIME-version=1.0
> Content:
>
> CrawlDatum::
> Version: 7
> Status: 35 (fetch_redir_temp)
> Fetch time: Thu Jul 17 18:21:08 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata:
>  _ngt_=1405635658639
>  Content-Type=text/plain
>  _pst_=temp_moved(13), lastModified=0: https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>  _rs_=468
>
> Thanks,
> Vijay
>
> On Jul 17, 2014, at 5:13 PM, Julien Nioche <[email protected]> wrote:
>
>> Hi
>>
>> On 17 July 2014 22:04, Vijay Chakilam <[email protected]> wrote:
>>
>>> Thanks for your reply Julien. I am not doing any indexing and I don't have
>>> a Solr URL. It looks like the crawl script requires me to specify a Solr URL.
>>> How do I run the crawl script without specifying a Solr URL?
>>
>> Just comment out the commands related to SOLR in the script and pass it a
>> dummy parameter for the SOLR url.
>>
>> Also, I want to crawl just the webpage I specify: a depth of 1.
>>
>> I don't want to fetch any outlinks.
>>
>> That can be done by setting db.update.additions.allowed to false in
>> nutch-site.xml. No new URLs will be added to the crawldb.
>>
>>> How does the number of rounds relate to depth? Are they the same?
>>
>> No. They would be the same if there were no redirections and if you were
>> putting all unfetched URLs in the segments. If there are more unfetched
>> URLs in the crawldb than you are putting in the segments, then you'll
>> definitely need several iterations.
>>
>>> If so, what value should I specify for the number of rounds to fetch just
>>> the page I specify and also take care of the redirects? Are http.redirect.max
>>> and the number of rounds related?
>>
>> Set http.redirect.max to a value > 0 so that the redirection gets tried
>> within the same fetch step (i.e. same round).
>>
>> HTH
>>
>> Julien
>>
>>> Thanks,
>>> Vijay
>>>
>>> On Jul 17, 2014, at 4:42 PM, Julien Nioche <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> The crawl command is deprecated; use the crawl script instead and give
>>>> it a number of rounds > 1 so that it has a chance to fetch the redirection.
>>>>
>>>> J.
>>>>
>>>> On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to crawl the page at:
>>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>
>>>>> Here's the parsechecker output.
>>>>> runtime/local/bin/nutch parsechecker -dumpText
>>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>> Fetch failed with protocol status: temp_moved(13), lastModified=0:
>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>
>>>>> This looks like a redirection, so I ran parsechecker again for
>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>
>>>>> The fetching and parsing were successful this time. I have set
>>>>> http.redirect.max to 5 and tried to crawl using nutch crawl:
>>>>>
>>>>> bin/nutch crawl testurl -depth 1
>>>>>
>>>>> and did a readseg on the above crawl. Here's the readseg dump:
>>>>>
>>>>> Recno:: 0
>>>>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>
>>>>> CrawlDatum::
>>>>> Version: 7
>>>>> Status: 1 (db_unfetched)
>>>>> Fetch time: Wed Jul 16 00:43:28 EDT 2014
>>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>>> Retries since fetch: 0
>>>>> Retry interval: 2592000 seconds (30 days)
>>>>> Score: 1.0
>>>>> Signature: null
>>>>> Metadata: _ngt_: 1405485810821
>>>>>
>>>>> Content::
>>>>> Version: -1
>>>>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>> contentType: text/plain
>>>>> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
>>>>>  nutch.crawl.score=1.0
>>>>>  Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>  _fst_=35 nutch.segment.name=20140716004332 Connection=close
>>>>>  Content-Type=text/plain Server=III 150 MIME-version=1.0
>>>>> Content:
>>>>>
>>>>> CrawlDatum::
>>>>> Version: 7
>>>>> Status: 35 (fetch_redir_temp)
>>>>> Fetch time: Wed Jul 16 00:43:38 EDT 2014
>>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>>> Retries since fetch: 0
>>>>> Retry interval: 2592000 seconds (30 days)
>>>>> Score: 1.0
>>>>> Signature: null
>>>>> Metadata: _ngt_: 1405485810821
>>>>>  Content-Type: text/plain
>>>>>  _pst_: temp_moved(13), lastModified=0:
>>>>>  https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>
>>>>> I am not sure why Nutch didn't fetch any content or parse any data or
>>>>> text when crawling the page! Did I miss setting some property? I am sure
>>>>> I have increased the redirect limit to 5. Using parsechecker, I was able
>>>>> to get the data and text parsed in two steps, so I think a max redirect
>>>>> of 5 should be sufficient. I want to understand why parsechecker works
>>>>> and crawl doesn't.
>>>>>
>>>>> Thanks,
>>>>> Vijay
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
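For anyone finding this thread later: the two properties Julien recommends above would go into conf/nutch-site.xml roughly like this. This is a sketch for the setup discussed in the thread (redirects followed in-round, no outlinks added), not the Nutch defaults, and the description text is paraphrased rather than copied from nutch-default.xml:

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>http.redirect.max</name>
  <value>5</value>
  <description>Follow up to 5 redirects within the same fetch round.
  With the default of 0 the fetcher records the redirect target for a
  later round instead, which is why a 1-round crawl stops at
  temp_moved(13).</description>
</property>

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>Do not add outlinks to the crawldb, so only the injected
  URLs are ever fetched (effectively a depth-1 crawl).</description>
</property>
```

With both set, a single round of the crawl script should fetch the injected page and its redirect target without pulling in any outlinks.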

