I think I got it. The redirect target URL was being rejected by regex-urlfilter.txt. I edited the filter and the content is now fetched fine.
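For anyone hitting the same thing: the stock regex-urlfilter.txt that ships with Nutch includes a rule rejecting any URL that contains query-string characters, and the redirect target here contains `?`, `=` and `%`, so that rule is a likely culprit. Below is a sketch of the kind of edit that lets such a redirect through. The rules shown are an assumption about a default filter file, not my actual diff:

```
# regex-urlfilter.txt (sketch, assuming the stock Nutch defaults)

# The default rule below rejects URLs containing query-string
# characters; the redirect target
# https://alpha2.latrobe.edu.au:443/validate?url=... trips on it.
# Either comment it out entirely:
# -[?*!@=]
# ...or whitelist the redirect host before any reject rule fires:
+^https?://alpha2\.latrobe\.edu\.au/

# accept anything else
+.
```

Rules are evaluated top to bottom and the first matching rule wins, so a whitelist line has to appear above any `-` rule that would otherwise reject the URL.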
Thanks,
Vijay

On Jul 18, 2014, at 3:13 PM, Vijay Chakilam <[email protected]> wrote:

> Any help would be great. I even tried to do it step by step: I first injected
> the url, generated the segment, fetched and parsed it. The readseg dump is
> the same. It doesn't have any content, data or text. The thing I am not able
> to understand is that some pages that have redirects are fetched, but some
> others are not. For example: http://rust.wikia.com/. This url has a redirect
> to http://rust.wikia.com/wiki/Rust_Wiki.
> The difference I see is in the "status". For http://rust.wikia.com/, the
> fetch status is "moved(12)", whereas for
> http://0-search.proquest.com.alpha2.latrobe.edu.au/, it is "temp_moved(13)".
>
> Below are the parsechecker outputs for both urls:
>
> runtime/local/bin/nutch parsechecker -dumpText http://rust.wikia.com/
> fetching: http://rust.wikia.com/
> Fetch failed with protocol status: moved(12), lastModified=0:
> http://rust.wikia.com/wiki/Rust_Wiki
>
> runtime/local/bin/nutch parsechecker -dumpText
> http://0-search.proquest.com.alpha2.latrobe.edu.au/
> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> Fetch failed with protocol status: temp_moved(13), lastModified=0:
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>
> Please help me understand the difference between moved(12) and temp_moved(13),
> and help me solve the problem so I am able to crawl such pages.
>
> Thanks,
> Vijay
>
> On Jul 17, 2014, at 6:32 PM, Vijay Chakilam <[email protected]> wrote:
>
>> Thanks for your answers Julien. I tried to use the crawl script, but I am
>> having the same problem. I have set http.redirect.max to 5 and the number
>> of rounds to 1. (I have also tried 2 rounds, but I guess that doesn't help
>> since I have already specified http.redirect.max to be 5, so it should
>> follow any redirects even with 1 round, right?)
>> Here's the new readseg dump:
>>
>> Recno:: 0
>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>
>> CrawlDatum::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Thu Jul 17 18:20:54 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata:
>> _ngt_=1405635658639
>>
>> Content::
>> Version: -1
>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> contentType: text/plain
>> metadata: Date=Thu, 17 Jul 2014 22:21:08 GMT Vary=User-Agent
>> nutch.crawl.score=1.0
>> Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> _fst_=35 nutch.segment.name=20140717182101 Connection=close
>> Content-Type=text/plain Server=III 150 MIME-version=1.0
>> Content:
>>
>> CrawlDatum::
>> Version: 7
>> Status: 35 (fetch_redir_temp)
>> Fetch time: Thu Jul 17 18:21:08 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata:
>> _ngt_=1405635658639
>> Content-Type=text/plain
>> _pst_=temp_moved(13), lastModified=0:
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> _rs_=468
>>
>> Thanks,
>> Vijay
>>
>> On Jul 17, 2014, at 5:13 PM, Julien Nioche <[email protected]>
>> wrote:
>>
>>> Hi
>>>
>>> On 17 July 2014 22:04, Vijay Chakilam <[email protected]> wrote:
>>>
>>>> Thanks for your reply Julien. I am not doing any indexing and I don't
>>>> have a solr url. It looks like the crawl script requires me to specify
>>>> a solr url. How do I run the crawl script without specifying a solr url?
>>>
>>> Just comment out the commands related to SOLR in the script and pass it a
>>> dummy parameter for the SOLR url.
>>>
>>>> Also, I want to crawl just the webpage I specify: a depth of 1.
>>>> I don't want to fetch any outlinks.
>>>
>>> That can be done by setting db.update.additions.allowed to false in
>>> nutch-site.xml. No new URLs will be added to the crawldb.
>>>
>>>> How does the number of rounds relate to depth? Are they the same?
>>>
>>> No. They would be the same if there were no redirections and if you were
>>> putting all unfetched URLs in the segments. If there are more unfetched
>>> URLs in the crawldb than you are putting in the segments, then you'll
>>> definitely need several iterations.
>>>
>>>> If so, what value should I specify for the number of rounds to fetch
>>>> just the page I specify and also take care of the redirects? Are
>>>> http.redirect.max and the number of rounds related?
>>>
>>> Set http.redirect.max to a value > 0 so that the redirection gets tried
>>> within the same fetch step (i.e. same round).
>>>
>>> HTH
>>>
>>> Julien
>>>
>>>> Thanks,
>>>> Vijay
>>>>
>>>> On Jul 17, 2014, at 4:42 PM, Julien Nioche <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The crawl command is deprecated; use the crawl script instead and give
>>>>> it a number of rounds > 1 so that it has a chance to fetch the
>>>>> redirection.
>>>>>
>>>>> J.
>>>>>
>>>>> On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to crawl the page at:
>>>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>>
>>>>>> Here's the parsechecker output.
>>>>>>
>>>>>> runtime/local/bin/nutch parsechecker -dumpText
>>>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>> Fetch failed with protocol status: temp_moved(13), lastModified=0:
>>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>>
>>>>>> This looks like a redirection, so I ran parsechecker again for
>>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>>
>>>>>> The fetching and parsing were successful this time. I have set
>>>>>> http.redirect.max to 5 and tried to crawl using nutch crawl:
>>>>>>
>>>>>> bin/nutch crawl testurl -depth 1
>>>>>>
>>>>>> and did a readseg on the above crawl. Here's the readseg dump:
>>>>>>
>>>>>> Recno:: 0
>>>>>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>>
>>>>>> CrawlDatum::
>>>>>> Version: 7
>>>>>> Status: 1 (db_unfetched)
>>>>>> Fetch time: Wed Jul 16 00:43:28 EDT 2014
>>>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>>>> Retries since fetch: 0
>>>>>> Retry interval: 2592000 seconds (30 days)
>>>>>> Score: 1.0
>>>>>> Signature: null
>>>>>> Metadata: _ngt_: 1405485810821
>>>>>>
>>>>>> Content::
>>>>>> Version: -1
>>>>>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>> contentType: text/plain
>>>>>> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
>>>>>> nutch.crawl.score=1.0
>>>>>> Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>> _fst_=35 nutch.segment.name=20140716004332 Connection=close
>>>>>> Content-Type=text/plain Server=III 150 MIME-version=1.0
>>>>>> Content:
>>>>>>
>>>>>> CrawlDatum::
>>>>>> Version: 7
>>>>>> Status: 35 (fetch_redir_temp)
>>>>>> Fetch time: Wed Jul 16 00:43:38 EDT 2014
>>>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>>>> Retries since fetch: 0
>>>>>> Retry interval: 2592000 seconds (30 days)
>>>>>> Score: 1.0
>>>>>> Signature: null
>>>>>> Metadata: _ngt_: 1405485810821 Content-Type: text/plain _pst_:
>>>>>> temp_moved(13), lastModified=0:
>>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>>
>>>>>> Not sure why nutch didn't fetch any content or parse any data or text
>>>>>> when crawling the page! Did I miss setting some property? I am sure I
>>>>>> have increased the redirect maximum to 5. Using parsechecker, I was
>>>>>> able to get the data and text parsed in two steps, so I think a max
>>>>>> redirect of 5 should be sufficient. I want to understand why
>>>>>> parsechecker works and crawl doesn't.
>>>>>>
>>>>>> Thanks,
>>>>>> Vijay
>>>>>
>>>>> --
>>>>>
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>>> http://twitter.com/digitalpebble
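Pulling together the advice from the thread above, a nutch-site.xml fragment along these lines should reproduce the settings Julien suggests. The property names are the standard Nutch 1.x ones; the values are just the ones discussed here, not a general recommendation:

```xml
<!-- nutch-site.xml: sketch of the settings discussed in this thread -->
<property>
  <name>http.redirect.max</name>
  <value>5</value>
  <description>Follow up to 5 redirects within the same fetch step,
  so a moved(12) or temp_moved(13) target is fetched in one round.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>Do not add outlinks to the crawldb; only the injected
  seed URLs are ever fetched (effectively depth 1).</description>
</property>
```

With the Solr steps commented out of the crawl script, an invocation such as `bin/crawl urls/ crawl/ http://localhost:8983/solr 1` (seed dir, crawl dir, dummy Solr URL, number of rounds, per the 1.x script's argument order) should then fetch the redirect target within a single round. The Solr URL here is, as Julien says, just a placeholder.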

