Any help would be great. I even tried doing it step by step: I first injected the URL, generated the segment, then fetched and parsed it. The readseg dump is the same; it doesn't have any content, data or text. The thing I am not able to understand is that some pages that have redirects are fetched, but some others are not. For example: http://rust.wikia.com/. This URL has a redirect to http://rust.wikia.com/wiki/Rust_Wiki. The difference I see is in the "status": for http://rust.wikia.com/ the fetch status is "moved(12)", whereas for http://0-search.proquest.com.alpha2.latrobe.edu.au/ it is "temp_moved(13)".
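For context: Nutch's moved(12) corresponds to an HTTP 301 (permanent) redirect and temp_moved(13) to an HTTP 302 (temporary) redirect, so the difference comes from the server's response, not from anything Nutch decides. A quick way to see which status a server sends is to look at the first response line with curl. Below is a minimal sketch, assuming curl and python3 are available; the local server and port 8765 are illustrative stand-ins for the two real sites:

```shell
# Tiny local server: /perm answers with 301, anything else with 302,
# standing in for the two sites discussed above.
python3 - <<'EOF' &
from http.server import BaseHTTPRequestHandler, HTTPServer

class Redirector(BaseHTTPRequestHandler):
    def do_HEAD(self):
        # 301 = "moved(12)" in Nutch terms, 302 = "temp_moved(13)"
        self.send_response(301 if self.path == '/perm' else 302)
        self.send_header('Location', 'http://example.com/target')
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the test output quiet

HTTPServer(('127.0.0.1', 8765), Redirector).serve_forever()
EOF
SERVER_PID=$!
sleep 1

# curl -I sends a HEAD request and prints only the response headers;
# the first line carries the status code.
perm_status=$(curl -sI http://127.0.0.1:8765/perm | head -n 1)
temp_status=$(curl -sI http://127.0.0.1:8765/other | head -n 1)
echo "$perm_status"   # 301 -> Nutch reports moved(12)
echo "$temp_status"   # 302 -> Nutch reports temp_moved(13)

kill "$SERVER_PID"
```

Running `curl -sI` against the real URLs should show the same split: a 301 from rust.wikia.com and a 302 from the proquest proxy.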
Below are the parsechecker outputs for both of the URLs:

runtime/local/bin/nutch parsechecker -dumpText http://rust.wikia.com/
fetching: http://rust.wikia.com/
Fetch failed with protocol status: moved(12), lastModified=0: http://rust.wikia.com/wiki/Rust_Wiki

runtime/local/bin/nutch parsechecker -dumpText http://0-search.proquest.com.alpha2.latrobe.edu.au/
fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
Fetch failed with protocol status: temp_moved(13), lastModified=0: https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F

Please help me understand the difference between moved(12) and temp_moved(13), and help me solve the problem so that I can crawl such pages.

Thanks,
Vijay

On Jul 17, 2014, at 6:32 PM, Vijay Chakilam <[email protected]> wrote:

> Thanks for your answers Julien. I tried to use the crawl script, but I am
> having the same problem. I have set redirect.max to 5 and the number of rounds
> to 1. (I have also tried 2 rounds, but I guess that doesn't help since I have
> already specified redirect.max to be 5, so it should follow any redirects
> even with 1 round, right?)
> Here's the new readseg dump:
>
> Recno:: 0
> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Thu Jul 17 18:20:54 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata:
>  _ngt_=1405635658639
>
> Content::
> Version: -1
> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> contentType: text/plain
> metadata: Date=Thu, 17 Jul 2014 22:21:08 GMT Vary=User-Agent
>  nutch.crawl.score=1.0
>  Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>  _fst_=35 nutch.segment.name=20140717182101 Connection=close
>  Content-Type=text/plain Server=III 150 MIME-version=1.0
> Content:
>
> CrawlDatum::
> Version: 7
> Status: 35 (fetch_redir_temp)
> Fetch time: Thu Jul 17 18:21:08 EDT 2014
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata:
>  _ngt_=1405635658639
>  Content-Type=text/plain
>  _pst_=temp_moved(13), lastModified=0: https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>  _rs_=468
>
> Thanks,
> Vijay
>
> On Jul 17, 2014, at 5:13 PM, Julien Nioche <[email protected]> wrote:
>
>> Hi
>>
>> On 17 July 2014 22:04, Vijay Chakilam <[email protected]> wrote:
>>
>>> Thanks for your reply Julien. I am not doing any indexing and I don't have
>>> a Solr URL. It looks like the crawl script requires me to specify a Solr URL.
>>> How do I run the crawl script without specifying a Solr URL?
>>
>> Just comment out the commands related to SOLR in the script and pass it a
>> dummy parameter for the SOLR url.
>>
>> Also, I want to crawl just the webpage I specify: a depth of 1.
>>
>> I don't want to fetch any outlinks.
>>
>> That can be done by setting db.update.additions.allowed to false in
>> nutch-site.xml. No new URLs will be added to the crawldb.
>>
>>> How does the number of rounds relate to depth? Are they the same?
>>
>> No. They would be the same if there were no redirections and if you were
>> putting all unfetched URLs in the segments. If there are more unfetched
>> URLs in the crawldb than you are putting in the segments, then you'll
>> definitely need several iterations.
>>
>>> If so, what value should I specify for the number of rounds to fetch just
>>> the page I specify and also take care of the redirects? Are http.redirect.max
>>> and the number of rounds related?
>>
>> Set http.redirect.max to a value > 0 so that the redirection gets tried
>> within the same fetch step (i.e. same round).
>>
>> HTH
>>
>> Julien
>>
>>> Thanks,
>>> Vijay
>>>
>>> On Jul 17, 2014, at 4:42 PM, Julien Nioche <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> The crawl command is deprecated; use the crawl script instead and give
>>>> it a number of rounds > 1 so that it has a chance to fetch the redirection.
>>>>
>>>> J.
>>>>
>>>> On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to crawl the page at:
>>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>
>>>>> Here's the parsechecker output.
>>>>> runtime/local/bin/nutch parsechecker -dumpText
>>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>> Fetch failed with protocol status: temp_moved(13), lastModified=0:
>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>
>>>>> This looks like a redirection, so I ran parsechecker again for
>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>
>>>>> The fetching and parsing were successful this time. I have set
>>>>> http.redirect.max to 5 and tried to crawl using nutch crawl:
>>>>>
>>>>> bin/nutch crawl testurl -depth 1
>>>>>
>>>>> and did a readseg on the above crawl. Here's the readseg dump:
>>>>>
>>>>> Recno:: 0
>>>>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>
>>>>> CrawlDatum::
>>>>> Version: 7
>>>>> Status: 1 (db_unfetched)
>>>>> Fetch time: Wed Jul 16 00:43:28 EDT 2014
>>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>>> Retries since fetch: 0
>>>>> Retry interval: 2592000 seconds (30 days)
>>>>> Score: 1.0
>>>>> Signature: null
>>>>> Metadata: _ngt_: 1405485810821
>>>>>
>>>>> Content::
>>>>> Version: -1
>>>>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>> contentType: text/plain
>>>>> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
>>>>>  nutch.crawl.score=1.0
>>>>>  Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>  _fst_=35 nutch.segment.name=20140716004332 Connection=close
>>>>>  Content-Type=text/plain Server=III 150 MIME-version=1.0
>>>>> Content:
>>>>>
>>>>> CrawlDatum::
>>>>> Version: 7
>>>>> Status: 35 (fetch_redir_temp)
>>>>> Fetch time: Wed Jul 16 00:43:38 EDT 2014
>>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>>> Retries since fetch: 0
>>>>> Retry interval: 2592000 seconds (30 days)
>>>>> Score: 1.0
>>>>> Signature: null
>>>>> Metadata: _ngt_: 1405485810821
>>>>>  Content-Type: text/plain
>>>>>  _pst_: temp_moved(13), lastModified=0:
>>>>>  https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>
>>>>> I am not sure why Nutch didn't fetch any content or parse any data or
>>>>> text when crawling the page! Did I miss setting some property? I am sure
>>>>> I have increased the redirect limit to 5. Using parsechecker, I was able
>>>>> to get the data and text parsed in two steps, so I think a max redirect
>>>>> of 5 should be sufficient. I want to understand why parsechecker works
>>>>> and crawl doesn't.
>>>>>
>>>>> Thanks,
>>>>> Vijay
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
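For anyone finding this thread later: the two properties Julien recommends above would go into conf/nutch-site.xml roughly like this. This is a sketch for the setup discussed in the thread (redirects followed in-round, no outlinks added), not the Nutch defaults, and the description text is paraphrased rather than copied from nutch-default.xml:

```xml
<!-- conf/nutch-site.xml (fragment) -->
<property>
  <name>http.redirect.max</name>
  <value>5</value>
  <description>Follow up to 5 redirects within the same fetch round.
  With the default of 0 the fetcher records the redirect target for a
  later round instead, which is why a 1-round crawl stops at
  temp_moved(13).</description>
</property>

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>Do not add outlinks to the crawldb, so only the injected
  URLs are ever fetched (effectively a depth-1 crawl).</description>
</property>
```

With both set, a single round of the crawl script should fetch the injected page and its redirect target without pulling in any outlinks.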

