I think I got it. The redirect target URL was being rejected by regex-urlfilter.txt. I edited the filter and the content is now fetched fine.
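For anyone hitting the same thing: the stock regex-urlfilter.txt that ships with Nutch includes a rule rejecting any URL that contains query-string characters, and the redirect target here contains `?`, `=` and `%`, so that rule is a likely culprit. Below is a sketch of the kind of edit that lets such a redirect through. The rules shown are an assumption about a default filter file, not my actual diff:

```
# regex-urlfilter.txt (sketch, assuming the stock Nutch defaults)

# The default rule below rejects URLs containing query-string
# characters; the redirect target
# https://alpha2.latrobe.edu.au:443/validate?url=... trips on it.
# Either comment it out entirely:
# -[?*!@=]
# ...or whitelist the redirect host before any reject rule fires:
+^https?://alpha2\.latrobe\.edu\.au/

# accept anything else
+.
```

Rules are evaluated top to bottom and the first matching rule wins, so a whitelist line has to appear above any `-` rule that would otherwise reject the URL.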
Thanks,
Vijay

On Jul 18, 2014, at 3:13 PM, Vijay Chakilam <[email protected]> wrote:

> Any help would be great. I even tried to do it step by step: I first injected
> the url, generated the segment, fetched and parsed it. The readseg dump is
> the same. It doesn't have any content, data or text. The thing I am not able
> to understand is that some pages that have redirects are fetched, but some
> others are not. For example: http://rust.wikia.com/. This url has a redirect
> to http://rust.wikia.com/wiki/Rust_Wiki.
> The difference I see is in the "status". For http://rust.wikia.com/, the
> fetch status is "moved(12)", whereas for
> http://0-search.proquest.com.alpha2.latrobe.edu.au/, it is "temp_moved(13)".
>
> Below are the parsechecker outputs for both urls:
>
> runtime/local/bin/nutch parsechecker -dumpText http://rust.wikia.com/
> fetching: http://rust.wikia.com/
> Fetch failed with protocol status: moved(12), lastModified=0:
> http://rust.wikia.com/wiki/Rust_Wiki
>
> runtime/local/bin/nutch parsechecker -dumpText
> http://0-search.proquest.com.alpha2.latrobe.edu.au/
> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
> Fetch failed with protocol status: temp_moved(13), lastModified=0:
> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>
> Please help me understand the difference between moved(12) and temp_moved(13),
> and help me solve the problem so I am able to crawl such pages.
>
> Thanks,
> Vijay
>
> On Jul 17, 2014, at 6:32 PM, Vijay Chakilam <[email protected]> wrote:
>
>> Thanks for your answers Julien. I tried to use the crawl script, but I am
>> having the same problem. I have set http.redirect.max to 5 and the number
>> of rounds to 1. (I have also tried 2 rounds, but I guess that doesn't help
>> since I have already specified http.redirect.max to be 5, so it should
>> follow any redirects even with 1 round, right?)
>> Here's the new readseg dump:
>>
>> Recno:: 0
>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>
>> CrawlDatum::
>> Version: 7
>> Status: 1 (db_unfetched)
>> Fetch time: Thu Jul 17 18:20:54 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata:
>> _ngt_=1405635658639
>>
>> Content::
>> Version: -1
>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>> contentType: text/plain
>> metadata: Date=Thu, 17 Jul 2014 22:21:08 GMT Vary=User-Agent
>> nutch.crawl.score=1.0
>> Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> _fst_=35 nutch.segment.name=20140717182101 Connection=close
>> Content-Type=text/plain Server=III 150 MIME-version=1.0
>> Content:
>>
>> CrawlDatum::
>> Version: 7
>> Status: 35 (fetch_redir_temp)
>> Fetch time: Thu Jul 17 18:21:08 EDT 2014
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 2592000 seconds (30 days)
>> Score: 1.0
>> Signature: null
>> Metadata:
>> _ngt_=1405635658639
>> Content-Type=text/plain
>> _pst_=temp_moved(13), lastModified=0:
>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>> _rs_=468
>>
>> Thanks,
>> Vijay
>>
>> On Jul 17, 2014, at 5:13 PM, Julien Nioche <[email protected]>
>> wrote:
>>
>>> Hi
>>>
>>> On 17 July 2014 22:04, Vijay Chakilam <[email protected]> wrote:
>>>
>>>> Thanks for your reply Julien. I am not doing any indexing and I don't
>>>> have a solr url. It looks like the crawl script requires me to specify
>>>> a solr url. How do I run the crawl script without specifying a solr url?
>>>
>>> Just comment out the commands related to SOLR in the script and pass it a
>>> dummy parameter for the SOLR url.
>>>
>>>> Also, I want to crawl just the webpage I specify: a depth of 1.
>>>> I don't want to fetch any outlinks.
>>>
>>> That can be done by setting db.update.additions.allowed to false in
>>> nutch-site.xml. No new URLs will be added to the crawldb.
>>>
>>>> How does the number of rounds relate to depth? Are they the same?
>>>
>>> No. They would be the same if there were no redirections and if you were
>>> putting all unfetched URLs in the segments. If there are more unfetched
>>> URLs in the crawldb than you are putting in the segments, then you'll
>>> definitely need several iterations.
>>>
>>>> If so, what value should I specify for the number of rounds to fetch
>>>> just the page I specify and also take care of the redirects? Are
>>>> http.redirect.max and the number of rounds related?
>>>
>>> Set http.redirect.max to a value > 0 so that the redirection gets tried
>>> within the same fetch step (i.e. same round).
>>>
>>> HTH
>>>
>>> Julien
>>>
>>>> Thanks,
>>>> Vijay
>>>>
>>>> On Jul 17, 2014, at 4:42 PM, Julien Nioche <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The crawl command is deprecated; use the crawl script instead and give
>>>>> it a number of rounds > 1 so that it has a chance to fetch the
>>>>> redirection.
>>>>>
>>>>> J.
>>>>>
>>>>> On 17 July 2014 21:10, Vijay Chakilam <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to crawl the page at:
>>>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>>
>>>>>> Here's the parsechecker output.
>>>>>>
>>>>>> runtime/local/bin/nutch parsechecker -dumpText
>>>>>> http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>> fetching: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>> Fetch failed with protocol status: temp_moved(13), lastModified=0:
>>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>>
>>>>>> This looks like a redirection, so I ran parsechecker again for
>>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>>
>>>>>> The fetching and parsing were successful this time. I have set
>>>>>> http.redirect.max to 5 and tried to crawl using nutch crawl:
>>>>>>
>>>>>> bin/nutch crawl testurl -depth 1
>>>>>>
>>>>>> and did a readseg on the above crawl. Here's the readseg dump:
>>>>>>
>>>>>> Recno:: 0
>>>>>> URL:: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>>
>>>>>> CrawlDatum::
>>>>>> Version: 7
>>>>>> Status: 1 (db_unfetched)
>>>>>> Fetch time: Wed Jul 16 00:43:28 EDT 2014
>>>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>>>> Retries since fetch: 0
>>>>>> Retry interval: 2592000 seconds (30 days)
>>>>>> Score: 1.0
>>>>>> Signature: null
>>>>>> Metadata: _ngt_: 1405485810821
>>>>>>
>>>>>> Content::
>>>>>> Version: -1
>>>>>> url: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>> base: http://0-search.proquest.com.alpha2.latrobe.edu.au/
>>>>>> contentType: text/plain
>>>>>> metadata: Date=Wed, 16 Jul 2014 04:43:37 GMT Vary=User-Agent
>>>>>> nutch.crawl.score=1.0
>>>>>> Location=https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>> _fst_=35 nutch.segment.name=20140716004332 Connection=close
>>>>>> Content-Type=text/plain Server=III 150 MIME-version=1.0
>>>>>> Content:
>>>>>>
>>>>>> CrawlDatum::
>>>>>> Version: 7
>>>>>> Status: 35 (fetch_redir_temp)
>>>>>> Fetch time: Wed Jul 16 00:43:38 EDT 2014
>>>>>> Modified time: Wed Dec 31 19:00:00 EST 1969
>>>>>> Retries since fetch: 0
>>>>>> Retry interval: 2592000 seconds (30 days)
>>>>>> Score: 1.0
>>>>>> Signature: null
>>>>>> Metadata: _ngt_: 1405485810821 Content-Type: text/plain _pst_:
>>>>>> temp_moved(13), lastModified=0:
>>>>>> https://alpha2.latrobe.edu.au:443/validate?url=http%3A%2F%2F0-search.proquest.com.alpha2.latrobe.edu.au%3A80%2F
>>>>>>
>>>>>> Not sure why nutch didn't fetch any content or parse any data or text
>>>>>> when crawling the page! Did I miss setting some property? I am sure I
>>>>>> have increased the redirect maximum to 5. Using parsechecker, I was
>>>>>> able to get the data and text parsed in two steps, so I think a max
>>>>>> redirect of 5 should be sufficient. I want to understand why
>>>>>> parsechecker works and crawl doesn't.
>>>>>>
>>>>>> Thanks,
>>>>>> Vijay
>>>>>
>>>>> --
>>>>>
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>>> http://twitter.com/digitalpebble
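Pulling together the advice from the thread above, a nutch-site.xml fragment along these lines should reproduce the settings Julien suggests. The property names are the standard Nutch 1.x ones; the values are just the ones discussed here, not a general recommendation:

```xml
<!-- nutch-site.xml: sketch of the settings discussed in this thread -->
<property>
  <name>http.redirect.max</name>
  <value>5</value>
  <description>Follow up to 5 redirects within the same fetch step,
  so a moved(12) or temp_moved(13) target is fetched in one round.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>Do not add outlinks to the crawldb; only the injected
  seed URLs are ever fetched (effectively depth 1).</description>
</property>
```

With the Solr steps commented out of the crawl script, an invocation such as `bin/crawl urls/ crawl/ http://localhost:8983/solr 1` (seed dir, crawl dir, dummy Solr URL, number of rounds, per the 1.x script's argument order) should then fetch the redirect target within a single round. The Solr URL here is, as Julien says, just a placeholder.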

