Re: Can't crawl a domain; can't figure out why.
> archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/timeline/letter1846.html anchor: [http://libraries.mit.edu/archives/timeline/letter1846.html]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription anchor: [http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu anchor: [http://libraries.mit.edu]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html]
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> Content Metadata: Date=Tue, 20 Dec 2011 21:30:50 GMT Content-Length=191500 Via=1.0 barracuda.acp.org:8080 (http_scan/4.0.2.6.19) Connection=close Content-Type=text/html Accept-Ranges=bytes X-Cache=MISS from barracuda.acp.org Server=Apache/2.2.3 (Red Hat)
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>
> -----Original Message-----
> From: alx...@aim.com [mailto:alx...@aim.com]
> Sent: Tuesday, December 20, 2011 2:15 PM
> To: user@nutch.apache.org
> Subject: Re: Can't crawl a domain; can't figure out why.
>
> It seems that robots.txt in libraries.mit.edu has a lot of restrictions.
>
> Alex.
RE: Can't crawl a domain; can't figure out why.
://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription anchor: [http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
outlink: toUrl: http://libraries.mit.edu anchor: [http://libraries.mit.edu]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
Content Metadata: Date=Tue, 20 Dec 2011 21:30:50 GMT Content-Length=191500 Via=1.0 barracuda.acp.org:8080 (http_scan/4.0.2.6.19) Connection=close Content-Type=text/html Accept-Ranges=bytes X-Cache=MISS from barracuda.acp.org Server=Apache/2.2.3 (Red Hat)
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

-----Original Message-----
From: alx...@aim.com [mailto:alx...@aim.com]
Sent: Tuesday, December 20, 2011 2:15 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

It seems that robots.txt in libraries.mit.edu has a lot of restrictions.

Alex.

-----Original Message-----
From: Chip Calhoun
To: user ; 'markus.jel...@openindex.io'
Sent: Tue, Dec 20, 2011 7:28 am
Subject: RE: Can't crawl a domain; can't figure out why.

I just compared this against a similar crawl of a completely different domain which I know works, and you're right on both counts. The parser doesn't parse a file, and nothing is sent to the solrindexer.

I tried a crawl with more documents and found that while I can get documents from mit.edu, I get absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 1.3 as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a parse error, but how could I tell? I've spoken with some helpful people from MIT, and they don't see a reason why this wouldn't work.

Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing pe
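The Content Metadata line in the parse dump above reports Content-Length=191500, and truncation was raised elsewhere in this thread as one possible cause. A rough way to test that idea is to compare the reported length against the fetch size limit. The sketch below is only an illustration: it assumes the limit is 65536 bytes, which is believed to be the stock `http.content.limit` default for Nutch releases of this era, so check your own `nutch-site.xml` before drawing conclusions. The function names are made up for this example.

```python
import re

# Assumption: 65536 bytes is the stock http.content.limit default.
# Your nutch-site.xml may override it, so treat this value as a placeholder.
ASSUMED_CONTENT_LIMIT = 65536

# A key is a run of letters and hyphens followed by '='. The value runs
# until the next " Key=" or the end of the string, so values may contain spaces.
_PAIR = re.compile(r"([A-Za-z][A-Za-z-]*)=(.*?)(?=\s+[A-Za-z][A-Za-z-]*=|$)")


def parse_content_metadata(text):
    """Split a 'Key=value Key=value ...' metadata string into a dict."""
    return {key: value.strip() for key, value in _PAIR.findall(text)}


def exceeds_limit(metadata, limit=ASSUMED_CONTENT_LIMIT):
    """Return True if Content-Length is above limit, None if it is missing."""
    length = metadata.get("Content-Length")
    if length is None:
        return None
    return int(length) > limit


# Metadata string copied from the parse dump in this thread.
SAMPLE = (
    "Date=Tue, 20 Dec 2011 21:30:50 GMT Content-Length=191500 "
    "Via=1.0 barracuda.acp.org:8080 (http_scan/4.0.2.6.19) Connection=close "
    "Content-Type=text/html Accept-Ranges=bytes "
    "X-Cache=MISS from barracuda.acp.org Server=Apache/2.2.3 (Red Hat)"
)

if __name__ == "__main__":
    meta = parse_content_metadata(SAMPLE)
    print("Content-Length:", meta["Content-Length"])
    print("Above assumed limit:", exceeds_limit(meta))
```

If the page turns out to be larger than the configured limit, raising `http.content.limit` and re-fetching is a cheap experiment. Whether that explains the missing parse here is not established by the thread.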
Re: Can't crawl a domain; can't figure out why.
It seems that robots.txt in libraries.mit.edu has a lot of restrictions.

Alex.

-----Original Message-----
From: Chip Calhoun
To: user ; 'markus.jel...@openindex.io'
Sent: Tue, Dec 20, 2011 7:28 am
Subject: RE: Can't crawl a domain; can't figure out why.

I just compared this against a similar crawl of a completely different domain which I know works, and you're right on both counts. The parser doesn't parse a file, and nothing is sent to the solrindexer.

I tried a crawl with more documents and found that while I can get documents from mit.edu, I get absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 1.3 as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a parse error, but how could I tell? I've spoken with some helpful people from MIT, and they don't see a reason why this wouldn't work.

Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the domain you can't crawl. libraries.mit.edu seems to work, although the indexer doesn't seem to send a document in and the parser doesn't mention parsing that file. Either the file throws a parse error or is truncated or

> I'm trying to crawl pages from a number of domains, and one of these
> domains has been giving me trouble. The really irritating thing is
> that it did work at least once, which led me to believe that I'd
> solved the problem. I can't think of anything at this point but to
> paste my log of a failed crawl and solrindex and hope that someone can
> think of anything I've overlooked. Does anything look strange here?
>
> Thanks,
> Chip
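Alex's robots.txt suggestion can be checked without running a crawl. The sketch below uses Python's standard `urllib.robotparser` to ask whether a given agent may fetch a given URL under a set of rules. The rules and the agent name in it are hypothetical placeholders, not the real libraries.mit.edu file or the poster's configuration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, agent, url):
    """Return True if the rules in robots_txt permit agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules for illustration only; NOT the real libraries.mit.edu file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical agent name; substitute the value of http.agent.name in your config.
AGENT = "my-nutch-crawler"

if __name__ == "__main__":
    for url in (
        "http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html",
        "http://libraries.mit.edu/private/page.html",
    ):
        print(url, "->", is_allowed(SAMPLE_ROBOTS, AGENT, url))
```

To test against the live file instead of a pasted copy, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the rules before `can_fetch()` is called.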
RE: Can't crawl a domain; can't figure out why.
I just compared this against a similar crawl of a completely different domain which I know works, and you're right on both counts. The parser doesn't parse a file, and nothing is sent to the solrindexer.

I tried a crawl with more documents and found that while I can get documents from mit.edu, I get absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 1.3 as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a parse error, but how could I tell? I've spoken with some helpful people from MIT, and they don't see a reason why this wouldn't work.

Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the domain you can't crawl. libraries.mit.edu seems to work, although the indexer doesn't seem to send a document in and the parser doesn't mention parsing that file. Either the file throws a parse error or is truncated or

> I'm trying to crawl pages from a number of domains, and one of these
> domains has been giving me trouble. The really irritating thing is
> that it did work at least once, which led me to believe that I'd
> solved the problem. I can't think of anything at this point but to
> paste my log of a failed crawl and solrindex and hope that someone can
> think of anything I've overlooked. Does anything look strange here?
>
> Thanks,
> Chip
Re: Can't crawl a domain; can't figure out why.
Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the domain you can't crawl. libraries.mit.edu seems to work, although the indexer doesn't seem to send a document in and the parser doesn't mention parsing that file. Either the file throws a parse error or is truncated or

> I'm trying to crawl pages from a number of domains, and one of these
> domains has been giving me trouble. The really irritating thing is that it
> did work at least once, which led me to believe that I'd solved the
> problem. I can't think of anything at this point but to paste my log of a
> failed crawl and solrindex and hope that someone can think of anything
> I've overlooked. Does anything look strange here?
>
> Thanks,
> Chip
Can't crawl a domain; can't figure out why.
I'm trying to crawl pages from a number of domains, and one of these domains has been giving me trouble. The really irritating thing is that it did work at least once, which led me to believe that I'd solved the problem. I can't think of anything at this point but to paste my log of a failed crawl and solrindex and hope that someone can think of anything I've overlooked. Does anything look strange here?

Thanks,
Chip

2011-12-19 16:31:01,010 WARN crawl.Crawl - solrUrl is not set, indexing will be skipped...
2011-12-19 16:31:01,404 INFO crawl.Crawl - crawl started in: mit-c-crawl
2011-12-19 16:31:01,420 INFO crawl.Crawl - rootUrlDir = mit-c-urls
2011-12-19 16:31:01,420 INFO crawl.Crawl - threads = 10
2011-12-19 16:31:01,420 INFO crawl.Crawl - depth = 1
2011-12-19 16:31:01,420 INFO crawl.Crawl - solrUrl=null
2011-12-19 16:31:01,420 INFO crawl.Crawl - topN = 50
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: starting at 2011-12-19 16:31:01
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: crawlDb: mit-c-crawl/crawldb
2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: urlDir: mit-c-urls
2011-12-19 16:31:01,436 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2011-12-19 16:31:02,854 INFO plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Plugins:
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - URL Meta Indexing Filter (urlmeta)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Extension-Points:
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-19 16:31:02,964 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2011-12-19 16:31:05,722 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2011-12-19 16:31:07,014 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-12-19 16:31:07,897 INFO crawl.Injector - Injector: finished at 2011-12-19 16:31:07, elapsed: 00:00:06
2011-12-19 16:31:07,913 INFO crawl.Generator - Generator: starting at 20
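A log like the one above is easier to scan once it is split into fields. The sketch below assumes each entry follows the layout visible in this excerpt (timestamp, level, component, a dash, then the message). It counts entries per component and pulls out anything at WARN level or above, which gives a quick view of which stages logged at all. The function names are illustrative, and which component name a parse step would log under is not shown in this excerpt, so any name you filter on is an assumption.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "<date> <time>,<ms> LEVEL component - message"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<component>\S+)\s+-\s*(?P<message>.*)$"
)


def parse_log(lines):
    """Return one dict per line that matches the assumed layout; skip the rest."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def count_components(entries):
    """Count how many entries each component produced."""
    return Counter(entry["component"] for entry in entries)


def problems(entries):
    """Return the entries logged at WARN, ERROR, or FATAL level."""
    return [e for e in entries if e["level"] in ("WARN", "ERROR", "FATAL")]


# Three lines copied from the log in this thread.
SAMPLE_LINES = [
    "2011-12-19 16:31:01,010 WARN crawl.Crawl - solrUrl is not set, indexing will be skipped...",
    "2011-12-19 16:31:01,420 INFO crawl.Crawl - depth = 1",
    "2011-12-19 16:31:07,014 WARN util.NativeCodeLoader - Unable to load native-hadoop "
    "library for your platform... using builtin-java classes where applicable",
]

if __name__ == "__main__":
    parsed = parse_log(SAMPLE_LINES)
    for component, count in count_components(parsed).most_common():
        print(count, component)
    for entry in problems(parsed):
        print(entry["level"], entry["component"], entry["message"])
```

Running the same functions over the full log file of a working crawl and a failing one, then comparing the component counts, is one way to see which stage goes quiet for the problem domain.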