Re: Can't crawl a domain; can't figure out why.

2011-12-20 Thread Markus Jelsma

archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/timeline/letter1846.html anchor: [http://libraries.mit.edu/archives/timeline/letter1846.html]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription anchor: [http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu anchor: [http://libraries.mit.edu]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html]
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
> outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
> outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
> Content Metadata: Date=Tue, 20 Dec 2011 21:30:50 GMT Content-Length=191500 Via=1.0 barracuda.acp.org:8080 (http_scan/4.0.2.6.19) Connection=close Content-Type=text/html Accept-Ranges=bytes X-Cache=MISS from barracuda.acp.org Server=Apache/2.2.3 (Red Hat)
> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
> 
> 
> 
> -----Original Message-----
> From: alx...@aim.com [mailto:alx...@aim.com]
> Sent: Tuesday, December 20, 2011 2:15 PM
> To: user@nutch.apache.org
> Subject: Re: Can't crawl a domain; can't figure out why.
> 
> It seems that robots.txt in libraries.mit.edu has a lot of restrictions.
> 
> Alex.
> 
> 

RE: Can't crawl a domain; can't figure out why.

2011-12-20 Thread Chip Calhoun

://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription anchor: [http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription]
  outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
  outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
  outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
  outlink: toUrl: http://libraries.mit.edu anchor: [http://libraries.mit.edu]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html anchor: [http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html]
  outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
  outlink: toUrl: http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html anchor: [http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
  outlink: toUrl: http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc anchor: Return to Table of Contents »
Content Metadata: Date=Tue, 20 Dec 2011 21:30:50 GMT Content-Length=191500 Via=1.0 barracuda.acp.org:8080 (http_scan/4.0.2.6.19) Connection=close Content-Type=text/html Accept-Ranges=bytes X-Cache=MISS from barracuda.acp.org Server=Apache/2.2.3 (Red Hat)
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8



-----Original Message-----
From: alx...@aim.com [mailto:alx...@aim.com] 
Sent: Tuesday, December 20, 2011 2:15 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

It seems that robots.txt in libraries.mit.edu has a lot of restrictions.

Alex.

 

-----Original Message-----
From: Chip Calhoun 
To: user ; 'markus.jel...@openindex.io' 

Sent: Tue, Dec 20, 2011 7:28 am
Subject: RE: Can't crawl a domain; can't figure out why.


I just compared this against a similar crawl of a completely different domain 
which I know works, and you're right on both counts. The parser doesn't parse a 
file, and nothing is sent to the solrindexer. I tried a crawl with more 
documents and found that while I can get documents from mit.edu, I get 
absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 
1.3 as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a 
parse error, but how could I tell? I've spoken with some helpful people from 
MIT, and they don't see a reason why this wouldn't work.

Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing pe

Re: Can't crawl a domain; can't figure out why.

2011-12-20 Thread alxsss

It seems that robots.txt in libraries.mit.edu has a lot of restrictions.

Alex.
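
One way to test the robots.txt suspicion offline is to hand the file's text to a parser and ask whether a given URL is allowed for the crawler's agent name. Below is a minimal Python sketch of that check. The rules in SAMPLE_ROBOTS_TXT and the agent name my-nutch-crawler are made-up placeholders, not the actual contents of the libraries.mit.edu robots.txt, and Nutch's own robots handling may differ in detail from Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; paste the real robots.txt text here.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /archives/private/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "my-nutch-crawler" is an assumed agent name; use the value configured
    # for the crawler being debugged.
    for url in (
        "http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html",
        "http://libraries.mit.edu/archives/private/example.html",
    ):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-nutch-crawler", url))
```

If the seed URLs come back as disallowed with the real rules, robots.txt would explain an empty fetch; if they are allowed, the cause is more likely elsewhere.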

 

-----Original Message-----
From: Chip Calhoun 
To: user ; 'markus.jel...@openindex.io' 

Sent: Tue, Dec 20, 2011 7:28 am
Subject: RE: Can't crawl a domain; can't figure out why.


I just compared this against a similar crawl of a completely different domain 
which I know works, and you're right on both counts. The parser doesn't parse a 
file, and nothing is sent to the solrindexer. I tried a crawl with more 
documents and found that while I can get documents from mit.edu, I get 
absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 
1.3 as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a 
parse error, but how could I tell? I've spoken with some helpful people from 
MIT, and they don't see a reason why this wouldn't work.

Chip

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the 
domain you can't crawl. libraries.mit.edu seems to work, although the indexer 
doesn't seem to send a document in and the parser doesn't mention parsing that 
file.

Either the file throws a parse error or is truncated or 

> I'm trying to crawl pages from a number of domains, and one of these 
> domains has been giving me trouble. The really irritating thing is 
> that it did work at least once, which led me to believe that I'd 
> solved the problem. I can't think of anything at this point but to 
> paste my log of a failed crawl and solrindex and hope that someone can 
> think of anything I've overlooked. Does anything look strange here?
> 
> Thanks,
> Chip
> 

RE: Can't crawl a domain; can't figure out why.

2011-12-20 Thread Chip Calhoun

I just compared this against a similar crawl of a completely different domain 
which I know works, and you're right on both counts. The parser doesn't parse a 
file, and nothing is sent to the solrindexer. I tried a crawl with more 
documents and found that while I can get documents from mit.edu, I get 
absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 
1.3 as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a 
parse error, but how could I tell? I've spoken with some helpful people from 
MIT, and they don't see a reason why this wouldn't work.

Chip
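
On the truncation question, a quick sanity check is to compare the Content-Length the server reports with the fetcher's content limit. The Python sketch below only does that arithmetic. The 191500 figure is the Content-Length from the parse dump posted in this thread; the 65536 limit is an assumption about the default value of http.content.limit and should be checked against the local nutch-default.xml and nutch-site.xml. Treating a negative limit as "no limit" is likewise an assumption about how that property is interpreted.

```python
def exceeds_content_limit(content_length: int, content_limit: int) -> bool:
    """Return True if a body of `content_length` bytes is longer than the limit.

    A negative limit is treated as "no limit" (assumed convention).
    """
    if content_limit < 0:
        return False
    return content_length > content_limit


# Content-Length reported in the parse dump from this thread.
REPORTED_CONTENT_LENGTH = 191500
# Assumed default for http.content.limit; verify in the local configuration.
ASSUMED_CONTENT_LIMIT = 65536

if __name__ == "__main__":
    if exceeds_content_limit(REPORTED_CONTENT_LENGTH, ASSUMED_CONTENT_LIMIT):
        print("Page is larger than the limit; the fetched copy may be truncated.")
    else:
        print("Page fits within the limit; truncation is unlikely.")
```

If the page does exceed the limit, raising the limit in nutch-site.xml and fetching again would be a reasonable next experiment.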

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the 
domain you can't crawl. libraries.mit.edu seems to work, although the indexer 
doesn't seem to send a document in and the parser doesn't mention parsing that 
file.

Either the file throws a parse error or is truncated or 

> I'm trying to crawl pages from a number of domains, and one of these 
> domains has been giving me trouble. The really irritating thing is 
> that it did work at least once, which led me to believe that I'd 
> solved the problem. I can't think of anything at this point but to 
> paste my log of a failed crawl and solrindex and hope that someone can 
> think of anything I've overlooked. Does anything look strange here?
> 
> Thanks,
> Chip
> 

Re: Can't crawl a domain; can't figure out why.

2011-12-19 Thread Markus Jelsma

Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the 
domain you can't crawl. libraries.mit.edu seems to work, although the indexer 
doesn't seem to send a document in and the parser doesn't mention parsing that 
file.

Either the file throws a parse error or is truncated or 

> I'm trying to crawl pages from a number of domains, and one of these
> domains has been giving me trouble. The really irritating thing is that it
> did work at least once, which led me to believe that I'd solved the
> problem. I can't think of anything at this point but to paste my log of a
> failed crawl and solrindex and hope that someone can think of anything
> I've overlooked. Does anything look strange here?
> 
> Thanks,
> Chip
> 

Can't crawl a domain; can't figure out why.

2011-12-19 Thread Chip Calhoun

I'm trying to crawl pages from a number of domains, and one of these domains 
has been giving me trouble. The really irritating thing is that it did work at 
least once, which led me to believe that I'd solved the problem. I can't think 
of anything at this point but to paste my log of a failed crawl and solrindex 
and hope that someone can think of anything I've overlooked. Does anything look 
strange here?

Thanks,
Chip

2011-12-19 16:31:01,010 WARN  crawl.Crawl - solrUrl is not set, indexing will be skipped...
2011-12-19 16:31:01,404 INFO  crawl.Crawl - crawl started in: mit-c-crawl
2011-12-19 16:31:01,420 INFO  crawl.Crawl - rootUrlDir = mit-c-urls
2011-12-19 16:31:01,420 INFO  crawl.Crawl - threads = 10
2011-12-19 16:31:01,420 INFO  crawl.Crawl - depth = 1
2011-12-19 16:31:01,420 INFO  crawl.Crawl - solrUrl=null
2011-12-19 16:31:01,420 INFO  crawl.Crawl - topN = 50
2011-12-19 16:31:01,420 INFO  crawl.Injector - Injector: starting at 2011-12-19 16:31:01
2011-12-19 16:31:01,420 INFO  crawl.Injector - Injector: crawlDb: mit-c-crawl/crawldb
2011-12-19 16:31:01,420 INFO  crawl.Injector - Injector: urlDir: mit-c-urls
2011-12-19 16:31:01,436 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2011-12-19 16:31:02,854 INFO  plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - Registered Plugins:
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   the nutch core extension points (nutch-extensionpoints)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Basic URL Normalizer (urlnormalizer-basic)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Html Parse Plug-in (parse-html)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Basic Indexing Filter (index-basic)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   HTTP Framework (lib-http)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Regex URL Filter (urlfilter-regex)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Http Protocol Plug-in (protocol-http)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Regex URL Normalizer (urlnormalizer-regex)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Tika Parser Plug-in (parse-tika)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   OPIC Scoring Plug-in (scoring-opic)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   CyberNeko HTML Parser (lib-nekohtml)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Anchor Indexing Filter (index-anchor)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   URL Meta Indexing Filter (urlmeta)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Regex URL Filter Framework (lib-regex-filter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - Registered Extension-Points:
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-19 16:31:02,964 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2011-12-19 16:31:05,722 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2011-12-19 16:31:07,014 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-12-19 16:31:07,897 INFO  crawl.Injector - Injector: finished at 2011-12-19 16:31:07, elapsed: 00:00:06
2011-12-19 16:31:07,913 INFO  crawl.Generator - Generator: starting at 
20
