RE: Can't crawl a domain; can't figure out why.

2011-12-20 Thread Chip Calhoun
I just compared this against a similar crawl of a completely different domain 
which I know works, and you're right on both counts: the parser never parses the 
file, and nothing is sent to the Solr indexer. I tried a crawl with more 
documents and found that while I can get documents from mit.edu, I get 
absolutely nothing from libraries.mit.edu. I get the same behavior with Nutch 
1.3 as well.

I don't think we're dealing with truncated files. I'm willing to believe it's a 
parse error, but how could I tell? I've spoken with some helpful people from 
MIT, and they don't see a reason why this wouldn't work.
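
One way to check, assuming the standard Nutch 1.4 command-line tools, is to 
dump the crawldb and the segment and look at the per-URL status lines (the 
segdump output directory below is just a placeholder name):

  bin/nutch readdb mit-c-crawl/crawldb -stats
  bin/nutch readseg -dump mit-c-crawl/segments/<segment> segdump

In the segment dump, the fetch and parse entries for the libraries.mit.edu 
URLs should show whether the fetch succeeded and what the parse status was; a 
robots block or a fetch failure should show up there rather than as a parse 
error. If your bin/nutch has the parsechecker command, running 
bin/nutch parsechecker -dumpText <url> against a single libraries.mit.edu page 
is another quick way to see whether the parser handles it.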

Chip

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, December 19, 2011 5:01 PM
To: user@nutch.apache.org
Subject: Re: Can't crawl a domain; can't figure out why.

Nothing peculiar; this looks like Nutch 1.4, right? But you also didn't mention 
which domain you can't crawl. libraries.mit.edu seems to work, although the 
indexer doesn't seem to send any document and the parser doesn't mention 
parsing that file.

Either the file throws a parse error or is truncated or 

Re: Can't crawl a domain; can't figure out why.

2011-12-20 Thread alxsss
It seems that the robots.txt at libraries.mit.edu has a lot of restrictions.
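
A quick way to confirm is to fetch it directly and compare its User-agent 
sections and Disallow rules against the http.agent.name configured in 
nutch-site.xml, e.g.:

  curl http://libraries.mit.edu/robots.txt

(or just open that URL in a browser). If robots rules are what's blocking the 
crawl, the affected URLs should show up as denied at fetch time rather than 
producing any parse or index output.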

Alex.

 

RE: Can't crawl a domain; can't figure out why.

2011-12-20 Thread Chip Calhoun
  outlink: toUrl: http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: 
[http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf 
anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf 
anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: 
http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: 
[http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf 
anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: 
http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription 
anchor: 
[http://libraries.mit.edu/archives/exhibits/andrew/index1.html#transcription]
  outlink: toUrl: 
http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html 
anchor: 
[http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
  outlink: toUrl: 
http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 
anchor: 
[http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf 
anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: 
http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf anchor: 
[http://libraries.mit.edu/archives/mithistory/pdf/scope-plan.pdf]
  outlink: toUrl: 
http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html anchor: 
[http://libraries.mit.edu/archives/exhibits/wbr-honeymoon/index1.html]
  outlink: toUrl: 
http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc
 anchor: Return to Table of Contents »
  outlink: toUrl: http://libraries.mit.edu anchor: [http://libraries.mit.edu]
  outlink: toUrl: 
http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf anchor: 
[http://libraries.mit.edu/archives/mithistory/pdf/objects-plan.pdf]
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf 
anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: 
http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html anchor: 
[http://libraries.mit.edu/archives/exhibits/wbr/bibliography.html]
  outlink: toUrl: 
http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc
 anchor: Return to Table of Contents »
  outlink: toUrl: http://libraries.mit.edu/archives/mithistory/pdf/account.pdf 
anchor: [http://libraries.mit.edu/archives/mithistory/pdf/account.pdf]
  outlink: toUrl: 
http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1 
anchor: 
[http://libraries.mit.edu/archives/exhibits/MIT-birthday/charter.html#page1]
  outlink: toUrl: 
http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html 
anchor: 
[http://libraries.mit.edu/archives/exhibits/MIT-birthday/act-of-association.html]
  outlink: toUrl: 
http://libraries.mit.edu/archives/research/collections/collections-mc/mc1.html#toc
 anchor: Return to Table of Contents »
Content Metadata: Date=Tue, 20 Dec 2011 21:30:50 GMT Content-Length=191500 
Via=1.0 barracuda.acp.org:8080 (http_scan/4.0.2.6.19) Connection=close 
Content-Type=text/html Accept-Ranges=bytes X-Cache=MISS from barracuda.acp.org 
Server=Apache/2.2.3 (Red Hat)
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
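
One thing that may be worth checking here: the Content-Length of 191500 above 
is well over Nutch's default http.content.limit of 65536 bytes, so unless that 
limit was raised the fetched content would be truncated. A minimal 
nutch-site.xml override to rule that out (a value of -1 disables the limit):

  <property>
    <name>http.content.limit</name>
    <!-- default is 65536 bytes; -1 means no truncation -->
    <value>-1</value>
  </property>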



Re: Can't crawl a domain; can't figure out why.

2011-12-20 Thread Markus Jelsma

Can't crawl a domain; can't figure out why.

2011-12-19 Thread Chip Calhoun
I'm trying to crawl pages from a number of domains, and one of these domains 
has been giving me trouble. The really irritating thing is that it did work at 
least once, which led me to believe that I'd solved the problem. I can't think 
of anything at this point but to paste my log of a failed crawl and solrindex 
and hope that someone can think of anything I've overlooked. Does anything look 
strange here?

Thanks,
Chip
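
For reference, a crawl like the one logged below is normally started with 
something along these lines (the Solr URL is only an example); without a 
-solr argument the "solrUrl is not set" warning at the top of the log is 
expected, and indexing then has to be run separately with solrindex:

  bin/nutch crawl mit-c-urls -dir mit-c-crawl -depth 1 -topN 50 -solr http://localhost:8983/solr/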

2011-12-19 16:31:01,010 WARN  crawl.Crawl - solrUrl is not set, indexing will 
be skipped...
2011-12-19 16:31:01,404 INFO  crawl.Crawl - crawl started in: mit-c-crawl
2011-12-19 16:31:01,420 INFO  crawl.Crawl - rootUrlDir = mit-c-urls
2011-12-19 16:31:01,420 INFO  crawl.Crawl - threads = 10
2011-12-19 16:31:01,420 INFO  crawl.Crawl - depth = 1
2011-12-19 16:31:01,420 INFO  crawl.Crawl - solrUrl=null
2011-12-19 16:31:01,420 INFO  crawl.Crawl - topN = 50
2011-12-19 16:31:01,420 INFO  crawl.Injector - Injector: starting at 2011-12-19 
16:31:01
2011-12-19 16:31:01,420 INFO  crawl.Injector - Injector: crawlDb: 
mit-c-crawl/crawldb
2011-12-19 16:31:01,420 INFO  crawl.Injector - Injector: urlDir: mit-c-urls
2011-12-19 16:31:01,436 INFO  crawl.Injector - Injector: Converting injected 
urls to crawl db entries.
2011-12-19 16:31:02,854 INFO  plugin.PluginRepository - Plugins: looking in: 
C:\Apache\apache-nutch-1.4\runtime\local\plugins
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - Plugin Auto-activation 
mode: [true]
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - Registered Plugins:
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -the 
nutch core extension points (nutch-extensionpoints)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Basic 
URL Normalizer (urlnormalizer-basic)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Html 
Parse Plug-in (parse-html)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Basic 
Indexing Filter (index-basic)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Http / 
Https Protocol Plug-in (protocol-httpclient)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -HTTP 
Framework (lib-http)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Regex 
URL Filter (urlfilter-regex)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Http 
Protocol Plug-in (protocol-http)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Regex 
URL Normalizer (urlnormalizer-regex)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Tika 
Parser Plug-in (parse-tika)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -OPIC 
Scoring Plug-in (scoring-opic)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Anchor 
Indexing Filter (index-anchor)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -URL Meta 
Indexing Filter (urlmeta)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Regex 
URL Filter Framework (lib-regex-filter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository - Registered 
Extension-Points:
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Nutch 
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Nutch 
Protocol (org.apache.nutch.protocol.Protocol)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Nutch 
Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Nutch 
URL Filter (org.apache.nutch.net.URLFilter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Nutch 
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -HTML 
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Nutch 
Content Parser (org.apache.nutch.parse.Parser)
2011-12-19 16:31:02,917 INFO  plugin.PluginRepository -Nutch 
Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-19 16:31:02,964 INFO  regex.RegexURLNormalizer - can't find rules for 
scope 'inject', using default
2011-12-19 16:31:05,722 INFO  crawl.Injector - Injector: Merging injected urls 
into crawl db.
2011-12-19 16:31:07,014 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2011-12-19 16:31:07,897 INFO  crawl.Injector - Injector: finished at 2011-12-19 
16:31:07, elapsed: 00:00:06
2011-12-19 16:31:07,913 INFO  crawl.Generator - Generator: starting at