Nutch-1.2 Crawling Chinese Web Pages

Dennis Tue, 19 Oct 2010 18:53:41 -0700

Hi,
I am trying to crawl Chinese web pages using nutch-1.2. Does anyone have a 
tutorial on this? I got a lot of errors during the configuration. See the 
following:
Thanks,
Dennis
b...@ubuntu:~/workspacecloud2/analysis$ javacc NutchAnalysis.jj
Java Compiler Compiler Version 5.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file NutchAnalysis.jj . . .
Warning: Line 23, Column 3: Bad option name "OPTIMIZE_TOKEN_MANAGER".  Option setting will be ignored.
Note: UNICODE_INPUT option is specified. Please make sure you create the parser/lexer using a Reader with the correct character encoding.
File "TokenMgrError.java" does not exist.  Will create one.
File "ParseException.java" does not exist.  Will create one.
File "Token.java" does not exist.  Will create one.
File "CharStream.java" does not exist.  Will create one.
Parser generated with 0 errors and 1 warnings.


b...@ubuntu:~/workspacecloud2/nutch-1.2$ bin/nutch crawl urls -dir crawl -depth 2
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 2
indexer=lucene
Injector: starting at 2010-10-20 09:35:50
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2010-10-20 09:36:29, elapsed: 00:00:38
Generator: starting at 2010-10-20 09:36:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:526)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
