Hi,
I am trying to crawl Chinese web pages using nutch-1.2. Does anyone have a
tutorial on this? I got a lot of errors during the configuration; see the
output below.

Thanks,
Dennis

b...@ubuntu:~/workspacecloud2/analysis$ javacc NutchAnalysis.jj
Java Compiler Compiler Version 5.0 (Parser Generator)
(type "javacc" with no arguments for help)
Reading from file NutchAnalysis.jj . . .
Warning: Line 23, Column 3: Bad option name "OPTIMIZE_TOKEN_MANAGER". Option setting will be ignored.
Note: UNICODE_INPUT option is specified. Please make sure you create the parser/lexer using a Reader with the correct character encoding.
File "TokenMgrError.java" does not exist. Will create one.
File "ParseException.java" does not exist. Will create one.
File "Token.java" does not exist. Will create one.
File "CharStream.java" does not exist. Will create one.
Parser generated with 0 errors and 1 warnings.

b...@ubuntu:~/workspacecloud2/nutch-1.2$ bin/nutch crawl urls -dir crawl -depth 2
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 2
indexer=lucene
Injector: starting at 2010-10-20 09:35:50
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2010-10-20 09:36:29, elapsed: 00:00:38
Generator: starting at 2010-10-20 09:36:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:526)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)