Re: Nutch-1.2 Crawling Chinese Web Pages

cong liu Sun, 24 Oct 2010 19:41:23 -0700

No tutorial, as far as I know.

This can be solved with the plugin mechanism. Please refer to the analyzer
plugins already implemented for other languages in the plugin folder, for
example src\plugin\analysis-de.
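
As a rough sketch of what that could look like, a Chinese analyzer plugin
would ship a plugin.xml that registers an analyzer class at the NutchAnalyzer
extension point for the language code "zh". Everything named below
(the plugin id analysis-zh, the jar name, and the class
org.example.analysis.zh.ChineseAnalyzer) is a hypothetical placeholder, and
the layout is my assumption of how the existing analyzer plugins are
declared, so check it against the plugin.xml in src\plugin\analysis-de
before relying on it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical descriptor for a Chinese analyzer plugin.
     The id, jar name and class name are placeholders, not part of Nutch. -->
<plugin
   id="analysis-zh"
   name="Chinese Analysis Plug-in (sketch)"
   version="1.0.0"
   provider-name="org.example">

   <runtime>
      <!-- jar produced by building the plugin (assumed name) -->
      <library name="analysis-zh.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <!-- Register the analyzer for documents whose language is "zh" -->
   <extension id="org.example.analysis.zh"
              name="ChineseAnalyzer"
              point="org.apache.nutch.analysis.NutchAnalyzer">
      <implementation id="org.example.analysis.zh.ChineseAnalyzer"
                      class="org.example.analysis.zh.ChineseAnalyzer">
         <parameter name="lang" value="zh"/>
      </implementation>
   </extension>

</plugin>
```

The new plugin id would then also have to be added to the plugin.includes
property in conf/nutch-site.xml so that Nutch loads it.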


On Wed, Oct 20, 2010 at 9:53 AM, Dennis <[email protected]> wrote:

> Hi,
> I am trying to crawl Chinese web pages using nutch-1.2. Does anyone have
> a tutorial on this? I got a lot of errors during the configuration. See the
> following:
> Thanks,
> Dennis
> b...@ubuntu:~/workspacecloud2/analysis$ javacc NutchAnalysis.jj
> Java Compiler Compiler Version 5.0 (Parser Generator)
> (type "javacc" with no arguments for help)
> Reading from file NutchAnalysis.jj . . .
> Warning: Line 23, Column 3: Bad option name "OPTIMIZE_TOKEN_MANAGER".
>   Option setting will be ignored.
> Note: UNICODE_INPUT option is specified. Please make sure you create the
>   parser/lexer using a Reader with the correct character encoding.
> File "TokenMgrError.java" does not exist.  Will create one.
> File "ParseException.java" does not exist.  Will create one.
> File "Token.java" does not exist.  Will create one.
> File "CharStream.java" does not exist.  Will create one.
> Parser generated with 0 errors and 1 warnings.
>
> b...@ubuntu:~/workspacecloud2/nutch-1.2$ bin/nutch crawl urls -dir crawl -depth 2
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 2
> indexer=lucene
> Injector: starting at 2010-10-20 09:35:50
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2010-10-20 09:36:29, elapsed: 00:00:38
> Generator: starting at 2010-10-20 09:36:29
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:526)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:431)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
