crawl websites out of own java (mojarra 2.0.2) web application and without using bin/nutch

toocrazymail Wed, 12 May 2010 07:11:05 -0700

hi :) 

I am trying to use Nutch without bin/nutch, from my own Java (Mojarra 2.0.2) 
webapp. I have searched Google for examples, but found none that show how to 
do this. I get an exception and the job fails (I think the cause has 
something to do with Hadoop). Here is my code:


    public void run() throws Exception {
        final String[] args = new String[] {
            String.format("%s%s%s", this.rootPath, File.separator, DIRECTORY_URLS),
            "-dir", String.format("%s%s%s", this.rootPath, File.separator, DIRECTORY_CRAWL),
            "-threads", this.preferences.get("threads"),
            "-depth", this.preferences.get("depth"),
            "-topN", this.preferences.get("topN"),
            "-solr", this.preferences.get("solr")
        };
        Crawl.main(args);
    }

And here is part of the log output:

10/05/12 15:44:25 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
10/05/12 15:44:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/05/12 15:44:25 INFO mapred.FileInputFormat: Total input paths to process : 1
10/05/12 15:44:26 INFO mapred.JobClient: Running job: job_local_0003
10/05/12 15:44:26 INFO mapred.FileInputFormat: Total input paths to process : 1
10/05/12 15:44:26 INFO mapred.MapTask: numReduceTasks: 1
10/05/12 15:44:26 INFO mapred.MapTask: io.sort.mb = 100
10/05/12 15:44:26 WARN mapred.LocalJobRunner: job_local_0003
java.lang.OutOfMemoryError: Java heap space

Can someone help me or tell me how I can crawl from a Java application? I 
have increased Xms and Xmx, but nothing changed...
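
To check whether the increased Xms/Xmx values actually reach the JVM that runs the webapp (they have to be set on the JVM of the servlet container itself), a small check like the following could be called from inside the application. This is only a minimal sketch using the standard java.lang.Runtime API. The class name HeapCheck and its helper methods are made up for illustration, and the 100 MB figure is simply the io.sort.mb value from the log above, which the map task takes from this same heap.

```java
public class HeapCheck {

    /** Maximum heap the running JVM will try to use, in megabytes. */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True if a buffer of bufferMb megabytes is smaller than the given heap size. */
    public static boolean fits(long bufferMb, long heapMb) {
        return bufferMb < heapMb;
    }

    public static void main(String[] args) {
        long heapMb = maxHeapMb();
        System.out.println("max heap (MB): " + heapMb);
        // 100 is the io.sort.mb value reported in the log (assumption: unchanged default)
        System.out.println("room for a 100 MB sort buffer: " + fits(100, heapMb));
    }
}
```

If the printed maximum is still at the JVM default, the Xmx setting is probably being applied to a different process than the one in which Crawl.main runs.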

best regards marcel :)
