Dear Weiwei,
I ran into a similar issue with the Nutch 1.2 release. It was already
discussed here:
http://search.lucidimagination.com/search/document/e63dfbb91194cbbd/cpu_100#464de23fdacc40f5
After running jstack (a tool in the bin/ directory of the Sun JDK that
takes the pid as its argument), I see around 200 running threads, each
with a stack trace like:
   java.lang.Thread.State: RUNNABLE
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:248)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
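If you want to check whether you are hitting the same thing, a rough sketch like the one below counts how many stack frames in a thread dump sit in FLVParser.parse. The function name count_flv_frames and the NUTCH_PID variable are made up for illustration; the pattern is simply the frame from the trace above.

```shell
#!/bin/sh
# Count the stack frames in a thread dump (read from stdin) that sit in
# Tika's FLVParser.parse. The pattern is copied from the trace above.
# count_flv_frames is a hypothetical helper name, not part of Nutch or the JDK.
count_flv_frames() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps the exit status clean.
    grep -c 'org\.apache\.tika\.parser\.video\.FLVParser\.parse' || true
}

# Usage (NUTCH_PID is a placeholder for the pid of your Nutch JVM):
#   jstack "$NUTCH_PID" | count_flv_frames
```

A count that keeps growing between two dumps taken a few minutes apart would suggest the parser threads are never finishing.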
The shipped jar that includes the Tika parser library is:
$NUTCH_HOME/plugins/parse-tika/tika-parsers-0.7.jar
I no longer ran into the problem once I switched to the Tika 0.8
snapshot. I guess one way to fix it is to replace the shipped jar with
one built from the SVN trunk using Maven:
$ export TIKA_HOME=$PWD/tika
$ svn co http://svn.apache.org/repos/asf/tika/trunk $TIKA_HOME
$ cd $TIKA_HOME
$ mvn install
$ rm $NUTCH_HOME/plugins/parse-tika/tika-parsers-0.7.jar
$ cp $TIKA_HOME/tika-parsers/target/tika-parsers-0.9-SNAPSHOT.jar \
    $NUTCH_HOME/plugins/parse-tika/
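To double-check the swap, something like the following lists the tika-parsers jars left in the plugin directory. This is only a sketch: list_tika_parser_jars is a name I made up, and it does nothing more than look at file names. I have not checked whether the plugin's plugin.xml also refers to the jar by file name, so that may be worth a look as well.

```shell
#!/bin/sh
# List the tika-parsers jars found in a directory, one file name per line.
# Prints nothing if there are none.
# list_tika_parser_jars is a hypothetical helper, not part of Nutch.
list_tika_parser_jars() {
    dir=$1
    for jar in "$dir"/tika-parsers-*.jar; do
        # If the glob matches nothing it stays literal, so test for existence.
        [ -e "$jar" ] && basename "$jar"
    done
    return 0
}

# Usage:
#   list_tika_parser_jars "$NUTCH_HOME/plugins/parse-tika"
```

After the steps above I would expect only the new snapshot jar to show up; if tika-parsers-0.7.jar is still listed, the old parser is probably still the one being loaded.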
Hope it helps. Please let us know whether that fixes your issue.
Alexis
On Sat, Nov 27, 2010 at 10:55 AM, Weiwei Xiong <[email protected]> wrote:
> Thanks for your tips Xiao.
>
> I am currently trying to use Nutch on a single machine, so I didn't change
> any Hadoop-related configuration. Or should I? I assume Nutch sets the
> default number of map/reduce tasks to 1. Is this true?
>
> If I have to change the Hadoop MapReduce configuration in a single-machine
> environment, could anyone tell me which file I should change?
> I tried to specify the number of map and reduce tasks, but it didn't
> work.
> Below is the configuration I added to mapred-site.xml:
>
> <property>
> <name>mapred.map.tasks</name>
> <value>1</value>
> </property>
> <property>
> <name>mapred.reduce.tasks</name>
> <value>1</value>
> </property>
>
>
> Thanks,
> -- Weiwei
>
> On Sat, Nov 27, 2010 at 7:36 AM, xiao yang <[email protected]> wrote:
>
>> Hi, Weiwei
>>
>> What about the configuration of Hadoop?
>> Maybe there're 10 processes with 1 thread each.
>>
>> Thanks!
>> Xiao
>>
>> On 11/27/10, Weiwei Xiong <[email protected]> wrote:
>> > Hi All,
>> >
>> > I am trying to use Nutch to crawl some websites, but CPU usage hits 100%
>> > once the crawl gets to depth 2 or 3. I can't do anything with the machine
>> > except stop the crawl. This happens even when I configure Nutch to use
>> > only ONE fetcher thread.
>> > One weird thing I noticed is that the number of threads keeps growing
>> > after it has been running for some time.
>> >
>> > Does anyone have any hint to solve this problem?
>> >
>> > Thanks.
>> > -- ww
>> >
>>
>