Re: Hadoop .20.205 & Nutch 1.3

Peyman Mohajerian Sat, 24 Dec 2011 14:09:23 -0800

I didn't mean to complain; you are absolutely right, and I will make
sure to contribute to the documentation. Nutch is a great application,
and I'm learning so much about parsing, Hadoop and other concepts as
part of dealing with Nutch.


I have a follow-up question. Now that I have managed to run Nutch with
Hadoop, I noticed that crawling a single URL, which takes about 10
seconds without Hadoop, takes almost 4 minutes with Hadoop (default
setup). I did expect extra overhead, but not this much. My use case is
this:

I need to crawl a large number of sites, which would typically make a
lot of sense to parallelize using Map/Reduce. But in my case I never
have to crawl a second time; I want to crawl once and index the result
in Solr. So the fact that the linkdb, segments and other data are
stored in HDFS doesn't really matter to me. Wouldn't it be faster if I
just split my input across multiple nodes and ran them at the same
time, all writing to a centralized Solr server, rather than paying the
overhead of Hadoop and slow HDFS? My understanding is that the real
value of using Hadoop is that the results are stored in HDFS, so the
next crawl doesn't start from scratch (incremental crawling). Also,
are parameters like 'fetcher.threads.fetch' independent of whether
you're using Hadoop or not?
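
To make the "split my input across multiple nodes" idea concrete, here
is a rough sketch in Python of how the seed list could be divided. It
is not part of Nutch: the function name shard_seed_urls and the
num_nodes parameter are made up for illustration. It assumes the seeds
are plain URLs and keeps every URL of a given host on the same shard,
so that no two nodes would fetch from the same site.

```python
import zlib
from urllib.parse import urlparse


def shard_seed_urls(urls, num_nodes):
    """Split seed URLs into num_nodes lists, keeping each host on one shard.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    shards = [[] for _ in range(num_nodes)]
    for url in urls:
        host = urlparse(url).netloc.lower()
        # crc32 gives the same value on every run and machine,
        # unlike Python's built-in hash() for strings.
        index = zlib.crc32(host.encode("utf-8")) % num_nodes
        shards[index].append(url)
    return shards


if __name__ == "__main__":
    seeds = [
        "http://a.example/page1",
        "http://a.example/page2",
        "http://b.example/",
        "http://c.example/index.html",
    ]
    for node, shard in enumerate(shard_seed_urls(seeds, 3)):
        print(node, shard)
```

Each shard would then be written to its own seed file and crawled by a
separate local Nutch run on its own node, all indexing into the same
Solr server. Whether that ends up faster than the Hadoop setup is
exactly what I am asking about.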

Thanks
Peyman

On Sat, Dec 24, 2011 at 9:56 AM, Julien Nioche
<[email protected]> wrote:
>> There is no need to run Nutch following the tutorial, the tutorial is
>> extremely out dated and confusing,
>>
>
> You are welcome to contribute and improve it
>
>
>
>> this worked for me:
>> bin/hadoop jar nutch-1.3.job org.apache.nutch.crawl.Crawl urls -dir
>> crawl -depth 3 -topN 50
>>
>> I got it from:
>>
>> http://www.marco.bianchi.name/myPortal/how-to-run-nutch-13-in-distributed-mode.aspx
>>
>> Thanks,
>> Peyman
>>
>> On Fri, Dec 23, 2011 at 2:11 AM, Markus Jelsma
>> <[email protected]> wrote:
>> > Something on your path is missing. What if you upgrade to Nutch 1.4 and
>> > try again?
>> >
>> > On Thursday 22 December 2011 02:47:21 Peyman Mohajerian wrote:
>> >> Hi Guys,
>> >>
>> >> I run Nutch fine without using Hadoop, but following:
>> >> http://wiki.apache.org/nutch/NutchHadoopTutorial
>> >> I get this error when I start crawling:
>> >> class not found exception on: org/apache/hadoop/util/PlatformName
>> >>
>> >> This class is in hadoop-core-0.20.2.jar that comes with Nutch1.3.
>> >> Initially i didn't copy this file to my 'nutch/lib' directory because
>> >> I assumed hadoop already has this jar and I don't have to copy it from
>> >> Nutch lib over. But due to the above error I decided to copy it over,
>> >> but it didn't help. I'm assuming there is a jar conflict at some
>> >> point. The tutorial is not clear, what I understand from it is that
>> >> I'm supposed to merge all the lib, bin, conf from both hadoop and
>> >> nutch in one location and there are some incompatible jars. I'm using
>> >> Hadoop  .20.205, Running any Map/Reduce job or copying stuff to hdfs
>> >> works just fine.
>> >>
>> >> Any ideas?
>> >>
>> >> Thanks
>> >> Peyman
>> >>
>> >> here is the stack:
>> >> peyman@ubuntu:/host/Users/Peyman/Documents/hadoop-0.20.205.0/nutch$ bin/nutch crawl /user/peyman/urls -dir fbprofilecrawl -depth 3 -topN 50
>> >> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName
>> >> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.PlatformName
>> >>       at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>> >>       at java.security.AccessController.doPrivileged(Native Method)
>> >>       at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>> >>       at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>> >>       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>> >>       at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>> >> Could not find the main class: org.apache.hadoop.util.PlatformName. Program will exit.
>> >> solrUrl is not set, indexing will be skipped...
>> >
>> > --
>> > Markus Jelsma - CTO - Openindex
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
