Ok, I tweaked the code a bit to extract the html as is from the parser, only to realize that it is too much text and the crawl goes too deep. So I am looking to see if I can somehow limit the depth. The Nutch 1.x docs mention a -depth parameter; however, I do not see this in nutch-default.xml under Nutch 2.x. The -topN parameter controls the number of links per depth. So for Nutch 2.x, where/how do I set the depth?
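[Editor's note: a hedged sketch of how depth is usually expressed in Nutch 2.x. There is no -depth property in nutch-default.xml because depth is not a property: each generate/fetch/parse/updatedb cycle goes one link-hop deeper, so the depth is the number of rounds you run (the 2.x bin/crawl script takes this count as an argument). The DEPTH variable and the commented bin/nutch calls below are illustrative, not an official interface.]

```shell
# Sketch: "depth" in Nutch 2.x = number of crawl rounds, not a config key.
DEPTH=2
for round in $(seq 1 "$DEPTH"); do
  echo "round $round: generate -> fetch -> parse -> updatedb"
  # bin/nutch generate -topN 1000   # -topN still caps links per round
  # bin/nutch fetch -all
  # bin/nutch parse -all
  # bin/nutch updatedb
done
```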
On Fri, Jun 28, 2013 at 11:32 AM, h b <[email protected]> wrote:

  Ok, so I also got this working with Solr 4 with no errors; I think the
  key was not using a crawl id. I had to comment out the updateLog in
  solrconfig.xml because I got a "_version_"-related error.

  My next question is: my Solr document, and for that matter even the
  HBase value of the html content, is not HTML. It appears that Nutch is
  extracting text only. How do I retain the HTML content "as is"?

On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil <[email protected]> wrote:

  Kewl !!

  I wonder why "org.apache.solr.common.SolrException: undefined field
  text" happens. Anybody who can throw light on this?

On Fri, Jun 28, 2013 at 10:45 AM, h b <[email protected]> wrote:

  Thanks Tejas. I tried these steps; one step I added was updatedb:

      bin/nutch updatedb

  Just to be consistent with the doc, and your suggestion on some other
  thread, I used Solr 3.6 instead of 4.x. I copied the schema.xml from
  nutch/conf (root level) and started Solr. It failed with:

      SEVERE: org.apache.solr.common.SolrException: undefined field text

  One of the Google threads suggested I ignore this error, so I ignored
  it and indexed anyway. So now I got it to work. Playing some more with
  the queries.

On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil <[email protected]> wrote:

  The "storage.schema.webpage" seems messed up but I don't have ample
  time now to look into it. Here is what I would suggest to get things
  working:

  [1] Remove all the old data from HBase

  (I assume that HBase is running while you do this)

      cd $HBASE_HOME
      ./bin/hbase shell

  In the HBase shell, use "list" to see all the tables, and delete all
  of those related to Nutch (the ones named *webpage). Remove them using
  the "disable" and "drop" commands. E.g., if one of the tables is
  "webpage", you would run:

      disable 'webpage'
      drop 'webpage'

  [2] Run crawl

  I assume that you have not changed "storage.schema.webpage" in
  nutch-site.xml and nutch-default.xml. If you have, revert it to:

      <property>
        <name>storage.schema.webpage</name>
        <value>webpage</value>
        <description>This value holds the schema name used for Nutch web db.
          Note that Nutch ignores the value in the gora mapping files, and uses
          this as the webpage schema name.
        </description>
      </property>

  Run the crawl commands:

      bin/nutch inject urls/
      bin/nutch generate -topN 50000 -noFilter -adddays 0
      bin/nutch fetch -all -threads 5
      bin/nutch parse -all

  [3] Perform indexing

  I assume that you have Solr set up and NUTCH_HOME/conf/schema.xml
  copied into ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0]
  for details. Start Solr and run the indexing command:

      bin/nutch solrindex $SOLR_URL -all

  [0] : http://wiki.apache.org/nutch/NutchTutorial

  Thanks,
  Tejas

On Thu, Jun 27, 2013 at 1:47 PM, h b <[email protected]> wrote:

  Ok, so Avro did not work quite well for me. I got a test grid with
  HBase, and I started using that for now. All steps ran without errors
  and I see my crawled doc in HBase. However, after running the Solr
  integration and querying Solr, I get back nothing. The index files
  look very tiny. The one thing I noted is a message during almost every
  step:

      13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass match
      but mismatching table names mappingfile schema is 'webpage' vs actual
      schema 'crawl2_webpage' , assuming they are the same.

  This looks suspicious and I think this is the one causing the Solr
  index to be empty. Googling suggested I should edit nutch-default.xml;
  I tried that and rebuilt the job, but no luck with this message.

On Thu, Jun 27, 2013 at 10:30 AM, h b <[email protected]> wrote:

  Ok, I ran ant, ant jar, and ant job, and that seems to have picked up
  the config changes. Now the inject output shows that it is using
  AvroStore as the Gora storage.

  Now I am getting a NullPointerException:

      java.lang.NullPointerException
        at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
        at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)

  which does not look Nutch related. I will work on this and write back
  if I get stuck on something else, or will write back if I succeed.

On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]> wrote:

  Hi Lewis,

  Sorry for missing that one. So I updated the top-level conf and
  rebuilt the job:

      cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
      ......
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.avro.store.AvroStore</value>
      </property>
      ......

      cd ~/nutch/apache-nutch-2.2/
      ant job
      cd ~/nutch/apache-nutch-2.2/runtime/deploy/

      bin/nutch inject urls -crawlId crawl1
      13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at 2013-06-27 17:12:01
      13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls
      13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class
      org.apache.gora.memory.store.MemStore as the Gora storage class.

  It still shows me MemStore.

  In the jobtracker I see a [crawl1]inject urls job, but it does not
  have a urls_injected property. I have a db.score.injected of 1.0, but
  I don't think that says anything about urls injected.

On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <[email protected]> wrote:

  Hi,
  Please re-read my mail. If you are using the deploy directory, e.g.
  running on a Hadoop cluster, then make sure to edit nutch-site.xml
  from within the top-level conf directory, _not_ the conf directory in
  runtime/local. If you look at the ant runtime target in the build
  script you will see the code which generates the runtime directory
  structure. Make changes to conf/nutch-site.xml, build the job jar,
  navigate to runtime/deploy, run the code. It's easier to make the job
  jar and scripts in deploy available to the job tracker.

  You also didn't comment on the counters for the inject job. Do you see
  any?

  Best
  Lewis

On Wednesday, June 26, 2013, h b <[email protected]> wrote:

  Here is an example of what I am saying about the config changes not
  taking effect:

      cd runtime/deploy
      cat ../local/conf/nutch-site.xml
      ......
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.avro.store.AvroStore</value>
      </property>
      .....

      cd ../..
      ant job
      cd runtime/deploy
      bin/nutch inject urls -crawlId crawl1
      .....
      13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
      org.apache.gora.memory.store.MemStore as the Gora storage class.
      .....

  So nutch-site.xml was changed to use AvroStore as the storage class,
  the job was rebuilt, and I reran inject, the output of which still
  shows that it is trying to use MemStore.

On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <[email protected]> wrote:

  The Gora MemStore was introduced to deal predominantly with test
  scenarios. This is justified as the 2.x code is pulled nightly and
  after every commit and tested. It is not thread safe and should not be
  used (until we fix some issues) for any kind of serious deployment.

  From your inject task on the job tracker, you will be able to see
  'urls_injected' counters which represent the number of urls actually
  persisted through Gora into the datastore.

  I understand that HBase is not an option. Gora should also support
  writing the output into Avro sequence files... which can be pumped
  into HDFS. We have done some work on this, so I suppose that right now
  is as good a time as any for you to try it out. Use the default
  datastore as org.apache.gora.avro.store.AvroStore, I think. You can
  double check by looking into gora.properties.

  As a note, you should use nutch-site.xml within the top-level conf
  directory for all your Nutch configuration. You should then create a
  new job jar for use in Hadoop by calling 'ant job' after the changes
  are made.

  hth
  Lewis

On Wednesday, June 26, 2013, h b <[email protected]> wrote:

  The quick responses flowing are very encouraging. Thanks Tejas.

  Tejas, as I mentioned earlier, I actually ran it step by step. So
  first I ran the inject command and then readdb with the dump option,
  and did not see anything in the dump files; that leads me to say that
  the inject did not work. I verified the regex-urlfilter and made sure
  that my url is not getting filtered.

  I agree that the second link is about configuring HBase as a storage
  DB. However, I do not have HBase installed and don't foresee getting
  it installed any time soon, hence using HBase for storage is not an
  option, so I am going to have to stick to Gora with the memory store.

On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:

  On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:

  > Thanks for the response Lewis. I did read these links; I mostly
  > followed the first link and tried both the 3.2 and 3.3 sections.
  > Using bin/crawl gave me a null pointer exception on Solr, so I
  > figured that I should first deal with getting the crawl part to
  > work and then deal with Solr indexing. Hence I went back to trying
  > it stepwise.

  You should try running the crawl using individual commands and see
  where the problem is. The Nutch tutorial which Lewis pointed you to
  has those commands. Even peeking into the bin/crawl script would also
  help, as it calls the nutch commands.

  > As for the second link, it is more about using HBase as the store
  > instead of Gora. This is not really an option for me yet, because
  > my grid does not have HBase installed yet. Getting it done is not
  > much under my control.

  HBase is one of the datastores supported by Apache Gora. That tutorial
  speaks about how to configure Nutch (actually Gora) to use HBase as a
  backend. So, it's wrong to say that the tutorial was about HBase and
  not Gora.

  > The FAQ link is the one I had not gone through until I checked your
  > response, but I do not find answers to any of my questions
  > (directly/indirectly) in it.

  Ok

  > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <

  --
  Lewis

