Yes, Nutch parses the HTML and extracts the text content from it. Tweaking
the code around the parser is how you would keep the raw HTML instead. If
you did something else, would you mind sharing it?

The "-depth" option is used by the Crawl class in 1.x, which is deprecated
in 2.x. Use the bin/crawl script instead.
When running the "bin/crawl" script, the "<numberOfRounds>" argument is
effectively the depth to which you want the crawling to be performed.
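For example, a bin/crawl invocation for depth 3 might look like the
following; the seed directory, crawl id, and Solr URL here are placeholder
assumptions (argument order per the 2.x crawl script's usage line):

```shell
# bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
# "urls/", "crawl1", and the Solr URL below are hypothetical values.
bin/crawl urls/ crawl1 http://localhost:8983/solr/ 3
```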

If you want to use the individual commands instead, run generate -> fetch
-> parse -> update multiple times; the crawl script does the same thing
internally.
eg. if you want to fetch till depth 3, this is how you could do it:
inject -> (generate -> fetch -> parse -> update)
          -> (generate -> fetch -> parse -> update)
          -> (generate -> fetch -> parse -> update)
               -> solrindex
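The sequence above can be sketched as a small POSIX shell script. The -topN
value, seed dir "urls", and Solr URL are placeholder assumptions; by
default the sketch only prints the command sequence (dry run) — point
NUTCH at the real script (e.g. NUTCH=runtime/deploy/bin/nutch) to execute
it for real.

```shell
# Dry-run sketch of: inject -> (generate -> fetch -> parse -> updatedb) x N
#                    -> solrindex
crawl_rounds() {
    NUTCH="${NUTCH:-echo bin/nutch}"   # default: just echo the commands
    depth="$1"
    $NUTCH inject urls
    round=1
    while [ "$round" -le "$depth" ]; do
        $NUTCH generate -topN 1000
        $NUTCH fetch -all
        $NUTCH parse -all
        $NUTCH updatedb
        round=$((round + 1))
    done
    $NUTCH solrindex http://localhost:8983/solr/ -all
}

crawl_rounds 3
```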

On Fri, Jun 28, 2013 at 7:24 PM, h b <[email protected]> wrote:

> Ok, I tweaked the code a bit to extract the html as-is from the parser,
> only to realize that it is too much text and too much crawl depth. So I
> am looking to see if I can somehow limit the depth. The Nutch 1.x docs
> mention the -depth parameter, but I do not see it in nutch-default.xml
> under Nutch 2.x. The -topN option limits the number of links per depth.
> So for Nutch 2.x, where/how do I set the depth?
>
>
> On Fri, Jun 28, 2013 at 11:32 AM, h b <[email protected]> wrote:
>
> > Ok, so I also got this to work with Solr 4 with no errors; I think the
> > key was not using a crawl id.
> > I had to comment out the updateLog in solrconfig.xml because I got a
> > "_version_"-related error.
> >
> > My next question is: my solr document, and for that matter even the
> > hbase value of the html content, is 'not html'. It appears that nutch
> > is extracting text only. How do I retain the html content "as is"?
> >
> >
> > On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil <[email protected]> wrote:
> >
> >> Kewl !!
> >>
> >> I wonder why "org.apache.solr.common.SolrException: undefined field
> >> text" happens. Can anybody throw some light on this?
> >>
> >>
> >> On Fri, Jun 28, 2013 at 10:45 AM, h b <[email protected]> wrote:
> >>
> >> > Thanks Tejas.
> >> > I tried these steps. One step I added was updatedb:
> >> >
> >> > *bin/nutch updatedb*
> >> >
> >> > Just to be consistent with the doc, and your suggestion on some other
> >> > thread, I used solr 3.6 instead of 4.x.
> >> > I copied the schema.xml from nutch/conf (root level) and started
> >> > solr. It failed with
> >> >
> >> > SEVERE: org.apache.solr.common.SolrException: undefined field text
> >> >
> >> >
> >> > One of the Google threads suggested I ignore this error, so I
> >> > ignored it and indexed anyway.
> >> >
> >> > So now I got it to work. Playing some more with the queries
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil <[email protected]> wrote:
> >> >
> >> > > The "storage.schema.webpage" setting seems messed up, but I don't
> >> > > have ample time right now to look into it. Here is what I would
> >> > > suggest to get things working:
> >> > >
> >> > > *[1] Remove all the old data from HBase*
> >> > >
> >> > > (I assume that HBase is running while you do this)
> >> > > *cd $HBASE_HOME*
> >> > > *./bin/hbase shell*
> >> > > In the HBase shell, use "list" to see all the tables and find all
> >> > > of those related to Nutch (ones named *webpage).
> >> > > Remove them using the "disable" and "drop" commands.
> >> > >
> >> > > eg. if one of the tables is "webpage", you would run this:
> >> > > *disable 'webpage'*
> >> > > *drop 'webpage'*
> >> > >
> >> > > *[2] Run crawl*
> >> > > I assume that you have not changed "storage.schema.webpage" in
> >> > > nutch-site.xml or nutch-default.xml. If you have, revert it to:
> >> > >
> >> > > *<property>*
> >> > > *  <name>storage.schema.webpage</name>*
> >> > > *  <value>webpage</value>*
> >> > > *  <description>This value holds the schema name used for Nutch web db.*
> >> > > *  Note that Nutch ignores the value in the gora mapping files, and uses*
> >> > > *  this as the webpage schema name.*
> >> > > *  </description>*
> >> > > *</property>*
> >> > >
> >> > > Run crawl commands:
> >> > > *bin/nutch inject urls/*
> >> > > *bin/nutch generate -topN 50000  -noFilter -adddays 0*
> >> > > *bin/nutch fetch -all -threads 5  *
> >> > > *bin/nutch parse -all *
> >> > >
> >> > > *[3] Perform indexing*
> >> > > I assume that you have Solr set up and NUTCH_HOME/conf/schema.xml
> >> > > copied into ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0]
> >> > > for details.
> >> > > Start solr and run the indexing command:
> >> > > *bin/nutch solrindex $SOLR_URL -all*
> >> > >
> >> > > [0] : http://wiki.apache.org/nutch/NutchTutorial
> >> > >
> >> > > Thanks,
> >> > > Tejas
> >> > >
> >> > > On Thu, Jun 27, 2013 at 1:47 PM, h b <[email protected]> wrote:
> >> > >
> >> > > > Ok, so avro did not work quite well for me. I got a test grid
> >> > > > with hbase, and I started using that for now. All steps ran
> >> > > > without errors and I see my crawled doc in hbase.
> >> > > > However, after running the solr integration and querying solr, I
> >> > > > get back nothing. The index files look very tiny. The one thing I
> >> > > > noted is a message during almost every step:
> >> > > >
> >> > > > 13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass
> >> > > > match but mismatching table names  mappingfile schema is
> >> > > > 'webpage' vs actual schema 'crawl2_webpage' , assuming they are
> >> > > > the same.
> >> > > >
> >> > > > This looks suspicious and I think this is what is causing the
> >> > > > solr index to be empty. Googling suggested I should edit
> >> > > > nutch-default.xml; I tried that and rebuilt the job, but no luck
> >> > > > with this message.
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Thu, Jun 27, 2013 at 10:30 AM, h b <[email protected]> wrote:
> >> > > >
> >> > > > > Ok, I ran ant, ant jar and ant job, and that seems to have
> >> > > > > picked up the config changes.
> >> > > > > Now the inject output shows that it is using AvroStore as the
> >> > > > > Gora storage.
> >> > > > >
> >> > > > > Now I am getting a NullPointerException:
> >> > > > >
> >> > > > > java.lang.NullPointerException
> >> > > > >         at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
> >> > > > >         at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
> >> > > > >         at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
> >> > > > >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
> >> > > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> >> > > > >         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> >> > > > >         at java.security.AccessController.doPrivileged(Native Method)
> >> > > > >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >> > > > >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
> >> > > > >         at org.apache.hadoop.mapred.Child.main(Child.java:264)
> >> > > > >
> >> > > > > which does not look Nutch-related. I will work on this and
> >> > > > > write back if I get stuck on something else, or if I succeed.
> >> > > > >
> >> > > > >
> >> > > > > On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]> wrote:
> >> > > > >
> >> > > > >> Hi Lewis,
> >> > > > >>
> >> > > > >> Sorry for missing that one. So I updated the top-level conf
> >> > > > >> and rebuilt the job.
> >> > > > >>
> >> > > > >> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
> >> > > > >>
> >> > > > >> ......
> >> > > > >>   <property>
> >> > > > >>     <name>storage.data.store.class</name>
> >> > > > >>     <value>org.apache.gora.avro.store.AvroStore</value>
> >> > > > >>   </property>
> >> > > > >> ......
> >> > > > >>
> >> > > > >> cd ~/nutch/apache-nutch-2.2/
> >> > > > >> ant job
> >> > > > >> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
> >> > > > >>
> >> > > > >>
> >> > > > >> bin/nutch inject urls -crawlId crawl1
> >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at 2013-06-27 17:12:01
> >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls
> >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
> >> > > > >>
> >> > > > >> It still shows me MemStore.
> >> > > > >>
> >> > > > >> In the jobtracker I see that the [crawl1]inject urls job does
> >> > > > >> not have a urls_injected counter. I have *db.score.injected*
> >> > > > >> at 1.0, but I don't think that says anything about the urls
> >> > > > >> injected.
> >> > > > >>
> >> > > > >>
> >> > > > >>
> >> > > > >> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <[email protected]> wrote:
> >> > > > >>
> >> > > > >>> Hi,
> >> > > > >>> Please re-read my mail.
> >> > > > >>> If you are using the deploy directory, e.g. running on a
> >> > > > >>> hadoop cluster, then make sure to edit nutch-site.xml from
> >> > > > >>> within the top-level conf directory _not_ the conf directory
> >> > > > >>> in runtime/local.
> >> > > > >>> If you look at the ant runtime target in the build script you
> >> > > > >>> will see the code which generates the runtime directory
> >> > > > >>> structure.
> >> > > > >>> Make changes to conf/nutch-site.xml, build the job jar,
> >> > > > >>> navigate to runtime/deploy, and run the code.
> >> > > > >>> It's easier to make the job jar and scripts in deploy
> >> > > > >>> available to the job tracker.
> >> > > > >>> You also didn't comment on the counters for the inject job.
> >> > > > >>> Do you see any?
> >> > > > >>> Best
> >> > > > >>> Lewis
> >> > > > >>>
> >> > > > >>> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> >> > > > >>> > Here is an example of what I am saying about the config
> >> > > > >>> > changes not taking effect.
> >> > > > >>> >
> >> > > > >>> > cd runtime/deploy
> >> > > > >>> > cat ../local/conf/nutch-site.xml
> >> > > > >>> > ......
> >> > > > >>> >
> >> > > > >>> >   <property>
> >> > > > >>> >     <name>storage.data.store.class</name>
> >> > > > >>> >     <value>org.apache.gora.avro.store.AvroStore</value>
> >> > > > >>> >   </property>
> >> > > > >>> > .....
> >> > > > >>> >
> >> > > > >>> > cd ../..
> >> > > > >>> >
> >> > > > >>> > ant job
> >> > > > >>> >
> >> > > > >>> > cd runtime/deploy
> >> > > > >>> > bin/nutch inject urls -crawlId crawl1
> >> > > > >>> > .....
> >> > > > >>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
> >> > > > >>> > .....
> >> > > > >>> >
> >> > > > >>> > So nutch-site.xml was changed to use AvroStore as the
> >> > > > >>> > storage class, the job was rebuilt, and I reran inject, the
> >> > > > >>> > output of which still shows that it is trying to use
> >> > > > >>> > MemStore.
> >> > > > >>> >
> >> > > > >>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >> > > > >>> >
> >> > > > >>> >> The Gora MemStore was introduced predominantly for test
> >> > > > >>> >> scenarios. This is justified, as the 2.x code is pulled
> >> > > > >>> >> nightly and after every commit, and tested.
> >> > > > >>> >> It is not thread safe and should not be used (until we fix
> >> > > > >>> >> some issues) for any kind of serious deployment.
> >> > > > >>> >> From your inject task on the job tracker, you will be able
> >> > > > >>> >> to see 'urls_injected' counters which represent the number
> >> > > > >>> >> of urls actually persisted through Gora into the
> >> > > > >>> >> datastore.
> >> > > > >>> >> I understand that HBase is not an option. Gora should also
> >> > > > >>> >> support writing the output into Avro sequence files...
> >> > > > >>> >> which can be pumped into hdfs. We have done some work on
> >> > > > >>> >> this, so I suppose that right now is as good a time as any
> >> > > > >>> >> for you to try it out.
> >> > > > >>> >> Use the default datastore as
> >> > > > >>> >> org.apache.gora.avro.store.AvroStore, I think. You can
> >> > > > >>> >> double-check by looking into gora.properties.
> >> > > > >>> >> As a note, you should use nutch-site.xml within the
> >> > > > >>> >> top-level conf directory for all your Nutch configuration.
> >> > > > >>> >> You should then create a new job jar for use in hadoop by
> >> > > > >>> >> calling 'ant job' after the changes are made.
> >> > > > >>> >> hth
> >> > > > >>> >> Lewis
> >> > > > >>> >>
> >> > > > >>> >> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
> >> > > > >>> >> > The quick responses flowing in are very encouraging.
> >> > > > >>> >> > Thanks Tejas.
> >> > > > >>> >> > Tejas, as I mentioned earlier, I actually did run it
> >> > > > >>> >> > step by step.
> >> > > > >>> >> >
> >> > > > >>> >> > So first I ran the inject command and then readdb with
> >> > > > >>> >> > the dump option, and did not see anything in the dump
> >> > > > >>> >> > files; that leads me to say that the inject did not
> >> > > > >>> >> > work. I verified the regex-urlfilter and made sure that
> >> > > > >>> >> > my url is not getting filtered.
> >> > > > >>> >> >
> >> > > > >>> >> > I agree that the second link is about configuring HBase
> >> > > > >>> >> > as a storage DB. However, I do not have HBase installed
> >> > > > >>> >> > and don't foresee getting it installed any time soon, so
> >> > > > >>> >> > using HBase for storage is not an option, and I am going
> >> > > > >>> >> > to have to stick to Gora with the memory store.
> >> > > > >>> >> >
> >> > > > >>> >> >
> >> > > > >>> >> >
> >> > > > >>> >> >
> >> > > > >>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:
> >> > > > >>> >> >
> >> > > > >>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
> >> > > > >>> >> >>
> >> > > > >>> >> >> > Thanks for the response Lewis.
> >> > > > >>> >> >> > I did read these links. I mostly followed the first
> >> > > > >>> >> >> > link and tried both the 3.2 and 3.3 sections. Using
> >> > > > >>> >> >> > bin/crawl gave me a null pointer exception on solr,
> >> > > > >>> >> >> > so I figured that I should first deal with getting
> >> > > > >>> >> >> > the crawl part to work and then deal with solr
> >> > > > >>> >> >> > indexing. Hence I went back to trying it stepwise.
> >> > > > >>> >> >> >
> >> > > > >>> >> >>
> >> > > > >>> >> >> You should try running the crawl using individual
> >> > > > >>> >> >> commands and see where the problem is. The nutch
> >> > > > >>> >> >> tutorial which Lewis pointed you to had those commands.
> >> > > > >>> >> >> Even peeking into the bin/crawl script would also help,
> >> > > > >>> >> >> as it calls the nutch commands.
> >> > > > >>> >> >> >
> >> > > > >>> >> >> > As for the second link, it is more about using HBase
> >> > > > >>> >> >> > as the store instead of gora. This is not really an
> >> > > > >>> >> >> > option for me yet, because my grid does not have
> >> > > > >>> >> >> > hbase installed yet. Getting that done is not much
> >> > > > >>> >> >> > under my control.
> >> > > > >>> >> >> >
> >> > > > >>> >> >>
> >> > > > >>> >> >> HBase is one of the datastores supported by Apache
> >> > > > >>> >> >> Gora. That tutorial speaks about how to configure Nutch
> >> > > > >>> >> >> (actually Gora) to use HBase as a backend. So, it's
> >> > > > >>> >> >> wrong to say that the tutorial was about HBase and not
> >> > > > >>> >> >> Gora.
> >> > > > >>> >> >>
> >> > > > >>> >> >> > The FAQ link is the one I had not gone through until
> >> > > > >>> >> >> > I checked your response, but I do not find answers to
> >> > > > >>> >> >> > any of my questions (directly/indirectly) in it.
> >> > > > >>> >> >> >
> >> > > > >>> >> >>
> >> > > > >>> >> >> Ok
> >> > > > >>> >> >>
> >> > > > >>> >> >> >
> >> > > > >>> >> >> >
> >> > > > >>> >> >> >
> >> > > > >>> >> >> >
> >> > > > >>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney
> <
> >> > > > >>> >> >> > > *Lewis*
> >> > > > >>> >>
> >> > > > >>> >
> >> > > > >>>
> >> > > > >>> --
> >> > > > >>> *Lewis*
> >> > > > >>>
> >> > > > >>
> >> > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
