Because we have a separate non-Java legacy process that would take care of
the parsing, and it requires raw HTML. It's more of a process reason than
anything else.
On Jun 30, 2013 8:06 AM, "Tejas Patil" <[email protected]> wrote:

> I am curious to know why you needed the raw html content instead of the
> parsed stuff. Search engines are meant to index parsed text. The data to be
> stored and indexed is reduced after parsing.
>
>
> On Sat, Jun 29, 2013 at 9:20 PM, h b <[email protected]> wrote:
>
> > Thanks Tejas,
> > I have just 2 urls in my seed file, and the second run of fetch ran for a
> > few hours. I will verify if I got what I wanted.
> >
> > Regarding the raw html, it's an ugly hack, so I did not really create a
> > patch. But this is what I did:
> >
> >
> > In
> >
> src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> > getParse method,
> >
> >       //text = sb.toString();
> >       text = new String(page.getContent().array());
> >
> > Would be nice to make this a configuration option in the plugin xml.
> >
> > Other thing I will try soon is to extract the content only for a specific
> > depth.
> >
> >
> >
> > On Sat, Jun 29, 2013 at 12:49 AM, Tejas Patil <[email protected]
> > >wrote:
> >
> > > Yes. Nutch would parse the HTML and extract the content out of it.
> > > Tweaking the code around the parser would have made that happen. If
> > > you did something else, would you mind sharing it?
> > >
> > > The "depth" is used by the Crawl class in 1.x which is deprecated in
> 2.x.
> > > Use bin/crawl instead.
> > > While running the "bin/crawl" script, the "<numberOfRounds>" option is
> > > nothing but the depth till which you want the crawling to be performed.
> > >
> > > If you want to use the individual commands instead, run generate ->
> fetch
> > > -> parse -> update multiple times. The crawl script internally does the
> > > same thing.
> > > eg. If you want to fetch till depth 3, this is how you could do:
> > > inject -> (generate -> fetch -> parse -> update)
> > >           -> (generate -> fetch -> parse -> update)
> > >           -> (generate -> fetch -> parse -> update)
> > >                -> solrindex
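Spelled out as a script, the loop above might look like the following sketch (the `crawl_to_depth` name is hypothetical, the `urls` seed dir and the `-topN`/`-threads` values are just the ones used elsewhere in this thread, and `bin/nutch` is assumed to be run from runtime/deploy or runtime/local):

```shell
# Sketch of a fixed-depth crawl loop for Nutch 2.x: one
# generate -> fetch -> parse -> update round per level of depth,
# then a single Solr indexing pass at the end.
# NUTCH defaults to bin/nutch; it can be overridden for testing.
crawl_to_depth() {
  depth=$1
  solr_url=$2
  nutch=${NUTCH:-bin/nutch}

  $nutch inject urls                 # seed the webdb once
  i=0
  while [ "$i" -lt "$depth" ]; do
    $nutch generate -topN 50000      # pick links for this round
    $nutch fetch -all -threads 5
    $nutch parse -all
    $nutch updatedb                  # feed new outlinks back in
    i=$((i + 1))
  done
  $nutch solrindex "$solr_url" -all  # index everything at the end
}
```

This mirrors what the bin/crawl script does internally, e.g. `crawl_to_depth 3 http://localhost:8983/solr` for the depth-3 example above.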
> > >
> > > On Fri, Jun 28, 2013 at 7:24 PM, h b <[email protected]> wrote:
> > >
> > > > Ok, I tweaked the code a bit to extract the html as-is from the
> > > > parser, only to realize that there is too much text and the crawl
> > > > goes too deep. So I am looking to see if I can somehow limit the
> > > > depth. The Nutch 1.x docs mention a -depth parameter; however, I do
> > > > not see this in nutch-default.xml under Nutch 2.x. The -topN option
> > > > is used for the number of links per depth. So for Nutch 2.x,
> > > > where/how do I set the depth?
> > > >
> > > >
> > > > On Fri, Jun 28, 2013 at 11:32 AM, h b <[email protected]> wrote:
> > > >
> > > > > Ok, so I also got this to work with Solr 4 with no errors; I
> > > > > think the key was not using a crawl id. I had to comment out the
> > > > > updateLog in solrconfig.xml because I got some "_version_" related
> > > > > error.
> > > > >
> > > > > My next question is: in my solr document, and for that matter
> > > > > even in the hbase value, the html content is 'not html'. It
> > > > > appears that nutch is extracting out text only. How do I retain
> > > > > the html content "as is"?
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil <
> > > [email protected]
> > > > >wrote:
> > > > >
> > > > >> Kewl !!
> > > > >>
> > > > >> I wonder why "org.apache.solr.common.SolrException: undefined
> > > > >> field text" happens. Anybody who can throw light on this?
> > > > >>
> > > > >>
> > > > >> On Fri, Jun 28, 2013 at 10:45 AM, h b <[email protected]> wrote:
> > > > >>
> > > > >> > Thanks Tejas
> > > > >> > I tried these steps. One step I added was updatedb:
> > > > >> >
> > > > >> > *bin/nutch updatedb*
> > > > >> >
> > > > >> > Just to be consistent with the doc, and your suggestion on
> > > > >> > some other thread, I used solr 3.6 instead of 4.x.
> > > > >> > I copied the schema.xml from nutch/conf (root level) and
> > > > >> > started solr. It failed with
> > > > >> >
> > > > >> > SEVERE: org.apache.solr.common.SolrException: undefined field text
> > > > >> >
> > > > >> >
> > > > >> > One of the Google threads suggested I ignore this error, so I
> > > > >> > ignored it and indexed anyway.
> > > > >> >
> > > > >> > So now I got it to work. Playing some more with the queries.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil <
> > > > [email protected]
> > > > >> > >wrote:
> > > > >> >
> > > > >> > > The "storage.schema.webpage" seems messed up but I don't have
> > > ample
> > > > >> time
> > > > >> > > now to look into it. Here is what I would suggest to get
> things
> > > > >> working:
> > > > >> > > *
> > > > >> > > *
> > > > >> > > *[1] Remove all the old data from HBase*
> > > > >> > >
> > > > >> > > (I assume that HBase is running while you do this)
> > > > >> > > *cd $HBASE_HOME*
> > > > >> > > *./bin/hbase shell
> > > > >> > > *
> > > > >> > > In the HBase shell, use "list" to see all the tables, delete
> all
> > > of
> > > > >> those
> > > > >> > > related to Nutch (ones named as *webpage).
> > > > >> > > Remove them using "disable" and "drop" commands.
> > > > >> > >
> > > > >> > > eg. if one of the tables is "webpage", you would run this:
> > > > >> > > *disable 'webpage'
> > > > >> > > *
> > > > >> > > *drop 'webpage'*
> > > > >> > > * *
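When several crawl runs have left multiple *webpage tables behind, the same disable/drop pair can be generated per table and piped into the HBase shell in one go. A sketch (`nutch_drop_cmds` is a hypothetical helper; the table names are the ones mentioned in this thread):

```shell
# Emit the HBase shell commands to disable and drop each table given.
# Pipe the output into the shell, e.g.:
#   nutch_drop_cmds webpage crawl2_webpage | $HBASE_HOME/bin/hbase shell
nutch_drop_cmds() {
  for t in "$@"; do
    printf "disable '%s'\ndrop '%s'\n" "$t" "$t"
  done
}
```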
> > > > >> > >
> > > > >> > > *[2] Run crawl*
> > > > >> > > I assume that you have not changed "storage.schema.webpage" in
> > > > >> > > nutch-site.xml and nutch-default.xml. If you have, revert it to:
> > > > >> > >
> > > > >> > > *<property>*
> > > > >> > > *  <name>storage.schema.webpage</**name>*
> > > > >> > > *  <value>webpage</value>*
> > > > >> > > *  <description>This value holds the schema name used for
> Nutch
> > > web
> > > > >> db.*
> > > > >> > > *  Note that Nutch ignores the value in the gora mapping
> files,
> > > and
> > > > >> uses*
> > > > >> > > *  this as the webpage schema name.*
> > > > >> > > *  </description>*
> > > > >> > > *</property>*
> > > > >> > >
> > > > >> > > Run crawl commands:
> > > > >> > > *bin/nutch inject urls/*
> > > > >> > > *bin/nutch generate -topN 50000  -noFilter -adddays 0*
> > > > >> > > *bin/nutch fetch -all -threads 5  *
> > > > >> > > *bin/nutch parse -all *
> > > > >> > >
> > > > >> > > *[3] Perform indexing*
> > > > >> > > I assume that you have Solr setup and
> NUTCH_HOME/conf/schema.xml
> > > > >> copied
> > > > >> > in
> > > > >> > > ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0] for
> > > details.
> > > > >> > > Start solr and run the indexing command:
> > > > >> > > *bin/nutch solrindex  $SOLR_URL -all *
> > > > >> > >
> > > > >> > > [0] : http://wiki.apache.org/nutch/NutchTutorial
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Tejas
> > > > >> > >
> > > > >> > > On Thu, Jun 27, 2013 at 1:47 PM, h b <[email protected]>
> wrote:
> > > > >> > >
> > > > >> > > > Ok, so avro did not work quite well for me, I got a test
> grid
> > > with
> > > > >> > hbase,
> > > > >> > > > and I started using that for now. All steps ran without
> errors
> > > > and I
> > > > >> > see
> > > > >> > > my
> > > > >> > > > crawled doc in hbase.
> > > > >> > > > However, after running the solr integration, and querying
> > solr,
> > > I
> > > > >> get
> > > > >> > > back
> > > > >> > > > nothing. Index files look very tiny. The one thing I noted
> is
> > a
> > > > >> message
> > > > >> > > > during almost every step
> > > > >> > > >
> > > > >> > > > 13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and
> > nameclass
> > > > >> match
> > > > >> > but
> > > > >> > > > mismatching table names  mappingfile schema is 'webpage' vs
> > > actual
> > > > >> > schema
> > > > >> > > > 'crawl2_webpage' , assuming they are the same.
> > > > >> > > >
> > > > >> > > > This looks suspicious, and I think it is causing the solr
> > > > >> > > > index to be empty. Googling suggested I should edit
> > > > >> > > > nutch-default.xml; I tried that and rebuilt the job, but no
> > > > >> > > > luck with this message.
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On Thu, Jun 27, 2013 at 10:30 AM, h b <[email protected]>
> > wrote:
> > > > >> > > >
> > > > >> > > > > Ok, I ran ant, ant jar and ant job, and that seems to
> > > > >> > > > > have picked up the config changes. Now the inject output
> > > > >> > > > > shows that it is using AvroStore as the Gora storage
> > > > >> > > > > class.
> > > > >> > > > >
> > > > >> > > > > Now I am getting a NullPointerException:
> > > > >> > > > >
> > > > >> > > > > java.lang.NullPointerException
> > > > >> > > > >         at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
> > > > >> > > > >         at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
> > > > >> > > > >         at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
> > > > >> > > > >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
> > > > >> > > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
> > > > >> > > > >         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
> > > > >> > > > >         at java.security.AccessController.doPrivileged(Native Method)
> > > > >> > > > >         at javax.security.auth.Subject.doAs(Subject.java:396)
> > > > >> > > > >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
> > > > >> > > > >         at org.apache.hadoop.mapred.Child.main(Child.java:264)
> > > > >> > > > >
> > > > >> > > > > which does not look nutch-related. I will work on this
> > > > >> > > > > and write back if I get stuck on something else, or if I
> > > > >> > > > > succeed.
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]>
> > > wrote:
> > > > >> > > > >
> > > > >> > > > >> Hi Lewis,
> > > > >> > > > >>
> > > > >> > > > >> Sorry for missing that one. So I updated the top level
> > > > >> > > > >> conf and rebuilt the job.
> > > > >> > > > >>
> > > > >> > > > >> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
> > > > >> > > > >>
> > > > >> > > > >> ......
> > > > >> > > > >>   <property>
> > > > >> > > > >>     <name>storage.data.store.class</name>
> > > > >> > > > >>     <value>org.apache.gora.avro.store.AvroStore</value>
> > > > >> > > > >>   </property>
> > > > >> > > > >> ......
> > > > >> > > > >>
> > > > >> > > > >> cd ~/nutch/apache-nutch-2.2/
> > > > >> > > > >> ant job
> > > > >> > > > >> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >> bin/nutch inject urls -crawlId crawl1
> > > > >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob:
> > > starting
> > > > >> at
> > > > >> > > > >> 2013-06-27 17:12:01
> > > > >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob:
> > > > Injecting
> > > > >> > > urlDir:
> > > > >> > > > >> urls
> > > > >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob:
> > Using
> > > > >> class
> > > > >> > > > >> org.apache.gora.memory.store.MemStore as the Gora storage
> > > > class.
> > > > >> > > > >>
> > > > >> > > > >> It still shows me MemStore.
> > > > >> > > > >>
> > > > >> > > > >> In the jobtracker I see that the [crawl1]inject urls
> > > > >> > > > >> job does not have a urls_injected counter. I have
> > > > >> > > > >> *db.score.injected* 1.0, but I don't think that says
> > > > >> > > > >> anything about the urls injected.
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <
> > > > >> > > > >> [email protected]> wrote:
> > > > >> > > > >>
> > > > >> > > > >>> Hi,
> > > > >> > > > >>> Please re-read my mail.
> > > > >> > > > >>> If you are using the deploy directory e.g. running on a
> > > hadoop
> > > > >> > > cluster,
> > > > >> > > > >>> then make sure to edit nutch-site.xml from within the
> top
> > > > level
> > > > >> > conf
> > > > >> > > > >>> directory _not_ the conf directory in runtime/local.
> > > > >> > > > >>> If you look at the ant runtime target in the build
> script
> > > you
> > > > >> will
> > > > >> > > see
> > > > >> > > > >>> the
> > > > >> > > > >>> code which generates the runtime directory structure.
> > > > >> > > > >>> Make changes to conf/nutch-site.xml, build the job jar,
> > > > >> navigate to
> > > > >> > > > >>> runtime/deploy, run the code.
> > > > >> > > > >>> It's easier to make the job jar and scripts in deploy
> > > > available
> > > > >> to
> > > > >> > > the
> > > > >> > > > >>> job
> > > > >> > > > >>> tracker.
> > > > >> > > > >>> You also didn't comment on the counters for the inject
> > job.
> > > Do
> > > > >> you
> > > > >> > > see
> > > > >> > > > >>> any?
> > > > >> > > > >>> Best
> > > > >> > > > >>> Lewis
> > > > >> > > > >>>
> > > > >> > > > >>> On Wednesday, June 26, 2013, h b <[email protected]>
> > wrote:
> > > > >> > > > >>> > Here is an example of what I am saying about the
> config
> > > > >> changes
> > > > >> > not
> > > > >> > > > >>> taking
> > > > >> > > > >>> > effect.
> > > > >> > > > >>> >
> > > > >> > > > >>> > cd runtime/deploy
> > > > >> > > > >>> > cat ../local/conf/nutch-site.xml
> > > > >> > > > >>> > ......
> > > > >> > > > >>> >
> > > > >> > > > >>> >   <property>
> > > > >> > > > >>> >     <name>storage.data.store.class</name>
> > > > >> > > > >>> >
> <value>org.apache.gora.avro.store.AvroStore</value>
> > > > >> > > > >>> >   </property>
> > > > >> > > > >>> > .....
> > > > >> > > > >>> >
> > > > >> > > > >>> > cd ../..
> > > > >> > > > >>> >
> > > > >> > > > >>> > ant job
> > > > >> > > > >>> >
> > > > >> > > > >>> > cd runtime/deploy
> > > > >> > > > >>> > bin/nutch inject urls -crawlId crawl1
> > > > >> > > > >>> > .....
> > > > >> > > > >>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob:
> > > Using
> > > > >> > class
> > > > >> > > > >>> > org.apache.gora.memory.store.MemStore as the Gora
> > storage
> > > > >> class.
> > > > >> > > > >>> > .....
> > > > >> > > > >>> >
> > > > >> > > > >>> > So nutch-site.xml was changed to use AvroStore as
> > > > >> > > > >>> > the storage class and the job was rebuilt, and I
> > > > >> > > > >>> > reran inject, the output of which still shows that it
> > > > >> > > > >>> > is trying to use MemStore.
> > > > >> > > > >>> >
> > > > >> > > > >>> >
> > > > >> > > > >>> >
> > > > >> > > > >>> >
> > > > >> > > > >>> >
> > > > >> > > > >>> >
> > > > >> > > > >>> >
> > > > >> > > > >>> >
> > > > >> > > > >>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John
> Mcgibbney <
> > > > >> > > > >>> > [email protected]> wrote:
> > > > >> > > > >>> >
> > > > >> > > > >>> >> The Gora MemStore was introduced to deal
> > > > >> > > > >>> >> predominantly with test scenarios. This is
> > > > >> > > > >>> >> justified, as the 2.x code is pulled nightly and
> > > > >> > > > >>> >> after every commit, and tested. It is not thread
> > > > >> > > > >>> >> safe and should not be used (until we fix some
> > > > >> > > > >>> >> issues) for any kind of serious deployment.
> > > > >> > > > >>> >> From your inject task on the job tracker, you will be
> > > able
> > > > to
> > > > >> > see
> > > > >> > > > >>> >> 'urls_injected' counters which represent the number
> of
> > > urls
> > > > >> > > actually
> > > > >> > > > >>> >> persisted through Gora into the datastore.
> > > > >> > > > >>> >> I understand that HBase is not an option. Gora
> > > > >> > > > >>> >> should also support writing the output into Avro
> > > > >> > > > >>> >> sequence files... which can be pumped into hdfs. We
> > > > >> > > > >>> >> have done some work on this, so I suppose that right
> > > > >> > > > >>> >> now is as good a time as any for you to try it out.
> > > > >> > > > >>> >> Use the default datastore as
> > > > >> > > > >>> >> org.apache.gora.avro.store.AvroStore, I think. You
> > > > >> > > > >>> >> can double check by looking into gora.properties.
> > > > >> > > > >>> >> As a note, you should use nutch-site.xml within the
> > > > >> > > > >>> >> top level conf directory for all your Nutch
> > > > >> > > > >>> >> configuration. You should then create a new job jar
> > > > >> > > > >>> >> for use in hadoop by calling 'ant job' after the
> > > > >> > > > >>> >> changes are made.
> > > > >> > > > >>> >> hth
> > > > >> > > > >>> >> Lewis
> > > > >> > > > >>> >>
> > > > >> > > > >>> >> On Wednesday, June 26, 2013, h b <[email protected]>
> > > wrote:
> > > > >> > > > >>> >> > The quick responses flowing in are very
> > > > >> > > > >>> >> > encouraging. Thanks, Tejas. As I mentioned
> > > > >> > > > >>> >> > earlier, I actually ran it step by step.
> > > > >> > > > >>> >> >
> > > > >> > > > >>> >> > So first I ran the inject command and then readdb
> > > > >> > > > >>> >> > with the dump option, and did not see anything in
> > > > >> > > > >>> >> > the dump files; that leads me to say that the
> > > > >> > > > >>> >> > inject did not work. I verified regex-urlfilter
> > > > >> > > > >>> >> > and made sure that my url is not getting filtered.
> > > > >> > > > >>> >> >
> > > > >> > > > >>> >> > I agree that the second link is about configuring
> > > > >> > > > >>> >> > HBase as a storage DB. However, I do not have
> > > > >> > > > >>> >> > HBase installed and don't foresee getting it
> > > > >> > > > >>> >> > installed any time soon, hence using HBase for
> > > > >> > > > >>> >> > storage is not an option, so I am going to have to
> > > > >> > > > >>> >> > stick to Gora with the memory store.
> > > > >> > > > >>> >> >
> > > > >> > > > >>> >> >
> > > > >> > > > >>> >> >
> > > > >> > > > >>> >> >
> > > > >> > > > >>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <
> > > > >> > > > >>> [email protected]
> > > > >> > > > >>> >> >wrote:
> > > > >> > > > >>> >> >
> > > > >> > > > >>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <
> > > [email protected]>
> > > > >> > wrote:
> > > > >> > > > >>> >> >>
> > > > >> > > > >>> >> >> > Thanks for the response Lewis.
> > > > >> > > > >>> >> >> > I did read these links; I mostly followed the
> > > > >> > > > >>> >> >> > first link and tried both the 3.2 and 3.3
> > > > >> > > > >>> >> >> > sections. Using bin/crawl gave me a null
> > > > >> > > > >>> >> >> > pointer exception on solr, so I figured I
> > > > >> > > > >>> >> >> > should first deal with getting the crawl part
> > > > >> > > > >>> >> >> > to work and then deal with solr indexing.
> > > > >> > > > >>> >> >> > Hence I went back to trying it stepwise.
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >>
> > > > >> > > > >>> >> >> You should try running the crawl using
> > > > >> > > > >>> >> >> individual commands and see where the problem is.
> > > > >> > > > >>> >> >> The nutch tutorial which Lewis pointed you to has
> > > > >> > > > >>> >> >> those commands. Peeking into the bin/crawl script
> > > > >> > > > >>> >> >> would also help, as it calls the nutch commands.
> > > > >> > > > >>> >> >>
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >> > As for the second link, it is more about using
> > HBase
> > > > as
> > > > >> > store
> > > > >> > > > >>> instead
> > > > >> > > > >>> >> of
> > > > >> > > > >>> >> >> > gora. This is not really a option for me yet,
> > cause
> > > my
> > > > >> grid
> > > > >> > > > does
> > > > >> > > > >>> not
> > > > >> > > > >>> >> have
> > > > >> > > > >>> >> >> > hbase installed yet. Getting it done is not much
> > > under
> > > > >> my
> > > > >> > > > control
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >>
> > > > >> > > > >>> >> >> HBase is one of the datastores supported by
> > > > >> > > > >>> >> >> Apache Gora. That tutorial speaks about how to
> > > > >> > > > >>> >> >> configure Nutch (actually Gora) to use HBase as a
> > > > >> > > > >>> >> >> backend. So, it's wrong to say that the tutorial
> > > > >> > > > >>> >> >> was about HBase and not Gora.
> > > > >> > > > >>> >> >>
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >> > the FAQ link is the one I had not gone through
> > > until I
> > > > >> > > checked
> > > > >> > > > >>> your
> > > > >> > > > >>> >> >> > response, but I do not find answers to any of my
> > > > >> questions
> > > > >> > > > >>> >> >> > (directly/indirectly) in it.
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >>
> > > > >> > > > >>> >> >> Ok
> > > > >> > > > >>> >> >>
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >> >
> > > > >> > > > >>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John
> > > Mcgibbney
> > > > <
> > > > >> > > > >>> >> >> > > *Lewis*
> > > > >> > > > >>> >>
> > > > >> > > > >>> >
> > > > >> > > > >>>
> > > > >> > > > >>> --
> > > > >> > > > >>> *Lewis*
> > > > >> > > > >>>
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
