Hi,
I started to inspect the content of the crawled html.
I have 2 urls in my seed.txt, so I should have just 2 documents in my Solr
response, right? I dropped the 'webpage' database and recreated it by running
just a single iteration of inject, generate, fetch, parse, solrindex.
However, I am seeing 8 different documents, and the url key in these does
not even match the urls from my seed. What could be going wrong here? It
almost feels like the crawl fetched entirely different pages than what I
requested. I verified the urls in my seed.txt and they do not redirect.
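
For reference, this is how I am inspecting what actually got stored (a rough
sketch; the exact readdb options may vary by version, and the scan assumes
the default 'webpage' table name):

bin/nutch readdb -dump /tmp/webpage_dump
echo "scan 'webpage'" | $HBASE_HOME/bin/hbase shell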


On Sun, Jun 30, 2013 at 8:40 AM, h b <[email protected]> wrote:

> Because we have a separate non Java legacy process that would take care of
> the parsing, and it requires raw html. It's more of a process reasoning
> than anything else.
> On Jun 30, 2013 8:06 AM, "Tejas Patil" <[email protected]> wrote:
>
>> I am curious to know why you needed the raw html content instead of the
>> parsed text. Search engines are meant to index parsed text, and the data
>> to be stored and indexed shrinks after parsing.
>>
>>
>> On Sat, Jun 29, 2013 at 9:20 PM, h b <[email protected]> wrote:
>>
>> > Thanks Tejas,
>> > I have just 2 urls in my seed file, and the second run of fetch ran for
>> > a few hours. I will verify if I got what I wanted.
>> >
>> > Regarding the raw html, it's an ugly hack, so I did not really create a
>> > patch. But this is what I did:
>> >
>> >
>> > In
>> > src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java,
>> > in the getParse method:
>> >
>> >       //text = sb.toString();
>> >       text = new String(page.getContent().array());
>> >
>> > It would be nice to make this a configuration option in the plugin xml.
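>> >
>> > Something like the following is what I have in mind (untested sketch; the
>> > "parser.html.raw" property name is made up, not an existing Nutch option,
>> > and it assumes the parser's Configuration field conf is in scope):
>> >
>> >       // untested sketch: keep raw html only when explicitly requested,
>> >       // otherwise fall back to the normal parsed text
>> >       if (conf.getBoolean("parser.html.raw", false)) {
>> >         text = new String(page.getContent().array());
>> >       } else {
>> >         text = sb.toString();
>> >       }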
>> >
>> > The other thing I will try soon is to extract the content only up to a
>> > specific depth.
>> >
>> >
>> >
>> > On Sat, Jun 29, 2013 at 12:49 AM, Tejas Patil <[email protected]> wrote:
>> >
>> > > Yes. Nutch would parse the HTML and extract the content out of it.
>> > > Tweaking the code around the parser would have made that happen. If you
>> > > did something else, would you mind sharing it?
>> > >
>> > > The "depth" is used by the Crawl class in 1.x, which is deprecated in
>> > > 2.x. Use bin/crawl instead.
>> > > When running the "bin/crawl" script, the "<numberOfRounds>" option is
>> > > nothing but the depth to which you want the crawling to be performed.
>> > >
>> > > If you want to use the individual commands instead, run generate ->
>> > > fetch -> parse -> update multiple times. The crawl script internally
>> > > does the same thing.
>> > > e.g. if you want to fetch to depth 3, this is how you could do it:
>> > > inject -> (generate -> fetch -> parse -> update)
>> > >           -> (generate -> fetch -> parse -> update)
>> > >           -> (generate -> fetch -> parse -> update)
>> > >                -> solrindex
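>> > >
>> > > In script form, that would look roughly like this (a sketch; adjust the
>> > > topN value and the Solr URL for your setup):
>> > >
>> > > bin/nutch inject urls/
>> > > for depth in 1 2 3; do
>> > >   bin/nutch generate -topN 1000
>> > >   bin/nutch fetch -all
>> > >   bin/nutch parse -all
>> > >   bin/nutch updatedb
>> > > done
>> > > bin/nutch solrindex http://localhost:8983/solr/ -all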
>> > >
>> > > On Fri, Jun 28, 2013 at 7:24 PM, h b <[email protected]> wrote:
>> > >
>> > > > Ok, I tweaked the code a bit to extract the html as-is from the
>> > > > parser, only to realize that there is too much text and too much
>> > > > depth of crawling. So I am looking to see if I can somehow limit the
>> > > > depth. The Nutch 1.x docs mention the -depth parameter; however, I do
>> > > > not see this in nutch-default.xml under Nutch 2.x. The -topN is the
>> > > > number of links per depth. So for Nutch 2.x, where/how do I set the
>> > > > depth?
>> > > >
>> > > >
>> > > > On Fri, Jun 28, 2013 at 11:32 AM, h b <[email protected]> wrote:
>> > > >
>> > > > > Ok, so I also got this to work with Solr 4 with no errors. I think
>> > > > > the key was not using a crawl id.
>> > > > > I had to comment out the updateLog section in solrconfig.xml because
>> > > > > I got a "_version_"-related error.
>> > > > >
>> > > > > My next question is: my solr document, and for that matter even the
>> > > > > hbase value of the html content, is 'not html'. It appears that
>> > > > > nutch is extracting out text only. How do I retain the html content
>> > > > > "as is"?
>> > > > >
>> > > > > On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil <[email protected]> wrote:
>> > > > >
>> > > > >> Kewl !!
>> > > > >>
>> > > > >> I wonder why "org.apache.solr.common.SolrException: undefined
>> > > > >> field text" happens. Can anybody throw some light on this?
>> > > > >>
>> > > > >>
>> > > > >> On Fri, Jun 28, 2013 at 10:45 AM, h b <[email protected]> wrote:
>> > > > >>
>> > > > >> > Thanks Tejas,
>> > > > >> > I tried these steps. One step I added was updatedb:
>> > > > >> >
>> > > > >> > *bin/nutch updatedb*
>> > > > >> >
>> > > > >> > Just to be consistent with the doc, and your suggestion on some
>> > > > >> > other thread, I used solr 3.6 instead of 4.x.
>> > > > >> > I copied the schema.xml from nutch/conf (root level) and started
>> > > > >> > solr. It failed with:
>> > > > >> >
>> > > > >> > SEVERE: org.apache.solr.common.SolrException: undefined field text
>> > > > >> >
>> > > > >> >
>> > > > >> > One of the Google threads suggested I ignore this error, so I
>> > > > >> > ignored it and indexed anyway.
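>> > > > >> >
>> > > > >> > (My guess at the cause: the request handler defaults in the Solr
>> > > > >> > example reference a field called "text" that the Nutch schema.xml
>> > > > >> > does not define. Presumably one could silence the error with a
>> > > > >> > catch-all field along these lines, untested, where the "text" type
>> > > > >> > must be a fieldType the schema actually defines:
>> > > > >> >
>> > > > >> > <field name="text" type="text" stored="false" indexed="true"
>> > > > >> >        multiValued="true"/> )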
>> > > > >> >
>> > > > >> > So now I got it to work. Playing some more with the queries
>> > > > >> >
>> > > > >> > On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil <[email protected]> wrote:
>> > > > >> >
>> > > > >> > > The "storage.schema.webpage" seems messed up but I don't have
>> > > ample
>> > > > >> time
>> > > > >> > > now to look into it. Here is what I would suggest to get
>> things
>> > > > >> working:
>> > > > >> > >
>> > > > >> > > *[1] Remove all the old data from HBase*
>> > > > >> > >
>> > > > >> > > (I assume that HBase is running while you do this)
>> > > > >> > > *cd $HBASE_HOME*
>> > > > >> > > *./bin/hbase shell*
>> > > > >> > > In the HBase shell, use "list" to see all the tables, and delete
>> > > > >> > > all of those related to Nutch (the ones named *webpage).
>> > > > >> > > Remove them using the "disable" and "drop" commands.
>> > > > >> > >
>> > > > >> > > e.g. if one of the tables is "webpage", you would run:
>> > > > >> > > *disable 'webpage'*
>> > > > >> > > *drop 'webpage'*
>> > > > >> > >
>> > > > >> > > *[2] Run crawl*
>> > > > >> > > I assume that you have not changed "storage.schema.webpage" in
>> > > > >> > > nutch-site.xml or nutch-default.xml. If you have, revert it to:
>> > > > >> > >
>> > > > >> > > *<property>*
>> > > > >> > > *  <name>storage.schema.webpage</name>*
>> > > > >> > > *  <value>webpage</value>*
>> > > > >> > > *  <description>This value holds the schema name used for Nutch
>> > > > >> > > web db.*
>> > > > >> > > *  Note that Nutch ignores the value in the gora mapping files,
>> > > > >> > > and uses this as the webpage schema name.*
>> > > > >> > > *  </description>*
>> > > > >> > > *</property>*
>> > > > >> > >
>> > > > >> > > Run crawl commands:
>> > > > >> > > *bin/nutch inject urls/*
>> > > > >> > > *bin/nutch generate -topN 50000  -noFilter -adddays 0*
>> > > > >> > > *bin/nutch fetch -all -threads 5  *
>> > > > >> > > *bin/nutch parse -all *
>> > > > >> > >
>> > > > >> > > *[3] Perform indexing*
>> > > > >> > > I assume that you have Solr set up and NUTCH_HOME/conf/schema.xml
>> > > > >> > > copied into ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in
>> > > > >> > > [0] for details.
>> > > > >> > > Start solr and run the indexing command:
>> > > > >> > > *bin/nutch solrindex $SOLR_URL -all*
>> > > > >> > >
>> > > > >> > > [0] : http://wiki.apache.org/nutch/NutchTutorial
>> > > > >> > >
>> > > > >> > > Thanks,
>> > > > >> > > Tejas
>> > > > >> > >
>> > > > >> > > On Thu, Jun 27, 2013 at 1:47 PM, h b <[email protected]> wrote:
>> > > > >> > >
>> > > > >> > > > Ok, so avro did not work quite well for me. I got a test grid
>> > > > >> > > > with hbase, and I started using that for now. All steps ran
>> > > > >> > > > without errors and I see my crawled doc in hbase.
>> > > > >> > > > However, after running the solr integration and querying solr,
>> > > > >> > > > I get back nothing. The index files look very tiny. The one
>> > > > >> > > > thing I noted is a message during almost every step:
>> > > > >> > > >
>> > > > >> > > > 13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass
>> > > > >> > > > match but mismatching table names  mappingfile schema is
>> > > > >> > > > 'webpage' vs actual schema 'crawl2_webpage' , assuming they are
>> > > > >> > > > the same.
>> > > > >> > > >
>> > > > >> > > > This looks suspicious, and I think this is what is causing the
>> > > > >> > > > solr index to be empty. Googling suggested I should edit
>> > > > >> > > > nutch-default.xml; I tried that and rebuilt the job, but no
>> > > > >> > > > luck with this message.
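>> > > > >> > > >
>> > > > >> > > > (For the record, my guess from the warning text was to make the
>> > > > >> > > > configured schema name match the actual table, since I had
>> > > > >> > > > injected with a crawl id of crawl2, i.e. something like this in
>> > > > >> > > > nutch-site.xml, untested:
>> > > > >> > > >
>> > > > >> > > > <property>
>> > > > >> > > >   <name>storage.schema.webpage</name>
>> > > > >> > > >   <value>crawl2_webpage</value>
>> > > > >> > > > </property> )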
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > > On Thu, Jun 27, 2013 at 10:30 AM, h b <[email protected]> wrote:
>> > > > >> > > >
>> > > > >> > > > > Ok, I ran ant, ant jar and ant job, and that seems to have
>> > > > >> > > > > picked up the config changes.
>> > > > >> > > > > Now the inject output shows that it is using AvroStore as
>> > > > >> > > > > the Gora storage class.
>> > > > >> > > > >
>> > > > >> > > > > Now I am getting a NullPointerException:
>> > > > >> > > > >
>> > > > >> > > > > java.lang.NullPointerException
>> > > > >> > > > >         at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
>> > > > >> > > > >         at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
>> > > > >> > > > >         at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
>> > > > >> > > > >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
>> > > > >> > > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>> > > > >> > > > >         at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>> > > > >> > > > >         at java.security.AccessController.doPrivileged(Native Method)
>> > > > >> > > > >         at javax.security.auth.Subject.doAs(Subject.java:396)
>> > > > >> > > > >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
>> > > > >> > > > >         at org.apache.hadoop.mapred.Child.main(Child.java:264)
>> > > > >> > > > >
>> > > > >> > > > > This does not look Nutch-related. I will work on this and
>> > > > >> > > > > write back if I get stuck on something else, or if I succeed.
>> > > > >> > > > >
>> > > > >> > > > >
>> > > > >> > > > > On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]> wrote:
>> > > > >> > > > >
>> > > > >> > > > >> Hi Lewis,
>> > > > >> > > > >>
>> > > > >> > > > >> Sorry for missing that one. So I updated the top level conf
>> > > > >> > > > >> and rebuilt the job:
>> > > > >> > > > >>
>> > > > >> > > > >> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
>> > > > >> > > > >>
>> > > > >> > > > >> ......
>> > > > >> > > > >>   <property>
>> > > > >> > > > >>     <name>storage.data.store.class</name>
>> > > > >> > > > >>     <value>org.apache.gora.avro.store.AvroStore</value>
>> > > > >> > > > >>   </property>
>> > > > >> > > > >> ......
>> > > > >> > > > >>
>> > > > >> > > > >> cd ~/nutch/apache-nutch-2.2/
>> > > > >> > > > >> ant job
>> > > > >> > > > >> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
>> > > > >> > > > >>
>> > > > >> > > > >>
>> > > > >> > > > >> bin/nutch inject urls -crawlId crawl1
>> > > > >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at 2013-06-27 17:12:01
>> > > > >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls
>> > > > >> > > > >> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
>> > > > >> > > > >>
>> > > > >> > > > >> It still shows me MemStore.
>> > > > >> > > > >>
>> > > > >> > > > >> In the jobtracker I see that the [crawl1]inject urls job
>> > > > >> > > > >> does not have a urls_injected counter.
>> > > > >> > > > >> I have *db.score.injected* = 1.0, but I don't think that says
>> > > > >> > > > >> anything about the urls injected.
>> > > > >> > > > >>
>> > > > >> > > > >>
>> > > > >> > > > >>
>> > > > >> > > > >> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <[email protected]> wrote:
>> > > > >> > > > >>
>> > > > >> > > > >>> Hi,
>> > > > >> > > > >>> Please re-read my mail.
>> > > > >> > > > >>> If you are using the deploy directory, e.g. running on a
>> > > > >> > > > >>> hadoop cluster, then make sure to edit nutch-site.xml within
>> > > > >> > > > >>> the top level conf directory, _not_ the conf directory in
>> > > > >> > > > >>> runtime/local.
>> > > > >> > > > >>> If you look at the ant runtime target in the build script,
>> > > > >> > > > >>> you will see the code which generates the runtime directory
>> > > > >> > > > >>> structure.
>> > > > >> > > > >>> Make changes to conf/nutch-site.xml, build the job jar,
>> > > > >> > > > >>> navigate to runtime/deploy, run the code.
>> > > > >> > > > >>> It is easier to make the job jar and scripts in deploy
>> > > > >> > > > >>> available to the job tracker.
>> > > > >> > > > >>> You also didn't comment on the counters for the inject job.
>> > > > >> > > > >>> Do you see any?
>> > > > >> > > > >>> Best
>> > > > >> > > > >>> Lewis
>> > > > >> > > > >>>
>> > > > >> > > > >>> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>> > > > >> > > > >>> > Here is an example of what I am saying about the config
>> > > > >> > > > >>> > changes not taking effect.
>> > > > >> > > > >>> >
>> > > > >> > > > >>> > cd runtime/deploy
>> > > > >> > > > >>> > cat ../local/conf/nutch-site.xml
>> > > > >> > > > >>> > ......
>> > > > >> > > > >>> >
>> > > > >> > > > >>> >   <property>
>> > > > >> > > > >>> >     <name>storage.data.store.class</name>
>> > > > >> > > > >>> >
>> <value>org.apache.gora.avro.store.AvroStore</value>
>> > > > >> > > > >>> >   </property>
>> > > > >> > > > >>> > .....
>> > > > >> > > > >>> >
>> > > > >> > > > >>> > cd ../..
>> > > > >> > > > >>> >
>> > > > >> > > > >>> > ant job
>> > > > >> > > > >>> >
>> > > > >> > > > >>> > cd runtime/deploy
>> > > > >> > > > >>> > bin/nutch inject urls -crawlId crawl1
>> > > > >> > > > >>> > .....
>> > > > >> > > > >>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
>> > > > >> > > > >>> > .....
>> > > > >> > > > >>> >
>> > > > >> > > > >>> > So the nutch-site.xml was changed to use AvroStore as the
>> > > > >> > > > >>> > storage class and the job was rebuilt, and I reran inject,
>> > > > >> > > > >>> > the output of which still shows that it is trying to use
>> > > > >> > > > >>> > MemStore.
>> > > > >> > > > >>> >
>> > > > >> > > > >>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <[email protected]> wrote:
>> > > > >> > > > >>> >
>> > > > >> > > > >>> >> The Gora MemStore was introduced to deal predominantly
>> > > > >> > > > >>> >> with test scenarios.
>> > > > >> > > > >>> >> This is justified as the 2.x code is pulled and tested
>> > > > >> > > > >>> >> nightly and after every commit.
>> > > > >> > > > >>> >> It is not thread safe and should not be used (until we
>> > > > >> > > > >>> >> fix some issues) for any kind of serious deployment.
>> > > > >> > > > >>> >> From your inject task on the job tracker, you will be
>> > > > >> > > > >>> >> able to see the 'urls_injected' counter, which represents
>> > > > >> > > > >>> >> the number of urls actually persisted through Gora into
>> > > > >> > > > >>> >> the datastore.
>> > > > >> > > > >>> >> I understand that HBase is not an option. Gora should
>> > > > >> > > > >>> >> also support writing the output into Avro sequence
>> > > > >> > > > >>> >> files... which can be pumped into hdfs. We have done some
>> > > > >> > > > >>> >> work on this, so I suppose that right now is as good a
>> > > > >> > > > >>> >> time as any for you to try it out.
>> > > > >> > > > >>> >> Set the default datastore to
>> > > > >> > > > >>> >> org.apache.gora.avro.store.AvroStore, I think.
>> > > > >> > > > >>> >> You can double check by looking into gora.properties.
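>> > > > >> > > > >>> >> Off the top of my head, the relevant line in
>> > > > >> > > > >>> >> conf/gora.properties should look something like this
>> > > > >> > > > >>> >> (double check the property name against the comments in
>> > > > >> > > > >>> >> the file itself):
>> > > > >> > > > >>> >>
>> > > > >> > > > >>> >> gora.datastore.default=org.apache.gora.avro.store.AvroStore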
>> > > > >> > > > >>> >> As a note, you should use nutch-site.xml within the top
>> > > > >> > > > >>> >> level conf directory for all your Nutch configuration.
>> > > > >> > > > >>> >> You should then create a new job jar for use in hadoop by
>> > > > >> > > > >>> >> calling 'ant job' after the changes are made.
>> > > > >> > > > >>> >> hth
>> > > > >> > > > >>> >> Lewis
>> > > > >> > > > >>> >>
>> > > > >> > > > >>> >> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>> > > > >> > > > >>> >> > The quick responses flowing in are very encouraging.
>> > > > >> > > > >>> >> > Thanks Tejas.
>> > > > >> > > > >>> >> > Tejas, as I mentioned earlier, I actually ran it step
>> > > > >> > > > >>> >> > by step.
>> > > > >> > > > >>> >> >
>> > > > >> > > > >>> >> > So first I ran the inject command and then readdb with
>> > > > >> > > > >>> >> > the dump option, and did not see anything in the dump
>> > > > >> > > > >>> >> > files; that leads me to say that the inject did not
>> > > > >> > > > >>> >> > work. I verified the regex-urlfilter and made sure that
>> > > > >> > > > >>> >> > my url is not getting filtered.
>> > > > >> > > > >>> >> >
>> > > > >> > > > >>> >> > I agree that the second link is about configuring
>> > > > >> > > > >>> >> > HBase as a storage DB. However, I do not have HBase
>> > > > >> > > > >>> >> > installed and don't foresee getting it installed any
>> > > > >> > > > >>> >> > time soon, hence using HBase for storage is not an
>> > > > >> > > > >>> >> > option, so I am going to have to stick to Gora with
>> > > > >> > > > >>> >> > the memory store.
>> > > > >> > > > >>> >> >
>> > > > >> > > > >>> >> >
>> > > > >> > > > >>> >> >
>> > > > >> > > > >>> >> >
>> > > > >> > > > >>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:
>> > > > >> > > > >>> >> >
>> > > > >> > > > >>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
>> > > > >> > > > >>> >> >>
>> > > > >> > > > >>> >> >> > Thanks for the response Lewis.
>> > > > >> > > > >>> >> >> > I did read these links. I mostly followed the first
>> > > > >> > > > >>> >> >> > link and tried both the 3.2 and 3.3 sections. Using
>> > > > >> > > > >>> >> >> > bin/crawl gave me a null pointer exception on solr,
>> > > > >> > > > >>> >> >> > so I figured that I should first deal with getting
>> > > > >> > > > >>> >> >> > the crawl part to work and then deal with solr
>> > > > >> > > > >>> >> >> > indexing. Hence I went back to trying it stepwise.
>> > > > >> > > > >>> >> >> >
>> > > > >> > > > >>> >> >>
>> > > > >> > > > >>> >> >> You should try running the crawl using individual
>> > > > >> > > > >>> >> >> commands and see where the problem is. The nutch
>> > > > >> > > > >>> >> >> tutorial which Lewis pointed you to has those
>> > > > >> > > > >>> >> >> commands. Even peeking into the bin/crawl script
>> > > > >> > > > >>> >> >> would also help, as it calls the nutch commands.
>> > > > >> > > > >>> >> >>
>> > > > >> > > > >>> >> >> >
>> > > > >> > > > >>> >> >> > As for the second link, it is more about using
>> > > > >> > > > >>> >> >> > HBase as the store instead of gora. This is not
>> > > > >> > > > >>> >> >> > really an option for me yet, because my grid does
>> > > > >> > > > >>> >> >> > not have hbase installed yet. Getting it installed
>> > > > >> > > > >>> >> >> > is not much under my control.
>> > > > >> > > > >>> >> >>
>> > > > >> > > > >>> >> >> HBase is one of the datastores supported by Apache
>> > > > >> > > > >>> >> >> Gora. That tutorial speaks about how to configure
>> > > > >> > > > >>> >> >> Nutch (actually Gora) to use HBase as a backend. So,
>> > > > >> > > > >>> >> >> it's wrong to say that the tutorial was about HBase
>> > > > >> > > > >>> >> >> and not Gora.
>> > > > >> > > > >>> >> >>
>> > > > >> > > > >>> >> >> >
>> > > > >> > > > >>> >> >> > The FAQ link is the one I had not gone through
>> > > > >> > > > >>> >> >> > until I checked your response, but I do not find
>> > > > >> > > > >>> >> >> > answers to any of my questions (directly or
>> > > > >> > > > >>> >> >> > indirectly) in it.
>> > > > >> > > > >>> >> >> >
>> > > > >> > > > >>> >> >>
>> > > > >> > > > >>> >> >> Ok
>> > > > >> > > > >>> >> >>
>> > > > >> > > > >>> >> >> >
>> > > > >> > > > >>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
>> > > > >> > > > >>> >> >> > > *Lewis*
>> > > > >> > > > >>>
>> > > > >> > > > >>> --
>> > > > >> > > > >>> *Lewis*
>> > > > >> > > > >>>
>> > > > >> > > > >>
>> > > > >> > > > >>
>> > > > >> > > > >
>> > > > >> > > >
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
