Got errors when running this command:

./crawl ../urls/ TestCrawler http://localhost:8983/solr 1

Any idea where to go from here?

Thank you,
Tom

ParserJob: finished at 2016-02-22 20:22:47, time elapsed: 00:00:11
CrawlDB update for TestCrawler
/home/nutch/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1456190521-30847 -crawlId TestCrawler
DbUpdaterJob: starting at 2016-02-22 20:22:49
DbUpdaterJob: batchId: 1456190521-30847
DbUpdaterJob: finished at 2016-02-22 20:22:58, time elapsed: 00:00:09
Indexing TestCrawler on SOLR index -> http://localhost:8983/solr
/home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[TestCrawler]Indexer, jobid=job_local1592190856_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
Error running: /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
Failed with exit value 255.

On Mon, Feb 22, 2016 at 1:58 AM, Binoy Dalal <[email protected]> wrote:

> CrawlID: put any number
> Number of rounds: >= 1
> Seed dir and solr URL: what you have is proper
>
> On Mon, 22 Feb 2016, 12:17 Tom Running <[email protected]> wrote:
>
> > Binoy,
> >
> > I do see the information on the console and also a lot of information
> > in HBase.
> >
> > I tried ./crawl but am not quite sure how to determine the following
> > arguments:
> >
> > Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
> >
> > seedDir: ../urls/seed.txt ?
> > crawlID: ?
> > solrUrl: I am guessing this will be http://localhost:8983/solr/
> > numberOfRounds: ?
> >
> > Could you provide some advice on how to determine the above
> > information?
> >
> > Thanks,
> > Tom
> >
> > On Mon, Feb 22, 2016 at 1:19 AM, Binoy Dalal <[email protected]>
> > wrote:
> >
> > > When you run the inject and generate commands, do you see your site
> > > being added in the console output?
> > > Also, while fetching and parsing you should be able to see the
> > > number of successful fetch and parse actions in your console.
> > > Ideally this should be equal to or more than the number of sites
> > > you've put in the seed.txt file.
> > > If this is not the case, then there is some issue with either your
> > > seed.txt file or the regex-urlfilter file.
> > >
> > > While running the crawl command, you don't need to index to Solr
> > > separately. The command will do it for you.
> > > Run ./crawl to see usage instructions.
> > >
> > > On Mon, 22 Feb 2016, 11:41 Tom Running <[email protected]>
> > > wrote:
> > >
> > > > Yes, I did run these before running ./nutch solrindex
> > > > http://localhost:8983/solr/ -all and got nothing.
> > > >
> > > > From /home/nutch/runtime/local/bin/:
> > > >
> > > > ./nutch inject ../urls/seed.txt
> > > > ./nutch readdb
> > > > ./nutch generate -topN 2500
> > > > ./nutch fetch -all
> > > > ./nutch parse -all
> > > > ./nutch updatedb
> > > >
> > > > I did not run the crawl command.
> > > >
> > > > Would I just run ./crawl, then run this again:
> > > > ./nutch solrindex http://localhost:8983/solr/ -all
> > > >
> > > > Thank you very much for responding to my questions.
> > > >
> > > > Tom
> > > >
> > > > On Sun, Feb 21, 2016 at 11:25 PM, Binoy Dalal
> > > > <[email protected]> wrote:
> > > >
> > > > > Just to be clear, you did run the preceding nutch commands to
> > > > > inject, generate, fetch and parse the URLs, right?
> > > > >
> > > > > Additionally, try the ./crawl command to directly crawl and
> > > > > index everything to Solr without having to manually run all the
> > > > > steps.
> > > > >
> > > > > On Mon, 22 Feb 2016, 07:24 Tom Running <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > I am trying to get Nutch to run solrindex and am having a
> > > > > > problem. I am using the instructions from this document:
> > > > > > http://wiki.apache.org/nutch/Nutch2Tutorial. Everything is
> > > > > > working except when I ran the following command:
> > > > > >
> > > > > > ./nutch solrindex http://localhost:8983/solr -all
> > > > > >
> > > > > > It came back with the following info; it seems to have a
> > > > > > problem with indexing:
> > > > > >
> > > > > > IndexingJob: starting
> > > > > > Active IndexWriters :
> > > > > > SOLRIndexWriter
> > > > > >   solr.server.url : URL of the SOLR instance (mandatory)
> > > > > >   solr.commit.size : buffer size when sending to SOLR (default 1000)
> > > > > >   solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > > > > >   solr.auth : use authentication (default false)
> > > > > >   solr.auth.username : username for authentication
> > > > > >   solr.auth.password : password for authentication
> > > > > > IndexingJob: done.
> > > > > >
> > > > > > When I launch the Solr web UI, I cannot query or find
> > > > > > anything under the default collection1 or under
> > > > > > gettingstarted_shard1_replica1 or
> > > > > > gettingstarted_shard2_replica1.
> > > > > >
> > > > > > I have also tried this option (with collection1) and am still
> > > > > > not able to query anything:
> > > > > > ./nutch solrindex http://localhost:8983/solr/collection1 -all
> > > > > >
> > > > > > I downloaded Solr 4.10.3 and started it as is with the
> > > > > > command:
> > > > > > /home/solr/bin/solr start -e cloud -noprompt
> > > > > >
> > > > > > I did not modify any configuration file, nor post any file or
> > > > > > directory from within Solr. I am assuming the command ./nutch
> > > > > > solrindex http://localhost:8983/solr/collection1 will do all
> > > > > > the posting and indexing for Solr.
> > > > > >
> > > > > > Any idea what I am missing here? Any advice on where to go
> > > > > > from here would be greatly appreciated.
> > > > > >
> > > > > > I did try copying /nutch/runtime/local/conf/*.* into Solr and
> > > > > > it did not make any difference.
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > Tom
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Binoy Dalal
> > >
> > > --
> > > Regards,
> > > Binoy Dalal
>
> --
> Regards,
> Binoy Dalal
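For anyone landing on this thread later, the usage string discussed above, crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>, maps onto the values in this thread as sketched below. This is a sketch, not a verified fix: the collection1 core name is an assumption (a Solr 4.x "-e cloud" setup may instead expose a gettingstarted collection), and pointing solrUrl at a concrete core rather than the bare http://localhost:8983/solr container URL is offered only as one plausible cause of indexing failures like the exit value 255 shown above.

```shell
# Argument mapping for: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
# Values taken from this thread; "collection1" is an assumed core name.
SEED_DIR=../urls/                                  # directory holding seed.txt
CRAWL_ID=TestCrawler                               # any identifier; keys the crawl data in HBase
SOLR_URL=http://localhost:8983/solr/collection1    # a concrete core/collection, not the bare container URL
ROUNDS=1                                           # generate/fetch/parse/updatedb cycles

# One-shot crawl (inject, generate, fetch, parse, updatedb, index in one go):
echo "./crawl $SEED_DIR $CRAWL_ID $SOLR_URL $ROUNDS"

# Sanity check afterwards: ask Solr how many documents the index holds.
echo "curl '$SOLR_URL/select?q=*:*&rows=0&wt=json'"
```

The echo lines only print the commands so the mapping is visible; drop the echos to actually run the crawl and the query once Nutch and Solr are up.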

