There are some issues with your schema.xml file for Solr.
Did you copy the schema file from Nutch's conf directory over to your Solr
core's conf directory?

As you can see from the unknown-field error, the field meta_description is
missing from your Solr schema, so when a document containing this field is
sent to Solr for indexing, Solr doesn't recognise the meta_description field
and throws an error.

So either create this field in your schema, or copy Nutch's schema file into
your Solr core's conf directory.
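
For reference, a minimal field declaration of the kind the schema would need. The field type here is an assumption on my part, so match it to whatever text type Nutch's bundled schema actually uses:

```xml
<!-- Goes inside the <fields> section of the Solr core's schema.xml.
     "text_general" is an assumed type; use the text type your schema
     already defines for metadata fields. -->
<field name="meta_description" type="text_general" indexed="true" stored="true"/>
```

After adding the field (or copying over Nutch's schema.xml), reload the core (e.g. the /admin/cores?action=RELOAD request you can see in your Solr log) and re-run the indexing job.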

On Tue, 23 Feb 2016, 14:08 Tom Running <runningt...@gmail.com> wrote:

> Here are the Solr and Nutch log files.  I found a few errors in these logs
> but am not quite sure how to fix them.
> Perhaps I don't quite understand how Nutch, Solr and HBase work together;
> that's why it is so difficult to get them to work together correctly.
>
> How do these three packages work together?  This is how I understand
> it.  Please correct me if I am not on the right track.
>
> Use Nutch 2.3.1 to crawl data, using HBase (0.98.8) as the database for
> Nutch's db and Nutch's crawl content, then use nutch solrindex
> http://localhost:8983/solr/ -all      to tell Solr (4.10.3) to index
> Nutch's crawl data that resides in the HBase database?
>
> How does Solr know where to get its data?  In this case, the data that we
> want Solr to use is in the HBase table.  Do I have to perform a POST that
> points to HBase or some such?
>
>
> Thank you for looking into this problem.  I am going crazy (-:
>
>
>
> ***********  SOLR's log file **************************
>
> INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener;
> QuerySenderListener sending requests to Searcher@6ff86671[collection1]
> main{StandardDirectoryReader(segments_1:1:nrt)}
> INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener;
> QuerySenderListener done.
> INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.SolrCore;
> [collection1] Registered new searcher Searcher@6ff86671[collection1]
> main{StandardDirectoryReader(segments_1:1:nrt)}
> INFO  - 2016-02-23 02:47:15.952; org.apache.solr.core.SolrCore;
> [collection1] Closing main searcher on request.
> INFO  - 2016-02-23 02:47:15.953;
> org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null
> path=/admin/cores params={action=RELOAD&_=1456213620483&core=collection1&wt=json}
> status=0 QTime=1375
> INFO  - 2016-02-23 02:47:15.986;
> org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null
> path=/admin/cores params={_=1456213621893&wt=json} status=0 QTime
> =1
> [root@localhost logs]#
> 1959432 [qtp969637605-11] INFO
> org.apache.solr.update.processor.LogUpdateProcessor - [collection1]
> webapp=/solr path=/update params={wt=javabin&version=2} {} 0 55
> 1959433 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore -
> org.apache.solr.common.SolrException:
> * ERROR: [doc=com.alco.www:http] unknown field 'meta_description'        at *
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
>         at
>
> org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
>
>
>         at
>
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>         at
>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>         at
>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>         at java.lang.Thread.run(Thread.java:745)
>
> 1959465 [qtp969637605-11] INFO
> org.apache.solr.update.processor.LogUpdateProcessor - [collection1]
> webapp=/solr path=/update params={wt=javabin&version=2} {} 0 2
> 1959466 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore -
> org.apache.solr.common.SolrException:
>
>
> *ERROR: [doc=com.alco.www:http] unknown field 'meta_description'        at
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
> at
>
> org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)*
>
>
>
>
>
> ****************************************************************
>
> NUTCH (hadoop.log)
>
> 2016-02-23 02:58:40,659 INFO  anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.metadata.MetadataIndexer
> 2016-02-23 02:58:40,701
> * WARN  store.HBaseStore - Mismatching schema's names. Mappingfile schema:
> 'webpage'. PersistentClass schema's name: 'webpage_webpage'Assuming they
> are the same.*
> 2016-02-23 02:58:41,158 INFO  solr.SolrIndexWriter - Adding 1 documents
> 2016-02-23 02:58:41,574 INFO  solr.SolrIndexWriter - Adding 1 documents
> 2016-02-23 02:58:41,614 WARN  mapred.LocalJobRunner -
> job_local1682084779_0001
> java.lang.Exception:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> *ERROR: [doc=com.galco.www:http] unknown field 'meta_description'        at
> *
>
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
> Caused by:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
>         at
>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
>         at
>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
>         at
>
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
>         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
>         at
>
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:97)
>         at
> org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
>         at
>
> org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
>         at
>
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>         at
>
> org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-23 02:58:41,653* ERROR indexer.IndexingJob - SolrIndexerJob:
> java.lang.RuntimeException: job failed: name=[webpage]Indexer,
> jobid=job_local1682084779_0001*
>         at
> org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
>
>
>
> On Mon, Feb 22, 2016 at 9:52 PM, Binoy Dalal <binoydala...@gmail.com>
> wrote:
>
> > What errors do you see in hadoop.log and solr's solr.log?
> > Post that stack trace.
> >
> > On Tue, 23 Feb 2016, 07:29 Tom Running <runningt...@gmail.com> wrote:
> >
> > > I got errors when running this command:
> > >  ./crawl ../urls/ TestCrawler http://localhost:8983/solr 1
> > > Any idea where to go from here?
> > >
> > > thank you.
> > > Tom
> > >
> > >
> > >
> > > ParserJob: finished at 2016-02-22 20:22:47, time elapsed: 00:00:11
> > > CrawlDB update for TestCrawler
> > > /home/nutch/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2
> -D
> > > mapred.child.java.opts=-Xmx1000m -D
> > > mapred.reduce.tasks.speculative.execution=false -D
> > > mapred.map.tasks.speculative.execution=false -D
> > > mapred.compress.map.output=true 1456190521-30847 -crawlId TestCrawler
> > > DbUpdaterJob: starting at 2016-02-22 20:22:49
> > > DbUpdaterJob: batchId: 1456190521-30847
> > > DbUpdaterJob: finished at 2016-02-22 20:22:58, time elapsed: 00:00:09
> > > Indexing TestCrawler on SOLR index -> http://localhost:8983/solr
> > > /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D
> > > mapred.child.java.opts=-Xmx1000m -D
> > > mapred.reduce.tasks.speculative.execution=false -D
> > > mapred.map.tasks.speculative.execution=false -D
> > > mapred.compress.map.output=true -D solr.server.url=
> > > http://localhost:8983/solr -all -crawlId TestCrawler
> > > IndexingJob: starting
> > > SolrIndexerJob: java.lang.RuntimeException: job failed:
> > > name=[TestCrawler]Indexer, jobid=job_local1592190856_0001
> > >         at
> > > org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> > >         at
> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> > >         at
> > org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> > >         at
> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> > >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > >         at
> > org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> > >
> > >
> > >
> > >
> > > *Error running:  /home/nutch/runtime/local/bin/nutch index -D
> > > mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> > > mapred.reduce.tasks.speculative.execution=false -D
> > > mapred.map.tasks.speculative.execution=false -D
> > > mapred.compress.map.output=true -D
> > > solr.server.url=http://localhost:8983/solr
> > > -all -crawlId TestCrawler
> > > Failed with exit value 255.*
> > >
> > > On Mon, Feb 22, 2016 at 1:58 AM, Binoy Dalal <binoydala...@gmail.com>
> > > wrote:
> > >
> > > > crawlID: put any name/number
> > > > Number of rounds: >= 1
> > > > Your seed dir and Solr URL are correct
> > > >
> > > > On Mon, 22 Feb 2016, 12:17 Tom Running <runningt...@gmail.com>
> wrote:
> > > >
> > > > > Binoy,
> > > > >
> > > > > I do see the information on the console and also lot of information
> > in
> > > > > hbase.
> > > > >
> > > > > I tried ./crawl but am not quite sure where to find the following
> > > > > information:
> > > > >
> > > > > Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
> > > > >
> > > > > seedDir   ../urls/seed.txt  ?
> > > > > crawID  ?
> > > > > solrUrl     I am guessing this will be http://localhost:8983/solr/
> > > > > numberOfRounds  ?
> > > > >
> > > > > Could you provide some advice on how to determine the above
> > > > > information?
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Tom
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Feb 22, 2016 at 1:19 AM, Binoy Dalal <
> binoydala...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > When you run the inject and generate commands, in the console
> > output
> > > do
> > > > > you
> > > > > > see your site being added?
> > > > > > Also while fetching and parsing you should be able to see the
> > number
> > > of
> > > > > > successful fetches and parse actions in your console. Ideally
> this
> > > > should
> > > > > > be equal to or more than the number of sites you've put in the
> > > seed.txt
> > > > > > file.
> > > > > > If this is not the case then there is some issue with either your
> > > > > seed.txt
> > > > > > file or the regex-urlfilter file.
> > > > > >
> > > > > > While running the crawl command, you don't need to index to Solr
> > > > > > separately. The command will do it for you.
> > > > > > Run ./crawl to see the usage instructions.
> > > > > >
> > > > > > On Mon, 22 Feb 2016, 11:41 Tom Running <runningt...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Yes, I did run these before running ./nutch solrindex
> > > > > > > http://localhost:8983/solr/ -all and got nothing.
> > > > > > >
> > > > > > >
> > > > > > > From /home/nutch/runtime/local/bin/
> > > > > > >
> > > > > > > ./nutch inject ../urls/seed.txt
> > > > > > > ./nutch readdb
> > > > > > > ./nutch generate -topN 2500
> > > > > > > ./nutch fetch -all
> > > > > > > ./nutch parse -all
> > > > > > > ./nutch updatedb
> > > > > > >
> > > > > > > Did not run the crawl command.
> > > > > > >
> > > > > > > Would I just run ./crawl ??
> > > > > > > then run this again ./nutch solrindex
> > http://localhost:8983/solr/
> > > > -all
> > > > > > >
> > > > > > > Thank you very much for response to my questions.
> > > > > > >
> > > > > > > Tom
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Feb 21, 2016 at 11:25 PM, Binoy Dalal <
> > > > binoydala...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Just to be clear, you did run the preceding nutch commands to
> > > > inject,
> > > > > > > > generate, fetch and parse the URLs right?
> > > > > > > >
> > > > > > > > Additionally try with the ./crawl command to directly crawl
> and
> > > > index
> > > > > > > > everything to solr without having to manually run all the
> > steps.
> > > > > > > >
> > > > > > > > On Mon, 22 Feb 2016, 07:24 Tom Running <
> runningt...@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > I am trying to get Nutch to run solrindex and am having a
> > > > > > > > > problem. I am using the instructions from this document:
> > > > > > > > > http://wiki.apache.org/nutch/Nutch2Tutorial. Everything
> > > > > > > > > worked except when I ran the following command.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *./nutch solrindex http://localhost:8983/solr <
> > > > > > > > http://localhost:8983/solr>
> > > > > > > > > -all*
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ****** it came back with the following info  *****
> > > > > > > > > ****** It seems to have problem with indexing ****
> > > > > > > > > IndexingJob: starting
> > > > > > > > > Active IndexWriters :
> > > > > > > > > SOLRIndexWriter
> > > > > > > > >         solr.server.url : URL of the SOLR instance
> > (mandatory)
> > > > > > > > >         solr.commit.size : buffer size when sending to SOLR
> > > > > (default
> > > > > > > > 1000)
> > > > > > > > >         solr.mapping.file : name of the mapping file for
> > fields
> > > > > > > (default
> > > > > > > > > solrindex-mapping.xml)
> > > > > > > > >         solr.auth : use authentication* (default false)*
> > > > > > > > >         solr.auth.username : username for authentication
> > > > > > > > >         solr.auth.password : password for authentication
> > > > > > > > > IndexingJob: done.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > When I launch the Solr Web UI, I cannot query or find
> > > > > > > > > anything under the default collection1 or
> > > > > > > > > gettingstarted_shard1_replica1 or
> > > > > > > > > gettingstarted_shard2_replica1.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I have also tried with this option (with collection1) and
> > > > > > > > > am still not able to query anything:
> > > > > > > > > ./nutch solrindex http://localhost:8983/solr/collection1 -all
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I downloaded Solr 4.10.3 and started it as-is with the command
> > > > > > > > > /home/solr/bin/solr start -e cloud -noprompt
> > > > > > > > >
> > > > > > > > > I did not modify any configuration files or post any files or
> > > > > > > > > directories from within Solr. I am assuming the command
> > > > > > > > > ./nutch solrindex http://localhost:8983/solr/collection1
> > > > > > > > > will do all the posting and indexing for Solr.
> > > > > > > > >
> > > > > > > > > Any idea what I am missing here? Any advice on where to go
> > > > > > > > > from here would be greatly appreciated.
> > > > > > > > >
> > > > > > > > > I did try copying /nutch/runtime/local/conf/*.* into Solr
> > > > > > > > > and it did not make any difference.
> > > > > > > > >
> > > > > > > > > Thank you.
> > > > > > > > >
> > > > > > > > > Tom
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > Regards,
> > > > > > > > Binoy Dalal
> > > > > > > >
> > > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Binoy Dalal
> > > > > >
> > > > >
> > > > --
> > > > Regards,
> > > > Binoy Dalal
> > > >
> > >
> > --
> > Regards,
> > Binoy Dalal
> >
>
-- 
Regards,
Binoy Dalal
