Binoy, I copied over the schema.xml file for Solr from Nutch's config dir, re-ran the command, and it works.
Thank you very much for your help. Your advice is greatly appreciated.

Tom

On Tue, Feb 23, 2016 at 3:55 AM, Binoy Dalal <[email protected]> wrote:
> There are some issues with your schema.xml file for Solr.
> Did you copy over the schema file from Nutch's config dir to your Solr
> core's conf?
>
> As you can see from the unknown field error, the field meta_description is
> missing from your Solr schema, so when a document with this field is
> indexed, Solr doesn't recognise the meta_description field and throws an
> error.
>
> So either create this field or copy over Nutch's schema file to your Solr
> core's conf.
>
> On Tue, 23 Feb 2016, 14:08 Tom Running <[email protected]> wrote:
> > Here are the Solr and Nutch log files. I found a few errors in these
> > logs but am not quite sure how to fix them.
> > Perhaps I do not quite understand how Nutch, Solr and HBase work
> > together; that is why it is so difficult to get them to work together
> > correctly.
> >
> > How do these three packages work together? This is how I understand it.
> > Please correct me if I am not on the right track.
> >
> > Use Nutch 2.3.1 to crawl data, using HBase (0.98.8) as the database for
> > Nutch's db and Nutch's crawled content, then use nutch solrindex
> > http://localhost:8983/solr/ -all to tell Solr (4.10.3) to index
> > Nutch's crawl data that resides in the HBase database?
> >
> > How does Solr know where to get its data? In this case, the data that we
> > want Solr to use is in the HBase table. Do I have to perform a POST that
> > points to HBase or something of that sort?
> >
> > Thank you for looking into this problem.
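For the archive: creating the missing meta_description field by hand would mean adding a definition along these lines to the Solr core's conf/schema.xml. This is a sketch only; the field type used here is an assumption, and the authoritative definition is the one in the schema.xml that ships with Nutch.

```xml
<!-- Sketch of the missing field definition for the Solr core's schema.xml.
     The type "text_general" is an assumption; copy the exact name, type,
     and attributes from Nutch's conf/schema.xml to be safe. -->
<field name="meta_description" type="text_general" indexed="true" stored="true"/>
```

After editing schema.xml, the core has to be reloaded (or Solr restarted) for the new field to take effect.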
I am going crazy (-:
> >
> > *********** SOLR's log file **************************
> >
> > INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener; QuerySenderListener sending requests to Searcher@6ff86671[collection1] main{StandardDirectoryReader(segments_1:1:nrt)}
> > INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener; QuerySenderListener done.
> > INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.SolrCore; [collection1] Registered new searcher Searcher@6ff86671[collection1] main{StandardDirectoryReader(segments_1:1:nrt)}
> > INFO  - 2016-02-23 02:47:15.952; org.apache.solr.core.SolrCore; [collection1] Closing main searcher on request.
> > INFO  - 2016-02-23 02:47:15.953; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={action=RELOAD&_=1456213620483&core=collection1&wt=json} status=0 QTime=1375
> > INFO  - 2016-02-23 02:47:15.986; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={_=1456213621893&wt=json} status=0 QTime=1
> > [root@localhost logs]# 1959432 [qtp969637605-11] INFO org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 55
> > 1959433 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException:
> > ERROR: [doc=com.alco.www:http] unknown field 'meta_description'
> >         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
> >         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
> >         at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> >         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> >         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > 1959465 [qtp969637605-11] INFO org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 2
> > 1959466 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException:
> > ERROR: [doc=com.alco.www:http] unknown field 'meta_description'
> >         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
> >         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
> >
> > ****************************************************************
> >
> > NUTCH (hadoop.log)
> >
> > 2016-02-23 02:58:40,659 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> > 2016-02-23 02:58:40,701 WARN  store.HBaseStore - Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: 'webpage_webpage'. Assuming they are the same.
> > 2016-02-23 02:58:41,158 INFO  solr.SolrIndexWriter - Adding 1 documents
> > 2016-02-23 02:58:41,574 INFO  solr.SolrIndexWriter - Adding 1 documents
> > 2016-02-23 02:58:41,614 WARN  mapred.LocalJobRunner - job_local1682084779_0001
> > java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
> > Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:97)
> >         at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
> >         at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
> >         at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > 2016-02-23 02:58:41,653 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException: job failed: name=[webpage]Indexer, jobid=job_local1682084779_0001
> >         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> >         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> >
> > On Mon, Feb 22, 2016 at 9:52 PM, Binoy Dalal <[email protected]> wrote:
> > > What errors do you see in hadoop.log and solr's solr.log?
> > > Post that stack trace.
> > >
> > > On Tue, 23 Feb 2016, 07:29 Tom Running <[email protected]> wrote:
> > > > I got errors when running this command:
> > > > ./crawl ../urls/ TestCrawler http://localhost:8983/solr 1
> > > > Any idea where to go from here?
> > > >
> > > > Thank you.
> > > > Tom
> > > >
> > > > ParserJob: finished at 2016-02-22 20:22:47, time elapsed: 00:00:11
> > > > CrawlDB update for TestCrawler
> > > > /home/nutch/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1456190521-30847 -crawlId TestCrawler
> > > > DbUpdaterJob: starting at 2016-02-22 20:22:49
> > > > DbUpdaterJob: batchId: 1456190521-30847
> > > > DbUpdaterJob: finished at 2016-02-22 20:22:58, time elapsed: 00:00:09
> > > > Indexing TestCrawler on SOLR index -> http://localhost:8983/solr
> > > > /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
> > > > IndexingJob: starting
> > > > SolrIndexerJob: java.lang.RuntimeException: job failed: name=[TestCrawler]Indexer, jobid=job_local1592190856_0001
> > > >         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> > > >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> > > >         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> > > >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> > > >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > > >         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> > > >
> > > > Error running: /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
> > > > Failed with exit value 255.
> > > >
> > > > On Mon, Feb 22, 2016 at 1:58 AM, Binoy Dalal <[email protected]> wrote:
> > > > > CrawlID: put any name
> > > > > Number of rounds: >=1
> > > > > Seed dir and Solr URL are correct as you have them
> > > > >
> > > > > On Mon, 22 Feb 2016, 12:17 Tom Running <[email protected]> wrote:
> > > > > > Binoy,
> > > > > >
> > > > > > I do see the information on the console and also a lot of information in hbase.
> > > > > >
> > > > > > I tried ./crawl but am not quite sure where to locate the following information:
> > > > > >
> > > > > > Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
> > > > > >
> > > > > > seedDir          ../urls/seed.txt ?
> > > > > > crawlID          ?
> > > > > > solrUrl          I am guessing this will be http://localhost:8983/solr/
> > > > > > numberOfRounds   ?
> > > > > >
> > > > > > Could you provide some advice on how to determine the above information?
> > > > > >
> > > > > > Thanks,
> > > > > > Tom
> > > > > >
> > > > > > On Mon, Feb 22, 2016 at 1:19 AM, Binoy Dalal <[email protected]> wrote:
> > > > > > > When you run the inject and generate commands, do you see your site
> > > > > > > being added in the console output?
> > > > > > > Also, while fetching and parsing you should be able to see the number of
> > > > > > > successful fetch and parse actions in your console. Ideally this should
> > > > > > > be equal to or more than the number of sites you've put in the seed.txt
> > > > > > > file.
> > > > > > > If this is not the case then there is some issue with either your
> > > > > > > seed.txt file or the regex-urlfilter file.
> > > > > > >
> > > > > > > While running the crawl command, you don't need to index to solr
> > > > > > > separately. The command will do it for you.
> > > > > > > Run ./crawl to see usage instructions.
> > > > > > >
> > > > > > > On Mon, 22 Feb 2016, 11:41 Tom Running <[email protected]> wrote:
> > > > > > > > Yes, I did run these before running ./nutch solrindex
> > > > > > > > http://localhost:8983/solr/ -all and got nothing.
> > > > > > > >
> > > > > > > > From /home/nutch/runtime/local/bin/
> > > > > > > >
> > > > > > > > ./nutch inject ../urls/seed.txt
> > > > > > > > ./nutch readdb
> > > > > > > > ./nutch generate -topN 2500
> > > > > > > > ./nutch fetch -all
> > > > > > > > ./nutch parse -all
> > > > > > > > ./nutch updatedb
> > > > > > > >
> > > > > > > > I did not run the crawl command.
> > > > > > > >
> > > > > > > > Would I just run ./crawl ??
> > > > > > > > Then run this again: ./nutch solrindex http://localhost:8983/solr/ -all
> > > > > > > >
> > > > > > > > Thank you very much for responding to my questions.
> > > > > > > >
> > > > > > > > Tom
> > > > > > > >
> > > > > > > > On Sun, Feb 21, 2016 at 11:25 PM, Binoy Dalal <[email protected]> wrote:
> > > > > > > > > Just to be clear, you did run the preceding nutch commands to
> > > > > > > > > inject, generate, fetch and parse the URLs, right?
> > > > > > > > >
> > > > > > > > > Additionally, try the ./crawl command to directly crawl and index
> > > > > > > > > everything to solr without having to manually run all the steps.
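Putting the usage string and Binoy's parameter answers together, an invocation of the crawl script looks like the following. All the values here are examples taken from or implied by the thread, not a prescribed command:

```
# Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
#   seedDir          directory containing seed.txt (here: ../urls/)
#   crawlID          any identifier you choose (here: TestCrawler)
#   solrUrl          the Solr endpoint to index into
#   numberOfRounds   how many generate/fetch/parse/updatedb cycles to run (>=1)
./crawl ../urls/ TestCrawler http://localhost:8983/solr/ 2
```

The script runs inject once, then repeats generate, fetch, parse, and updatedb for each round, and finally indexes to the given Solr URL, which is why a separate solrindex run is unnecessary.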
> > > > > > > > >
> > > > > > > > > On Mon, 22 Feb 2016, 07:24 Tom Running <[email protected]> wrote:
> > > > > > > > > > I am trying to get Nutch to run solrindex and am having a problem.
> > > > > > > > > > I am using the instructions from this document:
> > > > > > > > > > http://wiki.apache.org/nutch/Nutch2Tutorial. Everything works
> > > > > > > > > > except when I run the following command:
> > > > > > > > > >
> > > > > > > > > > ./nutch solrindex http://localhost:8983/solr -all
> > > > > > > > > >
> > > > > > > > > > ****** It came back with the following info ******
> > > > > > > > > > ****** It seems to have a problem with indexing ******
> > > > > > > > > > IndexingJob: starting
> > > > > > > > > > Active IndexWriters :
> > > > > > > > > > SOLRIndexWriter
> > > > > > > > > >         solr.server.url : URL of the SOLR instance (mandatory)
> > > > > > > > > >         solr.commit.size : buffer size when sending to SOLR (default 1000)
> > > > > > > > > >         solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > > > > > > > > >         solr.auth : use authentication (default false)
> > > > > > > > > >         solr.auth.username : username for authentication
> > > > > > > > > >         solr.auth.password : password for authentication
> > > > > > > > > > IndexingJob: done.
> > > > > > > > > >
> > > > > > > > > > When I launch the Solr web UI, I cannot query or find anything
> > > > > > > > > > under the default collection1 or the gettingstarted_shard1_replica1
> > > > > > > > > > or gettingstarted_shard2_replica1 collections.
> > > > > > > > > >
> > > > > > > > > > I have also tried this option (with collection1) and still am not
> > > > > > > > > > able to query anything:
> > > > > > > > > > ./nutch solrindex http://localhost:8983/solr/collection1 -all
> > > > > > > > > >
> > > > > > > > > > After downloading SOLR 4.10.3 I started it as is with the command
> > > > > > > > > > /home/solr/bin/solr start -e cloud -noprompt
> > > > > > > > > >
> > > > > > > > > > I did not modify any configuration files or post any files or
> > > > > > > > > > directories from within SOLR. I am assuming this command, ./nutch
> > > > > > > > > > solrindex http://localhost:8983/solr/collection1 -all, will do all
> > > > > > > > > > the posting and indexing for SOLR.
> > > > > > > > > >
> > > > > > > > > > Any idea what I am missing here? Any advice on where to go from
> > > > > > > > > > here would be greatly appreciated.
> > > > > > > > > >
> > > > > > > > > > I did try copying /nutch/runtime/local/conf/*.* into SOLR and it
> > > > > > > > > > did not make any difference.
> > > > > > > > > >
> > > > > > > > > > Thank you.
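One note on the cloud example mentioned above: the shard names gettingstarted_shard1_replica1 and gettingstarted_shard2_replica1 suggest the cloud start created a collection named gettingstarted, so pointing solrindex at the bare root URL (or at a core that does not exist) would not land documents in it. Assuming that example collection name, targeting it explicitly would look like:

```
# Hypothetical: index into the collection created by the cloud example.
# The collection name "gettingstarted" is inferred from the shard names
# mentioned in the thread; adjust it to match your actual collection.
./nutch solrindex http://localhost:8983/solr/gettingstarted -all
```

The schema fix (adding meta_description, or copying Nutch's schema.xml into the collection's conf) is still required regardless of which collection is targeted.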
> > > > > > > > > > Tom
> > > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > > Binoy Dalal
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Binoy Dalal
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Binoy Dalal
> > >
> > > --
> > > Regards,
> > > Binoy Dalal
>
> --
> Regards,
> Binoy Dalal

