The documentation is a little bit tough... :) Really, I couldn't find a clear path for the novice from point A to point B. Because of this, I'm hoping this Chef Cookbook can be the tool.
Here's what I have so far:
https://github.com/raycrawford/cb_rayCrawford_nutch2

Two problems. When I do the following, stuff gets into Solr, but it results in:

cd /opt/nutch/runtime/local/bin
export JAVA_HOME='/etc/alternatives/jre_1.8.0'
/opt/hbase/bin/start-hbase.sh
mkdir urls
echo "http://www.bidfta.com/" > /opt/nutch/runtime/local/bin/urls/seed.txt
/opt/nutch/runtime/local/bin/nutch inject urls/seed.txt
/opt/nutch/runtime/local/bin/crawl ./urls nutch http://127.0.0.1:8983/solr/nutch 3

DbUpdaterJob: finished at 2017-08-16 05:01:46, time elapsed: 00:00:05
Indexing nutch on SOLR index -> http://127.0.0.1:8983/solr/nutch
/opt/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url= http://127.0.0.1:8983/solr/nutch -all -crawlId nutch
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
IndexingJob: done.
SOLR dedup -> http://127.0.0.1:8983/solr/nutch
/opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
Exception in thread "main" java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local491881398_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)
Error running:
  /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
Failed with exit value 1.

---

Second, the site I'm indexing is essentially 3 layers deep. The first one has a field on it, '<p class="auctionLocation">'. All other children of that page relate to the following link, but do not have that data on them. What I would like to do is capture the <p class="auctionLocation"> data and relate it to all children of that block.

I altered the managed schema to include '<field name="auctionLocation" type="strings"/>', but it doesn't seem to be adding that to the index. Also, I don't know how to add that to the children pages.

What I'm asking here is two parts. I realize the first part is a Nutch 2/Solr integration thing and the second is a Solr thing, but hopefully y'all can help me figure this out...

Thanks!

On Tue, Aug 15, 2017 at 10:34 AM, Sebastian Nagel <[email protected]> wrote:

> Hi Alex,
>
> no problem.
> Let's be productive and work!
>
> Best,
> Sebastian
>
>
> On 08/15/2017 04:22 PM, Alejandro Caceres wrote:
> > Hey Sebastian,
> >
> > I was just giving Lewis s*** because I know him personally :P. I'm aware
> > this is an open source project and we're all in this together! No one likes
> > writing docs..... I should probably be working on my own docs right now.
> >
> > Alex
> >
> > On Tue, Aug 15, 2017 at 5:39 AM, Sebastian Nagel <[email protected]> wrote:
> >
> >> Hi Alex,
> >>
> >> I would like to state that it's *your* documentation as well,
> >> as you're part of the community if following this list.
> >>
> >> If I had the time to rewrite the tutorials and documentation
> >> (and no open issues on Jira), no question, I probably would
> >> work on it. If you have spare time, you're invited to improve
> >> the documentation in any way you can. Just ask for access to
> >> the Nutch wiki.
> >>
> >> Thanks,
> >> Sebastian
> >>
> >> On 08/14/2017 09:10 PM, Alejandro Caceres wrote:
> >>> hey Lewis,
> >>>
> >>> I think he's just trying to say that your documentation sucks :D. Glad I
> >>> could clarify.
> >>>
> >>> Alex
> >>>
> >>> On Mon, Aug 14, 2017 at 3:03 PM, lewis john mcgibbney <[email protected]> wrote:
> >>>
> >>>> Hi Ray,
> >>>> Apart from not being able to find a tutorial, what is wrong exactly?
> >>>> New users of Nutch are advised to use the Nutch 1.X series.
> >>>> The Nutch 2.X tutorial introduces more moving parts. This is well
> >>>> documented on this mailing list for a number of years now.
> >>>> If you can enumerate what is wrong, we will help you out.
> >>>> Thanks
> >>>> Lewis
> >>>>
> >>>> On Sun, Aug 13, 2017 at 8:49 PM, <[email protected]> wrote:
> >>>>
> >>>>>
> >>>>> From: Ray Crawford <[email protected]>
> >>>>> To: [email protected]
> >>>>> Cc:
> >>>>> Bcc:
> >>>>> Date: Sun, 13 Aug 2017 23:48:59 -0400
> >>>>> Subject: I'm just going to throw this out there...
> >>>>> And it may get me banned, but so be it.
> >>>>>
> >>>>> I've been trying to get a Nutch/Solr setup running and, after many hours of
> >>>>> cruising StackOverflow, this list and many documentation sites which talked
> >>>>> about various versions, I've got nothing to show for it.
> >>>>>
> >>>>> Why is this so complex and why is a reasonable set of documentation about
> >>>>> how to integrate the solutions so hard to find?
> >>>>>
> >>>>> Can anyone point me to an ACCURATE Nutch 2.3/Solr tutorial? If someone
> >>>>> can help me here, I'll write a Chef cookbook that automates the whole
> >>>>> thing. However, I can't get any of the tutorials I've tried so far to
> >>>>> work.
> >>>>>
> >>>>> Thanks and hopefully the community will help me (and others) work through
> >>>>> this or absolve me of my apparent ignorance.
> >>>>>
> >>>>> - Ray.
> >>>>>
> >>>>
> >>>> --
> >>>> http://home.apache.org/~lewismc/
> >>>> @hectorMcSpector
> >>>> http://www.linkedin.com/in/lmcgibbney
> >>>>

