On Wednesday, August 16, 2017, Michael Chen < [email protected]> wrote:
> Hi Ray,
>
> Haha the documentations :) Let's hope that it'll get better or we'll all
> need super human problem solving abilities. But perhaps you're on a better
> path by making a cookbook and contributing as you go...
>
> Anyway, I happen to be working on it rn so I can help you troubleshoot
> some stuff. As I said earlier you need to go to Solr logs, which you can
> get either from the Solr directory directly or look in the webapp logs. It
> will tell you if there's a schema mismatch or something else. Post the log
> and we can all take a look.
>
> As to your second question, I think I had a similar problem and we're both
> in luck because jsoup-extractor just came out. It can parse HTML with CSS
> selectors and I think there should be a way to mark the indexed metadata as
> outlinks to include in the next round of search.
>
> Hope this helps! let me know if I missed something,
>
> Michael
>
>
>
> On 08/15/2017 10:15 PM, Ray Crawford wrote:
>
>> The documentation is a little bit tough... :)
>>
>> Really, I couldn't find a clear path for the novice from point A to point
>> B. Because of this, I'm hoping this Chef Cookbook can be the tool.
>>
>> Here's what I have so far:
>> https://github.com/raycrawford/cb_rayCrawford_nutch2
>>
>> Two problems.
>> When I do the following, stuff gets into Solr, but it
>> results in:
>> cd /opt/nutch/runtime/local/bin
>> export JAVA_HOME='/etc/alternatives/jre_1.8.0'
>> /opt/hbase/bin/start-hbase.sh
>> mkdir urls
>> echo "http://www.bidfta.com/" > /opt/nutch/runtime/local/bin/urls/seed.txt
>> /opt/nutch/runtime/local/bin/nutch inject urls/seed.txt
>> /opt/nutch/runtime/local/bin/crawl ./urls nutch http://127.0.0.1:8983/solr/nutch 3
>>
>>
>> DbUpdaterJob: finished at 2017-08-16 05:01:46, time elapsed: 00:00:05
>>
>> Indexing nutch on SOLR index -> http://127.0.0.1:8983/solr/nutch
>>
>> /opt/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -D
>> solr.server.url=http://127.0.0.1:8983/solr/nutch -all -crawlId nutch
>>
>> IndexingJob: starting
>>
>> Active IndexWriters :
>>
>> SOLRIndexWriter
>>
>> solr.server.url : URL of the SOLR instance (mandatory)
>>
>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>
>> solr.mapping.file : name of the mapping file for fields (default
>> solrindex-mapping.xml)
>>
>> solr.auth : use authentication (default false)
>>
>> solr.auth.username : username for authentication
>>
>> solr.auth.password : password for authentication
>>
>> IndexingJob: done.
>>
>> SOLR dedup -> http://127.0.0.1:8983/solr/nutch
>>
>> /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
>>
>> Exception in thread "main" java.lang.RuntimeException: job failed:
>> name=apache-nutch-2.3.1.jar, jobid=job_local491881398_0001
>>
>> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>>
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)
>>
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)
>>
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)
>>
>> Error running:
>>
>> /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
>>
>> Failed with exit value 1.
>> ---
>>
>> Second, the site I'm indexing is essentially 3 layers deep. The first one
>> has a field on it '<p class="auctionLocation">'. All other children of that
>> page relate to the following link, but do not have that data on them. What
>> I would like to do is capture the <p class="auctionLocation"> data and
>> relate it to all children of that block. I altered the managed schema to
>> include '<field name="auctionLocation" type="strings"/>', but it doesn't
>> seem to be adding that to the index. Also, I don't know how to add that to
>> the children pages.
>>
>> What I'm asking here is two parts.
>> I realize the first part is a
>> nutch2/Solr integration thing and the second is a solr thing, but
>> hopefully y'all can help me figure this out...
>>
>> Thanks!
>>
>> On Tue, Aug 15, 2017 at 10:34 AM, Sebastian Nagel <[email protected]> wrote:
>>
>>> Hi Alex,
>>>
>>> no problem. Let's be productive and work!
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>> On 08/15/2017 04:22 PM, Alejandro Caceres wrote:
>>>
>>>> Hey Sebastian,
>>>>
>>>> I was just giving Lewis s*** because I know him personally :P. I'm aware
>>>> this is an open source project and we're all in this together! No one likes
>>>> writing docs..... I should probably be working on my own docs right now.
>>>>
>>>> Alex
>>>>
>>>> On Tue, Aug 15, 2017 at 5:39 AM, Sebastian Nagel <[email protected]> wrote:
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> I would like to state that it's *your* documentation as well,
>>>>> as you're part of the community if following this list.
>>>>>
>>>>> If I had the time to rewrite the tutorials and documentation
>>>>> (and no open issues on Jira), no question, I probably would
>>>>> work on it. If you have spare time, you're invited to improve
>>>>> the documentation in any way you can. Just ask for access to
>>>>> the Nutch wiki.
>>>>>
>>>>> Thanks,
>>>>> Sebastian
>>>>>
>>>>> On 08/14/2017 09:10 PM, Alejandro Caceres wrote:
>>>>>
>>>>>> hey Lewis,
>>>>>>
>>>>>> I think he's just trying to say that your documentation sucks :D. Glad I
>>>>>> could clarify.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Mon, Aug 14, 2017 at 3:03 PM, lewis john mcgibbney <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Ray,
>>>>>>> Apart from not being able to find a tutorial, what is wrong exactly?
>>>>>>> New users of Nutch are advised to use the Nutch 1.X series.
>>>>>>> The Nutch 2.X tutorial introduces more moving parts. This is well
>>>>>>> documented on this mailing list for a number of years now.
>>>>>>> If you can enumerate what is wrong, we will help you out.
>>>>>>> Thanks
>>>>>>> Lewis
>>>>>>>
>>>>>>> On Sun, Aug 13, 2017 at 8:49 PM, <[email protected]> wrote:
>>>>>>>
>>>>>>>> From: Ray Crawford <[email protected]>
>>>>>>>> To: [email protected]
>>>>>>>> Cc:
>>>>>>>> Bcc:
>>>>>>>> Date: Sun, 13 Aug 2017 23:48:59 -0400
>>>>>>>> Subject: I'm just going to throw this out there...
>>>>>>>> And it may get me banned, but so be it.
>>>>>>>>
>>>>>>>> I've been trying to get a Nutch/Solr setup running and, after many hours of
>>>>>>>> cruising StackOverflow, this list and many documentation sites which talked
>>>>>>>> about various versions, I've got nothing to show for it.
>>>>>>>>
>>>>>>>> Why is this so complex and why is a reasonable set of documentation about
>>>>>>>> how to integrate the solutions so hard to find?
>>>>>>>>
>>>>>>>> Can anyone point me to an ACCURATE Nutch 2.3/Solr tutorial? If someone
>>>>>>>> can help me here, I'll write a Chef cookbook that automates the whole
>>>>>>>> thing. However, I can't get any of the tutorials I've tried so far to
>>>>>>>> work.
>>>>>>>>
>>>>>>>> Thanks and hopefully the community will help me (and others) work through
>>>>>>>> this or absolve me of my apparent ignorance.
>>>>>>>>
>>>>>>>> - Ray.
>>>>>>>>
>>>>>>> --
>>>>>>> http://home.apache.org/~lewismc/
>>>>>>> @hectorMcSpector
>>>>>>> http://www.linkedin.com/in/lmcgibbney
>>>>>>>

As others have suggested, using Nutch 1.x is the way to go. A problem with
Nutch 2.x is that all the pluggable components are version specific. For
example, the Cassandra support uses Gora and a really old version of
Cassandra. HBase is a similar story: the latest HBase has breaking API
changes. The management server won't try/catch problems well. Sections won't
load or work until you figure out the root cause, and the logging needed to
catch the problems seems to be off by default.

--
Sorry this was sent from mobile. Will do less grammar and spell check than usual.
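
On Ray's second question (capturing the <p class="auctionLocation"> value and relating it to the child pages), a minimal sketch of the idea may help when discussing it. This is plain Python using only the standard library; it is not jsoup-extractor configuration and not Nutch code. The class name AuctionBlockParser, the function locations_for_children, and the sample HTML in the usage note are all made up for illustration. It assumes each child link appears after its location paragraph in document order, which may not match the real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AuctionBlockParser(HTMLParser):
    """Hypothetical helper: remember the most recent
    <p class="auctionLocation"> text and attach it to every link after it."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_location = False      # True while inside the location <p>
        self.current_location = None  # text of the latest location seen
        self.child_locations = {}     # absolute child URL -> location text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "p" and "auctionLocation" in classes:
            # Start collecting a new location value.
            self.in_location = True
            self.current_location = ""
        elif tag == "a" and attrs.get("href") and self.current_location is not None:
            # Assumption: a link that follows a location belongs to it.
            url = urljoin(self.base_url, attrs["href"])
            self.child_locations[url] = self.current_location.strip()

    def handle_endtag(self, tag):
        if tag == "p" and self.in_location:
            self.in_location = False

    def handle_data(self, data):
        if self.in_location:
            self.current_location += data


def locations_for_children(html, base_url):
    """Return a dict mapping each child URL to the location text before it."""
    parser = AuctionBlockParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.child_locations
```

Usage, with invented markup: calling locations_for_children('<p class="auctionLocation">Somewhere</p><a href="/auction/1">A</a>', "http://www.bidfta.com/") gives a mapping from the absolute child URL to "Somewhere". In a real crawl the equivalent step would have to pass that value along as metadata on the outlink so it can be indexed with the child page; how to do that in Nutch 2.x is exactly the open question in this thread.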

