Re: I'm just going to throw this out there...

Edward Capriolo Sun, 20 Aug 2017 06:25:18 -0700

On Wednesday, August 16, 2017, Michael Chen <
[email protected]> wrote:


> Hi Ray,
>
> Haha the documentations :) Let's hope that it'll get better or we'll all
> need super human problem solving abilities. But perhaps you're on a better
> path by making a cookbook and contributing as you go...
>
> Anyway, I happen to be working on it rn so I can help you troubleshoot
> some stuff. As I said earlier you need to go to Solr logs, which you can
> get either from the Solr directory directly or look in the webapp logs. It
> will tell you if there's a schema mismatch or something else. Post the log
> and we can all take a look.
>
> As to your second question, I think I had a similar problem and we're both
> in luck because jsoup-extractor just came out. It can parse HTML with CSS
> selectors and I think there should be a way to mark the indexed metadata as
> outlinks to include in the next round of search.
>
> Hope this helps! let me know if I missed something,
>
> Michael
>
>
>
> On 08/15/2017 10:15 PM, Ray Crawford wrote:
>
>> The documentation is a little bit tough... :)
>>
>> Really, I couldn't find a clear path for the novice from point A to point
>> B.  Because of this, I'm hoping this Chef Cookbook can be the tool.
>>
>> Here's what I have so far:
>> https://github.com/raycrawford/cb_rayCrawford_nutch2
>>
>> Two problems.  When I do the following, stuff gets into Solr, but it
>> results in:
>> cd /opt/nutch/runtime/local/bin
>> export JAVA_HOME='/etc/alternatives/jre_1.8.0'
>> /opt/hbase/bin/start-hbase.sh
>> mkdir urls
>> echo "http://www.bidfta.com/" > /opt/nutch/runtime/local/bin/urls/seed.txt
>> /opt/nutch/runtime/local/bin/nutch inject urls/seed.txt
>> /opt/nutch/runtime/local/bin/crawl ./urls nutch http://127.0.0.1:8983/solr/nutch 3
>>
>>
>> DbUpdaterJob: finished at 2017-08-16 05:01:46, time elapsed: 00:00:05
>>
>> Indexing nutch on SOLR index -> http://127.0.0.1:8983/solr/nutch
>>
>> /opt/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true -D
>> solr.server.url=http://127.0.0.1:8983/solr/nutch -all -crawlId nutch
>>
>> IndexingJob: starting
>>
>> Active IndexWriters :
>>
>> SOLRIndexWriter
>>
>> solr.server.url : URL of the SOLR instance (mandatory)
>>
>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>
>> solr.mapping.file : name of the mapping file for fields (default
>> solrindex-mapping.xml)
>>
>> solr.auth : use authentication (default false)
>>
>> solr.auth.username : username for authentication
>>
>> solr.auth.password : password for authentication
>>
>> IndexingJob: done.
>>
>> SOLR dedup -> http://127.0.0.1:8983/solr/nutch
>>
>> /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
>>
>> Exception in thread "main" java.lang.RuntimeException: job failed:
>> name=apache-nutch-2.3.1.jar, jobid=job_local491881398_0001
>> at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)
>>
>> Error running:
>>
>>    /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2
>> -D
>> mapred.child.java.opts=-Xmx1000m -D
>> mapred.reduce.tasks.speculative.execution=false -D
>> mapred.map.tasks.speculative.execution=false -D
>> mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch
>>
>> Failed with exit value 1.
>> ---
>>
>> Second, the site I'm indexing is essentially 3 layers deep.  The first one
>> has a field on it '<p class="auctionLocation">'. All other children of that
>> page relate to the following link, but do not have that data on them. What
>> I would like to do is capture the <p class="auctionLocation"> data and
>> relate it to all children of that block. I altered the managed schema to
>> include '<field name="auctionLocation" type="strings"/>', but it doesn't
>> seem to be adding that to the index.  Also, I don't know how to add that to
>> the children pages.
>>
>> What I'm asking here is two parts.  I realize the first part is a
>> nutch2/Solr integration thing and the second is a solr thing, but hopefully
>> y'all can help me figure this out...
>>
>> Thanks!
>>
>> On Tue, Aug 15, 2017 at 10:34 AM, Sebastian Nagel <
>> [email protected]> wrote:
>>
>>> Hi Alex,
>>>
>>> no problem. Let's be productive and work!
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>> On 08/15/2017 04:22 PM, Alejandro Caceres wrote:
>>>
>>>> Hey Sebastian,
>>>>
>>>> I was just giving Lewis s*** because I know him personally :P. I'm aware
>>>> this is an open source project and we're all in this together! No one
>>>> likes writing docs..... I should probably be working on my own docs right
>>>> now.
>>>>
>>>> Alex
>>>>
>>>> On Tue, Aug 15, 2017 at 5:39 AM, Sebastian Nagel <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> I would like to state that it's *your* documentation as well,
>>>>> as you're part of the community if following this list.
>>>>>
>>>>> If I had the time to rewrite the tutorials and documentation
>>>>> (and no open issues on Jira), no question, I probably would
>>>>> work on it. If you have spare time, you're invited to improve
>>>>> the documentation in any way you can. Just ask for access to
>>>>> the Nutch wiki.
>>>>>
>>>>> Thanks,
>>>>> Sebastian
>>>>>
>>>>> On 08/14/2017 09:10 PM, Alejandro Caceres wrote:
>>>>>
>>>>>> hey Lewis,
>>>>>>
>>>>>> I think he's just trying to say that your documentation sucks :D. Glad I
>>>>>> could clarify.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Mon, Aug 14, 2017 at 3:03 PM, lewis john mcgibbney <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ray,
>>>>>>> Apart from not being able to find a tutorial, what is wrong exactly?
>>>>>>> New users of Nutch are advised to use the Nutch 1.X series.
>>>>>>> The Nutch 2.X tutorial introduces more moving parts. This is well
>>>>>>> documented on this mailing list for a number of years now.
>>>>>>> If you can enumerate what is wrong, we will help you out.
>>>>>>> Thanks
>>>>>>> Lewis
>>>>>>>
>>>>>>> On Sun, Aug 13, 2017 at 8:49 PM, <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> From: Ray Crawford <[email protected]>
>>>>>>>> To: [email protected]
>>>>>>>> Cc:
>>>>>>>> Bcc:
>>>>>>>> Date: Sun, 13 Aug 2017 23:48:59 -0400
>>>>>>>> Subject: I'm just going to throw this out there...
>>>>>>>> And it may get me banned, but so be it.
>>>>>>>>
>>>>>>>> I've been trying to get a Nutch/Solr setup running and, after many hours
>>>>>>>> of cruising StackOverflow, this list and many documentation sites which
>>>>>>>> talked about various versions, I've got nothing to show for it.
>>>>>>>>
>>>>>>>> Why is this so complex and why is a reasonable set of documentation about
>>>>>>>> how to integrate the solutions so hard to find?
>>>>>>>>
>>>>>>>> Can anyone point me to an ACCURATE Nutch 2.3/Solr tutorial?  If someone
>>>>>>>> can help me here, I'll write a Chef cookbook that automates the whole
>>>>>>>> thing.  However, I can't get any of the tutorials I've tried so far to
>>>>>>>> work.
>>>>>>>>
>>>>>>>> Thanks and hopefully the community will help me (and others) work through
>>>>>>>> this or absolve me of my apparent ignorance.
>>>>>>>>
>>>>>>>> - Ray.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> http://home.apache.org/~lewismc/
>>>>>>> @hectorMcSpector
>>>>>>> http://www.linkedin.com/in/lmcgibbney
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
As others have suggested, using Nutch 1.x is the way to go. A problem with
Nutch 2.x is that all the pluggable components are version specific.

For example, the Cassandra support uses Gora and a really old version of
Cassandra.

HBase is a similar story: the latest HBase has breaking API changes.

The management server won't catch problems well. Sections won't load or
work until you figure out the root cause, and the logging needed to catch the
problems seems to be off by default.




-- 
Sorry this was sent from mobile. Will do less grammar and spell check than
usual.
