Re: I'm just going to throw this out there...

Michael Chen Tue, 15 Aug 2017 22:24:48 -0700

Hi Ray,

Haha, the documentation :) Let's hope it gets better, or we'll all need superhuman problem-solving abilities. But perhaps you're on a better path by making a cookbook and contributing as you go...

Anyway, I happen to be working on it rn so I can help you troubleshoot some stuff. As I said earlier, you need to go to the Solr logs, which you can get either from the Solr directory directly or from the webapp logs. They will tell you if there's a schema mismatch or something else. Post the log and we can all take a look.
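
If the log is long, something like the sketch below can narrow it down before you post it. The function name solr_errors is made up for illustration, and the log path is whatever your install uses (I'm not assuming a particular location); it just prints the lines mentioning ERROR or Exception, with line numbers.

```shell
# Hypothetical helper: list ERROR/Exception lines (with line numbers)
# from a log file so the relevant part is easy to find and paste.
solr_errors() {
  # $1 is the path to the log file; adjust it to wherever your Solr
  # or webapp actually writes its logs.
  grep -n -E 'ERROR|Exception' "$1"
}
```

Call it as solr_errors /path/to/your/solr.log and post the lines around whatever it reports.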

As to your second question, I think I had a similar problem and we're both in luck because jsoup-extractor just came out. It can parse HTML with CSS selectors, and I think there should be a way to mark the indexed metadata as outlinks to include in the next round of search.
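
Before setting up any selector, it may be worth checking that the fetched page really contains the element you expect. The sketch below is only a rough regex check, not a real HTML parser and not how jsoup-extractor works: the function name auction_location is made up, and it only matches a <p class="auctionLocation"> whose text sits on one line with no nested tags.

```shell
# Hypothetical helper: read HTML on stdin and print the text of each
# single-line <p class="auctionLocation">...</p> element it finds.
auction_location() {
  # grep -o keeps only the matching element; sed then strips the tags.
  grep -o '<p class="auctionLocation">[^<]*</p>' | sed -e 's/<[^>]*>//g'
}
```

You could feed it a saved copy of the page, or pipe curl -s output for the seed URL into it. If it prints nothing, the markup probably differs from what the schema field assumes.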


Hope this helps! Let me know if I missed something,

Michael



On 08/15/2017 10:15 PM, Ray Crawford wrote:

The documentation is a little bit tough... :)

Really, I couldn't find a clear path for the novice from point A to point
B.  Because of this, I'm hoping this Chef Cookbook can be the tool.

Here's what I have so far:
https://github.com/raycrawford/cb_rayCrawford_nutch2

Two problems.  First, when I run the following, stuff gets into Solr, but
the run ends with the error below:
cd /opt/nutch/runtime/local/bin
export JAVA_HOME='/etc/alternatives/jre_1.8.0'
/opt/hbase/bin/start-hbase.sh
mkdir urls
echo "http://www.bidfta.com/" > /opt/nutch/runtime/local/bin/urls/seed.txt
/opt/nutch/runtime/local/bin/nutch inject urls/seed.txt
/opt/nutch/runtime/local/bin/crawl ./urls nutch http://127.0.0.1:8983/solr/nutch 3


DbUpdaterJob: finished at 2017-08-16 05:01:46, time elapsed: 00:00:05

Indexing nutch on SOLR index -> http://127.0.0.1:8983/solr/nutch

/opt/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D solr.server.url=
http://127.0.0.1:8983/solr/nutch -all -crawlId nutch

IndexingJob: starting

Active IndexWriters :

SOLRIndexWriter

solr.server.url : URL of the SOLR instance (mandatory)

solr.commit.size : buffer size when sending to SOLR (default 1000)

solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)

solr.auth : use authentication (default false)

solr.auth.username : username for authentication

solr.auth.password : password for authentication

IndexingJob: done.

SOLR dedup -> http://127.0.0.1:8983/solr/nutch

/opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch

Exception in thread "main" java.lang.RuntimeException: job failed:
name=apache-nutch-2.3.1.jar, jobid=job_local491881398_0001

at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)

at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)

at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)

Error running:

   /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch

Failed with exit value 1.
---

Second, the site I'm indexing is essentially 3 layers deep.  The first one
has a field on it '<p class="auctionLocation">'. All other children of that
page relate to the following link, but do not have that data on them. What
I would like to do is capture the <p class="auctionLocation"> data and
relate it to all children of that block. I altered the managed schema to
include '<field name="auctionLocation" type="strings"/>', but it doesn't
seem to be adding that to the index.  Also, I don't know how to add that to
the children pages.

What I'm asking here is two parts.  I realize the first part is a
nutch2/Solr integration thing and the second is a solr thing, but hopefully
y'all can help me figure this out...

Thanks!

On Tue, Aug 15, 2017 at 10:34 AM, Sebastian Nagel <
wastl.na...@googlemail.com> wrote:

Hi Alex,

no problem. Let's be productive and work!

Best,
Sebastian


On 08/15/2017 04:22 PM, Alejandro Caceres wrote:

Hey Sebastian,

I was just giving Lewis s*** because I know him personally :P. I'm aware
this is an open source project and we're all in this together! No one likes
writing docs..... I should probably be working on my own docs right now.

Alex

On Tue, Aug 15, 2017 at 5:39 AM, Sebastian Nagel <
wastl.na...@googlemail.com> wrote:

Hi Alex,

I would like to state that it's *your* documentation as well,
as you're part of the community if following this list.

If I had the time to rewrite the tutorials and documentation
(and no open issues on Jira), no question, I probably would
work on it. If you have spare time, you're invited to improve
the documentation in any way you can. Just ask for access to
the Nutch wiki.

Thanks,
Sebastian

On 08/14/2017 09:10 PM, Alejandro Caceres wrote:

hey Lewis,

I think he's just trying to say that your documentation sucks :D. Glad I
could clarify.

Alex

On Mon, Aug 14, 2017 at 3:03 PM, lewis john mcgibbney <lewi...@apache.org>
wrote:

Hi Ray,
Apart from not being able to find a tutorial, what is wrong exactly?
New users of Nutch are advised to use the Nutch 1.X series.
The Nutch 2.X tutorial introduces more moving parts. This is well
documented on this mailing list for a number of years now.
If you can enumerate what is wrong, we will help you out.
Thanks
Lewis

On Sun, Aug 13, 2017 at 8:49 PM, <user-digest-h...@nutch.apache.org>
wrote:

From: Ray Crawford <ray.crawf...@gmail.com>
To: user@nutch.apache.org
Cc:
Bcc:
Date: Sun, 13 Aug 2017 23:48:59 -0400
Subject: I'm just going to throw this out there...
And it may get me banned, but so be it.

I've been trying to get a Nutch/Solr setup running and, after many hours of
cruising StackOverflow, this list and many documentation sites which talked
about various versions, I've got nothing to show for it.

Why is this so complex and why is a reasonable set of documentation about
how to integrate the solutions so hard to find?

Can anyone point me to an ACCURATE Nutch 2.3/Solr tutorial?  If someone
can help me here, I'll write a Chef cookbook that automates the whole
thing.  However, I can't get any of the tutorials I've tried so far to
work.

Thanks and hopefully the community will help me (and others) work through
this or absolve me of my apparent ignorance.

- Ray.


--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
