Hi Ray,

Haha, the documentation :) Let's hope it gets better, or we'll all need superhuman problem-solving abilities. But perhaps you're on a better path by making a cookbook and contributing as you go...

Anyway, I happen to be working on it right now, so I can help you troubleshoot. As I said earlier, you need to check the Solr logs, which you can get either from the Solr directory directly or from the webapp logs. They will tell you if there's a schema mismatch or something else. Post the log and we can all take a look.
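
Something along these lines should surface the relevant entries. It's only a minimal sketch: I'm assuming Solr writes to server/logs/solr.log under its install directory, and the /opt/solr prefix is a guess on my part, so adjust it to wherever your cookbook puts Solr. The function name recent_errors is made up for this example.

```shell
#!/bin/sh
# Print the last few error-looking lines from a log file.
# Usage: recent_errors LOGFILE [COUNT]
recent_errors() {
    logfile="$1"
    count="${2:-20}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # Match ERROR-level entries and Java exception lines.
    grep -E 'ERROR|Exception' "$logfile" | tail -n "$count"
}

# Assumed location; /opt/solr is a guess, not taken from your cookbook.
SOLR_LOG="${SOLR_LOG:-/opt/solr/server/logs/solr.log}"
if [ -r "$SOLR_LOG" ]; then
    recent_errors "$SOLR_LOG" 20
fi
```

The same function can be pointed at Nutch's own log, which in a local runtime usually sits at logs/hadoop.log under runtime/local. The "job failed" message on the console tends to be generic, so the underlying cause is more likely to show up there.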

As to your second question, I think I had a similar problem, and we're both in luck because jsoup-extractor just came out. It can parse HTML with CSS selectors, and I think there should be a way to mark the indexed metadata as outlinks to include in the next round of search.
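
Before wiring in the extractor, it might be worth confirming whether any indexed document carries the field at all. This is a rough sketch, assuming the core URL from your crawl command and that the schema's key field is called id (I haven't checked your schema). field_query_url is a made-up helper, and RUN_QUERY is just a made-up switch so the script doesn't hit Solr unless asked.

```shell
#!/bin/sh
# Build a Solr select URL that returns documents carrying a given field.
# Usage: field_query_url CORE_URL FIELD [ROWS]
field_query_url() {
    core_url="$1"
    field="$2"
    rows="${3:-5}"
    # "field:*" matches any document that has a value in that field.
    printf '%s/select?q=%s:*&fl=id,%s&rows=%s&wt=json\n' \
        "$core_url" "$field" "$field" "$rows"
}

# Core URL from the crawl command; field name from the schema change.
url=$(field_query_url "http://127.0.0.1:8983/solr/nutch" "auctionLocation")
echo "$url"

# Only hit the network when explicitly asked to (RUN_QUERY is hypothetical).
if [ "${RUN_QUERY:-0}" = "1" ]; then
    curl -s "$url"
fi
```

If the response comes back with no documents, the field is declared in Solr but nothing on the Nutch side is sending a value for it, which would match what you're seeing. Declaring the field in the schema alone doesn't populate it.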

Hope this helps! Let me know if I missed something,

Michael



On 08/15/2017 10:15 PM, Ray Crawford wrote:
The documentation is a little bit tough... :)

Really, I couldn't find a clear path for the novice from point A to point
B.  Because of this, I'm hoping this Chef Cookbook can be the tool.

Here's what I have so far:
https://github.com/raycrawford/cb_rayCrawford_nutch2

Two problems.  First, when I run the following, documents get into Solr, but
the dedup step fails with the output below:
cd /opt/nutch/runtime/local/bin
export JAVA_HOME='/etc/alternatives/jre_1.8.0'
/opt/hbase/bin/start-hbase.sh
mkdir urls
echo "http://www.bidfta.com/" > /opt/nutch/runtime/local/bin/urls/seed.txt
/opt/nutch/runtime/local/bin/nutch inject urls/seed.txt
/opt/nutch/runtime/local/bin/crawl ./urls nutch http://127.0.0.1:8983/solr/nutch 3


DbUpdaterJob: finished at 2017-08-16 05:01:46, time elapsed: 00:00:05

Indexing nutch on SOLR index -> http://127.0.0.1:8983/solr/nutch

/opt/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D solr.server.url=
http://127.0.0.1:8983/solr/nutch -all -crawlId nutch

IndexingJob: starting

Active IndexWriters :

SOLRIndexWriter

solr.server.url : URL of the SOLR instance (mandatory)

solr.commit.size : buffer size when sending to SOLR (default 1000)

solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)

solr.auth : use authentication (default false)

solr.auth.username : username for authentication

solr.auth.password : password for authentication

IndexingJob: done.

SOLR dedup -> http://127.0.0.1:8983/solr/nutch

/opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch

Exception in thread "main" java.lang.RuntimeException: job failed:
name=apache-nutch-2.3.1.jar, jobid=job_local491881398_0001

at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)

at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:383)

at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.run(SolrDeleteDuplicates.java:393)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates.main(SolrDeleteDuplicates.java:403)

Error running:

   /opt/nutch/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://127.0.0.1:8983/solr/nutch

Failed with exit value 1.
---

Second, the site I'm indexing is essentially 3 layers deep.  The first one
has a field on it '<p class="auctionLocation">'. All other children of that
page relate to the following link, but do not have that data on them. What
I would like to do is capture the <p class="auctionLocation"> data and
relate it to all children of that block. I altered the managed schema to
include '<field name="auctionLocation" type="strings"/>', but it doesn't
seem to be adding that to the index.  Also, I don't know how to add that to
the children pages.

What I'm asking here is two parts.  I realize the first part is a
nutch2/Solr integration thing and the second is a solr thing, but hopefully
y'all can help me figure this out...

Thanks!

On Tue, Aug 15, 2017 at 10:34 AM, Sebastian Nagel <
wastl.na...@googlemail.com> wrote:

Hi Alex,

no problem. Let's be productive and work!

Best,
Sebastian


On 08/15/2017 04:22 PM, Alejandro Caceres wrote:
Hey Sebastian,

I was just giving Lewis s*** because I know him personally :P. I'm aware
this is an open source project and we're all in this together! No one likes
writing docs... I should probably be working on my own docs right now.

Alex

On Tue, Aug 15, 2017 at 5:39 AM, Sebastian Nagel <
wastl.na...@googlemail.com> wrote:
Hi Alex,

I would like to state that it's *your* documentation as well,
as you're part of the community if you're following this list.

If I had the time to rewrite the tutorials and documentation
(and no open issues on Jira), no question, I probably would
work on it. If you have spare time, you're invited to improve
the documentation in any way you can. Just ask for access to
the Nutch wiki.

Thanks,
Sebastian

On 08/14/2017 09:10 PM, Alejandro Caceres wrote:
hey Lewis,

I think he's just trying to say that your documentation sucks :D. Glad I
could clarify.

Alex

On Mon, Aug 14, 2017 at 3:03 PM, lewis john mcgibbney <
lewi...@apache.org>
wrote:

Hi Ray,
Apart from not being able to find a tutorial, what is wrong exactly?
New users of Nutch are advised to use the Nutch 1.X series.
The Nutch 2.X tutorial introduces more moving parts. This has been well
documented on this mailing list for a number of years now.
If you can enumerate what is wrong, we will help you out.
Thanks
Lewis

On Sun, Aug 13, 2017 at 8:49 PM, <user-digest-h...@nutch.apache.org>
wrote:

From: Ray Crawford <ray.crawf...@gmail.com>
To: user@nutch.apache.org
Date: Sun, 13 Aug 2017 23:48:59 -0400
Subject: I'm just going to throw this out there...
And it may get me banned, but so be it.

I've been trying to get a Nutch/Solr setup running and, after many hours of
cruising StackOverflow, this list and many documentation sites which talked
about various versions, I've got nothing to show for it.

Why is this so complex and why is a reasonable set of documentation about
how to integrate the solutions so hard to find?

Can anyone point me to an ACCURATE Nutch 2.3/Solr tutorial?  If someone can
help me here, I'll write a Chef cookbook that automates the whole thing.
However, I can't get any of the tutorials I've tried so far to work.

Thanks and hopefully the community will help me (and others) work through
this or absolve me of my apparent ignorance.

- Ray.



--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney






