It's a little more general and easier to not screw up ;-) If that's not acceptable for your purposes, let us know - I'm sure someone could help with the specific regexes.
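For what it's worth, a specific rule in conf/regex-urlfilter.txt would look roughly like this (only a sketch - the host and path below are placeholders for the urls to be dropped; rules are tried top to bottom and the first match wins, so it has to sit above the final '+.' accept-everything line):

# drop the unwanted pages
-^http://www\.example\.com/some/section/.*
# keep everything else
+.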
On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper ianpi...@tellura.co.uk wrote:
Hi all, I have been trying to get to the bottom of t
to eliminate. It seems to me that your regex does not eliminate the type of
urls you specified.
Alex.
-----Original Message-----
From: Ian Piper ianpi...@tellura.co.uk
To: user user@nutch.apache.org
Sent: Mon, Jul 30, 2012 1:52 pm
Subject: Re: Why won't my crawl ignore these urls?
Hi
If that's not acceptable for your purposes, let us know - I'm sure someone could help with the specific regexes.
On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper ianpi...@tellura.co.uk wrote:
Hi all, I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan... I have a
and the corresponding pages seem to appear in the final Solr index. So
clearly they are not being excluded.
Is anyone able to explain what I have missed? Any guidance much appreciated.
Thanks,
Ian.
--
Dr Ian Piper
Tellura Information Services - the web, document and information people
Registered in England and Wales
[...]
step-by-step description
in layman's language.
Thanks anyway.
Ian.
--
On 23 Apr 2012, at 23:57, remi tassing wrote:
Have you read this?
http://wiki.apache.org/nutch/NutchTutorial/
You can put all commands in a shell script.
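For example, a minimal sketch following the step-by-step commands on that page (the seed directory, -topN value and Solr URL are placeholders, and the exact solrindex arguments vary a little between 1.x versions):

#!/bin/sh
# one inject / generate / fetch / parse / updatedb round, then index to Solr
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

Wrap the generate/fetch/parse/updatedb block in a loop if you want more than one round.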
Remi
On Monday, April 23, 2012, Ian Piper wrote:
Hi all,
I have
Hi Julien,
Thanks for the message. I think you have found part of the problem - I have
this in regex-urlfilter.txt:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
I will try modifying this and re-running the crawl.
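For example, one of these ought to do it (just a sketch - the host below is a placeholder):

# either drop '?' from the default rule so query urls fall through to the '+.' accept line
-[*!@=]

# or add an explicit accept rule for the affected pages above the '-[?*!@=]' line
+^http://www\.example\.com/.*\?.*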
Ian.
--
On 23 Jan 2012, at 16:04, Julien Nioche wrote:
Hi Lewis,
Thanks for the reply. I'm using a fetch depth of 10 (which I thought would be
ample - this is not a deep site hierarchy). Here is the command I'm running:
bin/nutch crawl urls -solr [solrurl] -depth 10 -topN 5000
On 23 Jan 2012, at 16:02, Lewis John Mcgibbney wrote:
Hi Ian,
On 23 Jan 2012, at 16:04, Julien Nioche wrote:
check your URL filter: the link above contains a '?' which by default
would get the URL to be filtered out
That was definitely the problem. Nutch is happily fetching those documents now!
Thanks very much for your help.
Ian.
--
Hi all,
I'd appreciate some guidance... can't seem to find much useful stuff on the web
on this. I have set up a Nutch and Solr service that is crawling a client's
site. They have a lot of pages that are accessed with urls like this: