Re: Why won't my crawl ignore these urls? [SOLVED]

2012-08-03 Thread Ian Piper
* It's a little more general and easier to not screw up ;-) If that's not acceptable for your purposes, let us know; I'm sure someone could help with the specific regexes. On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper ianpi...@tellura.co.uk wrote: Hi all, I have been trying to get to the bottom of t

Re: Why won't my crawl ignore these urls?

2012-07-31 Thread Ian Piper
to eliminate. It seems to me that your regex does not eliminate the type of urls you specified. Alex. -Original Message- From: Ian Piper ianpi...@tellura.co.uk To: user user@nutch.apache.org Sent: Mon, Jul 30, 2012 1:52 pm Subject: Re: Why won't my crawl ignore these urls? Hi

Re: Why won't my crawl ignore these urls?

2012-07-31 Thread Ian Piper
le for your purposes, let us know; I'm sure someone could help with the specific regexes. On Mon, Jul 30, 2012 at 12:24 PM, Ian Piper ianpi...@tellura.co.uk wrote: Hi all, I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan... I have a

Re: Why won't my crawl ignore these urls?

2012-07-30 Thread Ian Piper
to appear in the final Solr index. So clearly they are not being excluded. Is anyone able to explain what I have missed? Any guidance much appreciated. Thanks, Ian. -- Dr Ian Piper Tellura Information Services - the web, document and information people Registered in England and Wales
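
A likely cause of "excluded" URLs still reaching the index is rule order: Nutch's regex-urlfilter.txt is evaluated top to bottom and the first matching rule wins, so an exclude (`-`) line placed after the catch-all `+.` rule never fires. A minimal sketch of that first-match-wins behaviour, simulated with grep (the tiny `filter_url` helper and the sample patterns/URL are illustrative, not Nutch code):

```shell
# Simulate regex-urlfilter.txt evaluation: rules are "sign:pattern" strings,
# checked top to bottom; the first pattern that matches decides (+ keep, - drop).
filter_url() {
  url=$1; shift
  for rule in "$@"; do
    sign=${rule%%:*}; pattern=${rule#*:}
    if echo "$url" | grep -Eq "$pattern"; then
      echo "$sign"; return
    fi
  done
  echo "-"   # no rule matched: reject by default, as Nutch's regex filter does
}

# Exclude rule AFTER the catch-all "+.": the URL is accepted anyway.
filter_url 'http://example.com/private/page.html' '+:.' '-:/private/'
# Exclude rule BEFORE the catch-all: the URL is rejected as intended.
filter_url 'http://example.com/private/page.html' '-:/private/' '+:.'
```

So when an exclude pattern "does not work", the first thing to check is whether it appears above or below the final `+.` line.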

Re: Good workflow for a regular re-indexing job

2012-04-24 Thread Ian Piper
-by-step description in layman's language. Thanks anyway. Ian. -- On 23 Apr 2012, at 23:57, remi tassing wrote: Have you read this? http://wiki.apache.org/nutch/NutchTutorial/ You can put all commands in a shell script Remi On Monday, April 23, 2012, Ian Piper wrote: Hi all, I have
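
The tutorial steps Remi points to can be strung together into one script for a regular re-indexing job. A hedged sketch for Nutch 1.x of that era (the `crawl/` directory layout, `urls` seed directory, `-topN` value, and Solr URL are placeholders, not details from the original thread):

```shell
#!/bin/sh
# Illustrative Nutch 1.x re-crawl cycle; adjust paths and the Solr URL to taste.
set -e
NUTCH=bin/nutch
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
LINKDB=crawl/linkdb
SOLR=http://localhost:8983/solr

$NUTCH inject $CRAWLDB urls                      # seed (or re-seed) the crawldb
$NUTCH generate $CRAWLDB $SEGMENTS -topN 5000    # pick the next batch of URLs
SEGMENT=$(ls -d $SEGMENTS/* | tail -1)           # newest segment just generated
$NUTCH fetch $SEGMENT
$NUTCH parse $SEGMENT
$NUTCH updatedb $CRAWLDB $SEGMENT                # fold fetch results back in
$NUTCH invertlinks $LINKDB -dir $SEGMENTS        # build/refresh the linkdb
$NUTCH solrindex $SOLR $CRAWLDB -linkdb $LINKDB $SEGMENT
```

Run from cron, this gives the regular re-indexing workflow the thread is after; repeating the generate/fetch/parse/updatedb steps in a loop approximates the `-depth` behaviour of the one-shot `crawl` command.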

Re: Following .axd urls

2012-01-24 Thread Ian Piper
Hi Julien, Thanks for the message. I think you have found part of the problem - I have this in regex-urlfilter.txt # skip URLs containing certain characters as probable queries, etc. -[?*!@=] I will try modifying this and re-running the crawl. Ian. -- On 23 Jan 2012, at 16:04, Julien
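
That stock `-[?*!@=]` line drops any URL containing one of those characters, which is exactly why query-string URLs (including the .axd ones) vanish. A quick sketch of the effect using grep (the sample URL is hypothetical):

```shell
# The bracket expression [?*!@=] matches any one of ? * ! @ = -- the same
# character class used by the default skip rule in regex-urlfilter.txt.
url='http://example.com/WebResource.axd?d=abc123'
if echo "$url" | grep -q '[?*!@=]'; then
  echo "filtered out"   # the '?' in the query string triggers the rule
else
  echo "kept"
fi
```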

Re: Following .axd urls

2012-01-24 Thread Ian Piper
Hi Lewis, Thanks for the reply. I'm using a fetch depth of 10 (which I thought would be ample - this is not a deep site hierarchy). Here is the command I'm running: bin/nutch crawl urls -solr [solrurl] -depth 10 -topN 5000 On 23 Jan 2012, at 16:02, Lewis John Mcgibbney wrote: Hi Ian,


Re: Following .axd urls

2012-01-24 Thread Ian Piper
On 23 Jan 2012, at 16:04, Julien Nioche wrote: check your URL filter : the link above contains a '?' which by default would get the URL to be filtered out That was definitely the problem. Nutch is happily fetching those documents now! Thanks very much for your help. Ian. --
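
One way to apply Julien's fix without opening the filter up to every query URL is to add a narrow accept rule above the generic skip rule. A hedged sketch of such a regex-urlfilter.txt edit (the `.axd` pattern is illustrative; remember that the first matching rule wins):

```
# Accept .axd resources even when they carry a query string...
+\.axd\?
# ...then keep the stock rule that skips other query-style URLs.
-[?*!@=]
```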

Following .axd urls

2012-01-23 Thread Ian Piper
Hi all, I'd appreciate some guidance... I can't seem to find much useful material on the web about this. I have set up a Nutch and Solr service that is crawling a client's site. They have a lot of pages that are accessed with urls like this: