Re: Problems with indexing sub-section of a site

2008-05-24 Thread Eric J. Christeson
On Thu, May 22, 2008 at 07:46:16PM -0700, foobar3001 wrote:
 
> Hello!
>
> In short:
>
> Is it possible to tell Nutch to follow the links through one larger
> namespace, but only index (add to its database) the content of links that
> are in a sub-namespace of that?
>
> The background:
>
> I have started to experiment with crawling my blog with Nutch. The problem
> is that this blog doesn't have its own domain. Instead, it is hosted on a
> larger site, which also hosts discussion forums and other people's blogs.
>
> My URL there is http://www.geekzone.co.nz/foobar, so naturally I thought
> that adding something in the crawl-urlfilter.txt file would help. Something
> like this:
>
>   +^http://([a-z0-9]*\.)*geekzone.co.nz/foobar
>
> But look at the bottom of that page: The navigation links to the other pages
> in my blog - or to the 'next' page - actually lead out of my namespace. Thus,
> they are not being picked up anymore and Nutch never sees the additional
> links that I have on those other pages.
>
> Since eventually I would like this to be a bit more generic (I don't want
> anything specific for my blog, that's just a test case), I thought that
> maybe I have to open it up to the root URL, making the filter something like
> this:
>
>   +^http://([a-z0-9]*\.)*geekzone.co.nz
>
> But then it picks up a ton of other stuff that I am not interested in having
> in my database.
>
> So, now I'm wondering whether it is possible to tell Nutch to follow links
> through one namespace, but only add those pages into its index database that
> are in a specific sub-namespace of the first one?

Did a quick scan of the page in question, and I noticed the urls are of
this form:
http://www.geekzone.co.nz/blog.asp?blogid=207

Could you filter like this?

+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207

You'll have to comment out the default rule that skips URLs containing '?',
or put this rule before it, since the first matching rule decides.
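
To illustrate why the order matters, here is a minimal sketch in plain Python (not Nutch code) of first-match-wins filtering. It assumes the default rule looks like `-[?*!@=]` and that a URL matching no rule is ignored; the rule lists and the `accepts` helper are hypothetical names made up for this example, and the dots are escaped here for strictness.

```python
import re

# Hypothetical rule list in the style of crawl-urlfilter.txt: each entry is a
# sign ('+' accept, '-' reject) followed by a regular expression.
RULES_BLOG_FIRST = [
    r"+^http://([a-z0-9]*\.)*geekzone\.co\.nz/blog\.asp\?blogid=207",
    r"-[?*!@=]",  # assumed form of the default rule that skips query-like URLs
    r"-.",        # reject everything else
]

# Same rules, but with the '?' rule ahead of the blog rule.
RULES_QUERY_FIRST = [
    RULES_BLOG_FIRST[1],
    RULES_BLOG_FIRST[0],
    RULES_BLOG_FIRST[2],
]


def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: ignore the URL


url = "http://www.geekzone.co.nz/blog.asp?blogid=207"
print(accepts(RULES_BLOG_FIRST, url))   # True
print(accepts(RULES_QUERY_FIRST, url))  # False
```

With the blog rule first the URL is accepted; with the '?' rule first it is rejected before the blog rule is ever consulted.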

Maybe there's something I'm missing, though.

Eric



Re: Problems with indexing sub-section of a site

2008-05-24 Thread foobar3001



Eric J. Christeson-2 wrote:
 
> On Thu, May 22, 2008 at 07:46:16PM -0700, foobar3001 wrote:
> Did a quick scan of the page in question, and I noticed the urls are of
> this form:
>   http://www.geekzone.co.nz/blog.asp?blogid=207
>
> Could you filter like
>
>   +^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207
 

Hello!

Thank you very much for the reply. Yes, I had noticed that as well,
but filtering on site-specific URLs like that is what I wanted to avoid.
I'm trying to find a generic solution, not something that's specific
to this (or any other) site.

Basically, I'd like to tell the Nutch crawler to follow links outside the
specified domain for a certain depth, to see whether they lead back to
pages that belong to the specified domain.
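
To make that idea concrete, here is a rough sketch in plain Python. It is not Nutch configuration or API, only an illustration of the behaviour I have in mind. The `IN_SCOPE` pattern, the `crawl` function, the `max_outside_hops` parameter and the toy `LINKS` graph are all made up for this example, and apart from my blog URL the pages in the graph are invented.

```python
import re
from collections import deque

# Hypothetical pattern for the namespace whose pages should end up in the index.
IN_SCOPE = re.compile(r"^http://www\.geekzone\.co\.nz/foobar")

# Toy link graph standing in for fetched pages (invented URLs).
LINKS = {
    "http://www.geekzone.co.nz/foobar": [
        "http://www.geekzone.co.nz/blog.asp?blogid=207&page=2",
        "http://www.geekzone.co.nz/forums.asp",
    ],
    "http://www.geekzone.co.nz/blog.asp?blogid=207&page=2": [
        "http://www.geekzone.co.nz/foobar/older-post",
    ],
    "http://www.geekzone.co.nz/forums.asp": [
        "http://www.geekzone.co.nz/forums.asp?topic=1",
    ],
    "http://www.geekzone.co.nz/forums.asp?topic=1": [
        "http://www.geekzone.co.nz/foobar/hidden",
    ],
}


def crawl(seed, links, max_outside_hops):
    """Return the in-scope URLs reachable from seed, allowing short detours.

    links maps a URL to the URLs found on that page. A detour may pass
    through at most max_outside_hops consecutive out-of-scope pages.
    """
    # Fewest consecutive out-of-scope hops with which each URL was reached.
    best = {seed: 0 if IN_SCOPE.search(seed) else 1}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for link in links.get(url, []):
            hops = 0 if IN_SCOPE.search(link) else best[url] + 1
            if hops > max_outside_hops:
                continue  # detour too long: do not follow this link
            if link not in best or hops < best[link]:
                best[link] = hops
                queue.append(link)
    # Only in-scope pages are "indexed"; detour pages are followed but dropped.
    return sorted(u for u in best if IN_SCOPE.search(u))


print(crawl("http://www.geekzone.co.nz/foobar", LINKS, 1))
```

With `max_outside_hops` set to 1 the toy crawl finds the older post behind the blog.asp page but gives up on the two-hop forum detour; with 2 it also reaches the page behind the forum.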

-- 
View this message in context: 
http://www.nabble.com/Problems-with-indexing-sub-section-of-a-site-tp17417650p17451041.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Problems with indexing sub-section of a site

2008-05-22 Thread foobar3001

Hello!

In short:

Is it possible to tell Nutch to follow the links through one larger
namespace, but only index (add to its database) the content of links that
are in a sub-namespace of that?

The background:

I have started to experiment with crawling my blog with Nutch. The problem
is that this blog doesn't have its own domain. Instead, it is hosted on a
larger site, which also hosts discussion forums and other people's blogs.

My URL there is http://www.geekzone.co.nz/foobar, so naturally I thought
that adding something in the crawl-urlfilter.txt file would help. Something
like this:

  +^http://([a-z0-9]*\.)*geekzone.co.nz/foobar

But look at the bottom of that page: The navigation links to the other pages
in my blog - or to the 'next' page - actually lead out of my namespace. Thus,
they are not being picked up anymore and Nutch never sees the additional
links that I have on those other pages.

Since eventually I would like this to be a bit more generic (I don't want
anything specific for my blog, that's just a test case), I thought that
maybe I have to open it up to the root URL, making the filter something like
this:

  +^http://([a-z0-9]*\.)*geekzone.co.nz

But then it picks up a ton of other stuff that I am not interested in having
in my database.
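
The trade-off between the two filters can be checked with a small Python sketch (regex matching only, no Nutch involved). The navigation URL has the blog.asp form mentioned elsewhere in this thread; the forum URL and the variable names are made up for illustration.

```python
import re

# The two patterns from above, without the leading '+' sign.
NARROW = re.compile(r"^http://([a-z0-9]*\.)*geekzone.co.nz/foobar")
BROAD = re.compile(r"^http://([a-z0-9]*\.)*geekzone.co.nz")

blog_home = "http://www.geekzone.co.nz/foobar"
nav_link = "http://www.geekzone.co.nz/blog.asp?blogid=207"
forum = "http://www.geekzone.co.nz/forums.asp"  # hypothetical forum page

for name, pattern in (("narrow", NARROW), ("broad", BROAD)):
    for url in (blog_home, nav_link, forum):
        # Report whether the pattern is found in the URL.
        print(name, url, bool(pattern.search(url)))
```

The narrow pattern matches only the blog home page and so loses the navigation link, while the broad one matches all three, including the forum page that should stay out of the database.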

So, now I'm wondering whether it is possible to tell Nutch to follow links
through one namespace, but only add those pages into its index database that
are in a specific sub-namespace of the first one?

Thank you very much...