Re: Problems with indexing sub-section of a site
On Thu, May 22, 2008 at 07:46:16PM -0700, foobar3001 wrote:

> Hello!
>
> In short: Is it possible to tell Nutch to follow the links through one larger namespace, but only index (add to its database) the content of links that are in a sub-namespace of that?
>
> The background: I have started to experiment with crawling my blog with Nutch. The problem is that this blog doesn't have its own domain. Instead, it is hosted on a larger site, which also hosts discussion forums and other people's blogs. My URL there is http://www.geekzone.co.nz/foobar, so naturally I thought that adding something in the crawl-urlfilter.txt file would help. Something like this:
>
> +^http://([a-z0-9]*\.)*geekzone.co.nz/foobar
>
> But look at the bottom of that page: The navigation links to the other pages in my blog - or to the 'next' page - actually lead out of my namespace. Thus, they are not being picked up anymore and Nutch never sees the additional links that I have on those other pages.
>
> Since eventually I would like this to be a bit more generic (I don't want anything specific for my blog, that's just a test case), I thought that maybe I have to open it up to the root URL, making the filter something like this:
>
> +^http://([a-z0-9]*\.)*geekzone.co.nz
>
> But then it picks up a ton of other stuff that I am not interested in having in my database.
>
> So, now I'm wondering whether it is possible to tell Nutch to follow links through one namespace, but only add those pages into its index database that are in a specific sub-namespace of the first one?

Did a quick scan of the page in question, and I noticed the URLs are of this form:

http://www.geekzone.co.nz/blog.asp?blogid=207

Could you filter like

+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207

You'll have to comment out the default ? killer or put this rule before it. Maybe there's something I'm missing, though.

Eric

-- 
Eric J. Christeson [EMAIL PROTECTED]
Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations. (For example, if you have four groups working on a compiler, you'll get a 4-pass compiler) - Conway's Law
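
A note on why the position of that rule matters, as a rough Python model rather than Nutch code. It assumes the filter file is read top to bottom, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is rejected. It also assumes the default "? killer" looks like -[?*!@=]. The names parse_rules and accepts are made up for this sketch.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The first character is the sign, the rest is the pattern.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

BLOG_RULE = r"+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207"
QUERY_KILLER = "-[?*!@=]"  # assumed form of the default "? killer" rule

url = "http://www.geekzone.co.nz/blog.asp?blogid=207"

killer_first = parse_rules(QUERY_KILLER + "\n" + BLOG_RULE)
blog_first = parse_rules(BLOG_RULE + "\n" + QUERY_KILLER)

print(accepts(killer_first, url))  # False: the "?" rule fires before the blog rule
print(accepts(blog_first, url))    # True: the blog rule is reached first
```

Under those assumptions, the blog rule only has an effect when it sits above the query-character rule, or when that rule is commented out.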
Re: Problems with indexing sub-section of a site
Eric J. Christeson-2 wrote:

> Did a quick scan of the page in question, and I noticed the URLs are of this form:
>
> http://www.geekzone.co.nz/blog.asp?blogid=207
>
> Could you filter like
>
> +^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207

Hello!

Thank you very much for the reply. Yes, I had noticed that as well, but filtering on site-specific URLs like that is what I wanted to avoid. I'm trying to find a generic solution, not something that's specific to this (or any other) site.

Basically, I want to tell the Nutch crawler to follow links outside the specified domain up to a certain depth, to see whether they lead back to pages belonging to the specified domain.

-- 
View this message in context: http://www.nabble.com/Problems-with-indexing-sub-section-of-a-site-tp17417650p17451041.html
Sent from the Nutch - User mailing list archive at Nabble.com.
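
To make the behaviour I'm after concrete, here is a small Python model of it. This is not a Nutch feature or API that I know of, just a sketch of the idea: one pattern decides what gets indexed, a wider pattern decides what may be followed, and a path is dropped once it has spent max_detour consecutive hops outside the indexed namespace. The function crawl, the link graph, and every URL in it other than /foobar and blog.asp?blogid=207 are hypothetical.

```python
import re
from collections import deque

def crawl(seed, links, index_re, follow_re, max_detour):
    """Breadth-first walk over `links` (url -> list of outlinks).

    Pages matching index_re are indexed. Pages matching only follow_re are
    read for their links but not indexed, and a path may pass through at
    most max_detour such pages in a row.
    """
    indexed = []
    best = {}                        # url -> smallest detour count it was expanded with
    queue = deque([(seed, 0)])
    while queue:
        url, detour = queue.popleft()
        in_index_scope = index_re.search(url) is not None
        if in_index_scope:
            detour = 0               # back in the wanted namespace: reset the budget
        elif follow_re.search(url) and detour < max_detour:
            detour += 1              # spend one hop of the detour budget
        else:
            continue                 # outside both scopes, or budget used up
        if url in best and best[url] <= detour:
            continue                 # already expanded with an equal or better budget
        if in_index_scope:
            indexed.append(url)
        best[url] = detour
        for target in links.get(url, []):
            queue.append((target, detour))
    return indexed

SITE = "http://www.geekzone.co.nz"
SEED = SITE + "/foobar"
INDEX_RE = re.compile(r"^http://([a-z0-9]*\.)*geekzone.co.nz/foobar")
FOLLOW_RE = re.compile(r"^http://([a-z0-9]*\.)*geekzone.co.nz")

# Toy link graph; paths below /foobar and /forums are invented for illustration.
LINKS = {
    SEED: [SITE + "/blog.asp?blogid=207", SITE + "/forums"],
    SITE + "/blog.asp?blogid=207": [SITE + "/foobar/older-post"],
    SITE + "/forums": [SITE + "/forums/thread1"],
    SITE + "/forums/thread1": [SITE + "/foobar/hidden"],
}

print(crawl(SEED, LINKS, INDEX_RE, FOLLOW_RE, max_detour=1))
# ['http://www.geekzone.co.nz/foobar', 'http://www.geekzone.co.nz/foobar/older-post']
```

With max_detour=0 only the seed page is indexed, and with max_detour=2 the page behind the forum thread is reached as well, so the budget controls how much unrelated material gets fetched on the way.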
Problems with indexing sub-section of a site
Hello!

In short: Is it possible to tell Nutch to follow the links through one larger namespace, but only index (add to its database) the content of links that are in a sub-namespace of that?

The background: I have started to experiment with crawling my blog with Nutch. The problem is that this blog doesn't have its own domain. Instead, it is hosted on a larger site, which also hosts discussion forums and other people's blogs. My URL there is http://www.geekzone.co.nz/foobar, so naturally I thought that adding something in the crawl-urlfilter.txt file would help. Something like this:

+^http://([a-z0-9]*\.)*geekzone.co.nz/foobar

But look at the bottom of that page: The navigation links to the other pages in my blog - or to the 'next' page - actually lead out of my namespace. Thus, they are not being picked up anymore and Nutch never sees the additional links that I have on those other pages.

Since eventually I would like this to be a bit more generic (I don't want anything specific for my blog, that's just a test case), I thought that maybe I have to open it up to the root URL, making the filter something like this:

+^http://([a-z0-9]*\.)*geekzone.co.nz

But then it picks up a ton of other stuff that I am not interested in having in my database.

So, now I'm wondering whether it is possible to tell Nutch to follow links through one namespace, but only add those pages into its index database that are in a specific sub-namespace of the first one?

Thank you very much...