Hosting segments in NDFS

2006-02-03 Thread Chris Schneider
Gang, would it be possible to modify Nutch so that a set of search servers each had a local index, with each index referring to segments living in NDFS? Doing so would allow us to skip exporting the segments from NDFS to the local FS. Of course, it would be ideal to keep the crawling machi

RE: takes too long to remove a page from WEBDB

2006-02-03 Thread Fuad Efendi
We have the following code in org.apache.nutch.parse.ParseOutputFormat.java: ... [94]toUrl = urlNormalizer.normalize(toUrl); [95]toUrl = URLFilters.filter(toUrl); ... It normalizes the URL, then filters the normalized URL, then writes it to /crawl_parse. In some cases the normalized URL is not the same as the raw URL,
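
A minimal sketch of why the order matters, with stand-ins for the real UrlNormalizer and URLFilters classes (the rewrite rule and filter pattern below are illustrative assumptions, not Nutch's actual defaults):

    public class NormalizeThenFilter {
        // Stand-in for UrlNormalizer.normalize(): here it just lowercases the URL.
        static String normalize(String url) {
            return url.toLowerCase();
        }
        // Stand-in for URLFilters.filter(): returns null to reject a URL.
        static String filter(String url) {
            return url.startsWith("http://forbidden.example.com/") ? null : url;
        }
        public static void main(String[] args) {
            String raw = "HTTP://FORBIDDEN.EXAMPLE.COM/page.html";
            // A filter run on the raw URL would miss this page; run on the
            // normalized form (as ParseOutputFormat does), it rejects it.
            String toUrl = normalize(raw);  // line [94]
            toUrl = filter(toUrl);          // line [95]
            System.out.println(toUrl);      // prints "null": the link is dropped
        }
    }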

RE: takes too long to remove a page from WEBDB

2006-02-03 Thread Fuad Efendi
It will also be generated if a non-filtered page sends a redirect to another page (which should be filtered)... I have the same problem in my modified DOMContentUtils.java: ... if (url.getHost().equals(base.getHost())) { outlinks.add(..); } ... It doesn't help; I still see some URLs fro
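
A sketch of a slightly more defensive same-host check (case-insensitive and null-safe); note that, as described above, it still cannot catch server-side redirects, which never pass through outlink extraction and must be caught by the URL filters at fetch time:

    import java.net.URL;

    public class SameHostCheck {
        // Hosts compare case-insensitively; getHost() can be null for odd URLs.
        static boolean sameHost(URL url, URL base) {
            String h = url.getHost();
            return h != null && h.equalsIgnoreCase(base.getHost());
        }
        public static void main(String[] args) throws Exception {
            URL base = new URL("http://www.Example.com/");
            URL out  = new URL("http://WWW.EXAMPLE.COM/a.html");
            System.out.println(sameHost(out, base)); // true despite case difference
        }
    }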

new release doesn't have nutch-daemon.sh?

2006-02-03 Thread Mike Smith
Hi, how come the recent release on SVN doesn't have nutch-daemon.sh or the other batch files? Thanks, Mike

malformed URL

2006-02-03 Thread Sunnyvale Fl
I crawled some internal sites and found that URLs with '<' and '>' characters are fetched and indexed, though these are usually just bad links. I'd like Nutch to throw a malformed URL error, as it does for '[', whitespace, and some other characters. I know I can have '<' and '>' escaped in the r
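
A hypothetical pre-check in the spirit of what is being asked for, rejecting characters that RFC 2396 forbids in URLs; the exact character set below is an assumption for illustration:

    import java.net.MalformedURLException;

    public class StrictUrlCheck {
        // Characters RFC 2396 excludes from URLs (plus whitespace).
        private static final String ILLEGAL = "<>\"{}|\\^`";

        static String check(String url) throws MalformedURLException {
            for (char c : url.toCharArray()) {
                if (ILLEGAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                    throw new MalformedURLException("illegal character in " + url);
                }
            }
            return url;
        }
        public static void main(String[] args) throws Exception {
            System.out.println(check("http://example.com/ok.html"));
            check("http://example.com/<bad>.html"); // throws MalformedURLException
        }
    }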

Re: Error at end of MapReduce run with indexing

2006-02-03 Thread Ken Krugler
Hi Ken, > 4. Any idea whether 4 hours is a reasonable amount of time for this test? It seemed long to me, given that I was starting with a single URL as the seed. > How many crawl passes did you do? Three deep, as in: bin/nutch crawl seeds -depth 3. This was the same as Doug

Re: takes too long to remove a page from WEBDB

2006-02-03 Thread Keren Yu
Hi Stefan, As I understand it, when you use 'nutch generate' to generate the fetch list, it doesn't call the URL filters; only 'nutch updatedb' and 'nutch fetch' do. So the page will be generated again after 30 days even if you use a URL filter to filter it. Best regards, Keren --- Stefan Gros

Re: takes too long to remove a page from WEBDB

2006-02-03 Thread Stefan Groschupf
Not if you filter it in the URL filter. There is a database-based URL filter somewhere in Jira, I think; it can help with filtering larger lists of URLs. On 03.02.2006 at 21:35 Keren Yu wrote: Hi Stefan, Thank you. You are right. I have to use a url filter and remove it from the i
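
The Jira issue isn't identified here, but a minimal sketch of the idea: a filter backed by a set loaded from a large blacklist file, following the URLFilter convention of returning the URL to keep it and null to drop it (the one-URL-per-line file format is an assumption):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    public class BlacklistFilter {
        private final Set<String> blacklist = new HashSet<String>();

        // Load the blacklist once; lookups are then O(1) per URL.
        public BlacklistFilter(String path) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(path));
            String line;
            while ((line = in.readLine()) != null) {
                blacklist.add(line.trim());
            }
            in.close();
        }
        // Same convention as org.apache.nutch.net.URLFilter.filter(String).
        public String filter(String url) {
            return blacklist.contains(url) ? null : url;
        }
        public static void main(String[] args) throws IOException {
            BlacklistFilter f = new BlacklistFilter(args[0]);
            System.out.println(f.filter(args[1])); // null means filtered out
        }
    }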

Re: takes too long to remove a page from WEBDB

2006-02-03 Thread Keren Yu
Hi Stefan, Thank you. You are right. I have to use a URL filter and remove the page from the index. But 30 days later, the page will be generated again when generating the fetch list. Thanks, Keren --- Stefan Groschupf <[EMAIL PROTECTED]> wrote: > And also it makes no sense, since it will come back >

Re: takes too long to remove a page from WEBDB

2006-02-03 Thread Stefan Groschupf
Also, it makes no sense, since the page will come back as soon as a link to it is found on some page. Use a URL filter instead and remove the page from the index; removing it from the WebDB makes no sense. On 03.02.2006 at 21:27 Keren Yu wrote: Hi everyone, It took about 10 minutes to remove a page from WEBDB usin
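
A sketch of the index half of that advice, deleting one page from the Lucene index by URL. It assumes the Lucene 1.4-era API (IndexReader.delete(Term), renamed deleteDocuments in later versions), that the index lives at crawl/index, and that the page URL is stored in a field named "url":

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class DeleteByUrl {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("crawl/index");
            // delete(Term) removes every document whose "url" field matches.
            int n = reader.delete(new Term("url", "http://example.com/old.html"));
            reader.close();
            System.out.println("deleted " + n + " document(s)");
        }
    }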

takes too long to remove a page from WEBDB

2006-02-03 Thread Keren Yu
Hi everyone, It took about 10 minutes to remove a page from the WEBDB using WebDBWriter. Does anyone know a faster method to remove a page? Thanks, Keren

Re: crawler

2006-02-03 Thread Poettgen
OK, JavaScript seems to be one problem. Thank you, Andrzej. I activated the JavaScript parser and some more pages are being indexed, but the entries of the left menu are still missing. Is there another solution besides building a sitemap? Andrzej Bialecki <[EMAIL PROTECTED]> wrote on 03.02.2006 16:15:

Re: No score explanation for non-english characters

2006-02-03 Thread Sami Siren
Erik J wrote: I'm using Apache 2.0.55, but I don't think that the problem is in the web server. As I mentioned previously, all characters (including åäö) are displayed correctly. I think the problem is that Nutch simply doesn't calculate a score for these words. Just so that I understand you

Re: Which version of rss does parse-rss plugin support?

2006-02-03 Thread Chris Mattmann
Hi there, Sure it will, you just have to configure it to do that. Pop over to $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there is an attribute called "pathSuffix". Change that to handle whatever type of rss file you want to crawl. That will work locally. For web-based craw

Re: Which version of rss does parse-rss plugin support?

2006-02-03 Thread 盖世豪侠
Hi Chris, RSS 1.0 files have a suffix of .rdf, so will the parser automatically recognize them as RSS files? On 2006-2-3, Chris Mattmann <[EMAIL PROTECTED]> wrote: > > Hi there, > > parse-rss is based on commons-feedparser > (http://jakarta.apache.org/commons/sandbox/feedparser). From the > feedp

Re: crawler

2006-02-03 Thread Andrzej Bialecki
mos wrote: The problem at www.gildemeister.com is the use of JavaScript for link generation. That's why Nutch can't find the other pages (the links are invisible). Two ideas: - You need something like a sitemap that links to the other main pages. If it's not available right now, you sh

Re: crawler

2006-02-03 Thread Stefan Groschupf
There is already a JavaScript parser; you only need to switch it on. On 03.02.2006 at 15:55 mos wrote: The problem at www.gildemeister.com is the use of JavaScript for link generation. That's why Nutch can't find the other pages (the links are invisible). Two ideas: - You need som

Re: Which version of rss does parse-rss plugin support?

2006-02-03 Thread Chris Mattmann
Hi there, parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website: "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.

Re: crawler

2006-02-03 Thread mos
The problem at www.gildemeister.com is the use of JavaScript for link generation. That's why Nutch can't find the other pages (the links are invisible). Two ideas: - You need something like a sitemap that links to the other main pages. If it's not available right now, you should try to g

Which version of rss does parse-rss plugin support?

2006-02-03 Thread 盖世豪侠
I see the test file is of version 0.91. Does the plugin support higher versions like 1.0 or 2.0?

Re: crawler

2006-02-03 Thread Stefan Groschupf
Check the regex URL filter! Your pages contain symbols that are filtered. On 03.02.2006 at 14:46 [EMAIL PROTECTED] wrote: Hello, I have problems indexing a particular internet site: http://www.gildemeister.com Nutch only fetches 14 pages but not the complete site. I'm using the default param
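
For reference, the era's default regex-urlfilter.txt contains a rule like -[?*!@=] that drops any URL containing those symbols, which hits dynamically generated pages. A small demonstration of how such a rule matches (the example URL is hypothetical):

    import java.util.regex.Pattern;

    public class SkipSymbolsDemo {
        public static void main(String[] args) {
            // The character class from the default "-[?*!@=]" skip rule.
            Pattern skip = Pattern.compile("[?*!@=]");
            String url = "http://www.gildemeister.com/index.php?id=42";
            if (skip.matcher(url).find()) {
                System.out.println("filtered out: " + url); // '?' and '=' match
            }
        }
    }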

crawler

2006-02-03 Thread Poettgen
Hello, I have problems indexing a particular internet site: http://www.gildemeister.com Nutch only fetches 14 pages but not the complete site. I'm using the default parameters and the intranet crawl command. I get no errors or anything. Can someone try to index the site and send me a hint? Or a con

Re: Updating the search index

2006-02-03 Thread Byron Miller
With all of the discussions of killing/restarting/pooling the NutchBean, has anyone noticed that you push your luck in doing so? I often get failed GCs, out-of-memory errors, and the like when trying to do anything but a clean shutdown. I'm moving to a 64-bit JVM and Java 1.5, so I'll let you know i

How to crawl only a specific type of files?

2006-02-03 Thread 盖世豪侠
Nutch always crawls from a parsed file to the URLs contained in that file. However, if we want to crawl only a specific type of file (e.g. RSS files), there may be some difficulties: the links to the real RSS files are usually contained in HTML entry pages, so there are no direct URLs from
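
One way around this, sketched below, is a suffix-based filter that keeps HTML entry pages for link discovery as well as the feed files themselves, leaving any RSS-only restriction to indexing time; the suffix lists are illustrative assumptions:

    public class FeedAwareFilter {
        private static final String[] KEEP = {
            ".html", ".htm", "/",          // entry pages: needed for discovery
            ".rss", ".rdf", ".xml"         // the feed files we actually want
        };
        // URLFilter convention: return the URL to keep it, null to drop it.
        public String filter(String url) {
            String u = url.toLowerCase();
            for (String suffix : KEEP) {
                if (u.endsWith(suffix)) return url;
            }
            return null;
        }
        public static void main(String[] args) {
            FeedAwareFilter f = new FeedAwareFilter();
            System.out.println(f.filter("http://example.com/news.rdf")); // kept
            System.out.println(f.filter("http://example.com/logo.gif")); // null
        }
    }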

Updating with Last-Modified-Since header

2006-02-03 Thread Nutch developer
Hello, just one question regarding updating the content of a crawled index. Usually you set the "db.default.fetch.interval" property to adjust when a page should be refetched. Then you do a generate/fetch/updatedb, and all pages that are older than the specified interval are crawled a
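
For reference, the HTTP mechanism in question is the If-Modified-Since request header. A minimal sketch with plain java.net, independent of Nutch's protocol plugins (the URL and the 30-day interval are placeholders):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalGet {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/page.html");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Pretend the last fetch was 30 days ago.
            long lastFetch = System.currentTimeMillis() - 30L * 24 * 3600 * 1000;
            conn.setIfModifiedSince(lastFetch); // sends If-Modified-Since header
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                System.out.println("304: content unchanged, skip re-parsing");
            } else {
                System.out.println("200: fetch and re-index the new content");
            }
        }
    }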

Re: Updating the search index

2006-02-03 Thread Raghavendra Prabhu
With respect to updating, I had also suggested another method where we control NutchBean instantiation, but I introduced it in the form of object pooling. This pool will take care of re-instantiating the NutchBean and returning the reference to it. The pool can have a text file as an input which chan
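
A sketch of that pooling idea: a holder that re-creates the bean when a flag file changes. Object stands in for NutchBean here, and the flag-file protocol is an assumption based on the description above:

    import java.io.File;

    public class BeanPool {
        private final File flag;
        private long lastSeen;
        private Object bean;            // would be a NutchBean in practice

        public BeanPool(File flag) {
            this.flag = flag;
        }
        // Re-instantiate the bean whenever the flag file's timestamp changes.
        public synchronized Object get() {
            if (bean == null || flag.lastModified() != lastSeen) {
                lastSeen = flag.lastModified();
                bean = createBean();
            }
            return bean;
        }
        private Object createBean() {
            return new Object();        // placeholder for new NutchBean(...)
        }
    }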

Re: No score explanation for non-english characters

2006-02-03 Thread Erik J
Ok, thanks! /Erik From: Andrzej Bialecki <[EMAIL PROTECTED]> Reply-To: nutch-user@lucene.apache.org To: nutch-user@lucene.apache.org Subject: Re: No score explanation for non-english characters Date: Fri, 03 Feb 2006 09:53:39 +0100 Erik J wrote: I'm using Apache 2.0.55, but I don't think that

Re: No score explanation for non-english characters

2006-02-03 Thread Andrzej Bialecki
Erik J wrote: I'm using Apache 2.0.55, but I don't think that the problem is in the web server. As I mentioned previously, all characters (including åäö) are displayed correctly. I think the problem is that Nutch simply doesn't calculate a score for these words. No. The problem is in the sear
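
Andrzej's diagnosis is cut off above, but a common fix of that era for mangled non-ASCII query terms was to force UTF-8 decoding on the search webapp's requests before the query string is read; whether that is the issue here is an assumption:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class Utf8Filter implements Filter {
        public void init(FilterConfig cfg) {}
        public void doFilter(ServletRequest req, ServletResponse res,
                             FilterChain chain) throws IOException, ServletException {
            req.setCharacterEncoding("UTF-8"); // decode åäö etc. correctly
            chain.doFilter(req, res);
        }
        public void destroy() {}
    }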

Wrong 'Next Fetch' Date

2006-02-03 Thread mos
Hello, just a few days ago we started to use Nutch (0.7.1). It's really nice and I would like to see it evolve. Here's my issue/question: while fetching our URLs, we got some errors like this: 60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed with: java.lang.Exceptio

Re: No score explanation for non-english characters

2006-02-03 Thread Erik J
I'm using Apache 2.0.55, but I don't think that the problem is in the web server. As I mentioned previously, all characters (including åäö) are displayed correctly. I think the problem is that Nutch simply doesn't calculate a score for these words. Just so that I understand you correctly: you