Gang,
Would it be possible to modify Nutch so that a set of search servers
each had a local index, but that this index referred to segments
living in NDFS? Doing so would allow us to skip exporting the
segments from NDFS to the local FS. Of course, it would be ideal to
keep the crawling machi
We have the following code:
org.apache.nutch.parse.ParseOutputFormat.java
...
toUrl = urlNormalizer.normalize(toUrl);
toUrl = URLFilters.filter(toUrl);
...
It normalizes the URL, then filters the normalized URL, then writes it to /crawl_parse.
In some cases the normalized URL is not the same as the raw URL.
An entry will also be generated when a non-filtered page sends a "Send Redirect"
to another page (one that should have been filtered)...
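For reference, a hedged sketch of that combined step, factored out so the same
normalize-then-filter call could also be applied to redirect targets before they
are written (the package and factory names are my assumptions about this Nutch
version; only the two calls above are from the actual source):

  import org.apache.nutch.net.URLFilters;
  import org.apache.nutch.net.UrlNormalizerFactory;

  public class UrlCheck {
    // Normalize first, then filter the normalized form, exactly as the
    // quoted lines do. A null return means a filter rejected the URL.
    public static String normalizeAndFilter(String url) throws Exception {
      url = UrlNormalizerFactory.getNormalizer().normalize(url);
      return URLFilters.filter(url);
    }
  }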
I have the same problem in my modified DOMContentUtils.java,
...
if (url.getHost().equals(base.getHost())) { outlinks.add(..); }
...
- it doesn't help; I see some URLs fro
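One thing worth checking (a hedged sketch with a hypothetical helper, not the
actual DOMContentUtils code): java.net.URL.getHost() preserves case and can
return an empty string, so a plain equals() may let unexpected URLs through:

  import java.net.URL;

  public class HostCheck {
    // Compare hosts case-insensitively and skip outlinks whose host
    // cannot be determined, instead of relying on exact equals().
    static boolean sameHost(URL url, URL base) {
      String urlHost = url.getHost();
      String baseHost = base.getHost();
      return urlHost != null && baseHost != null
          && urlHost.length() > 0
          && urlHost.equalsIgnoreCase(baseHost);
    }
  }

The outlinks.add(..) would then run only when sameHost(url, base) is true.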
Hi,
How come the recent release on SVN doesn't have nutch-daemon.sh or the other
batch files?
Thanks, Milke
I crawled some internal sites and found that URLs with '<' and '>'
characters are fetched and indexed, while these are usually just bad links.
I'd like to have Nutch throw a malformed URL error, as it does for '[',
whitespace, and some other characters. I know I can have '<' and '>' escaped in
the r
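Until Nutch rejects these itself, a workaround is an exclusion rule in the URL
filter (a hedged example; the file is conf/regex-urlfilter.txt or
conf/crawl-urlfilter.txt depending on your setup):

  # skip URLs containing angle brackets, which are usually just bad links
  -[<>]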
Hi Ken,
> 4. Any idea whether 4 hours is a reasonable amount of time for this
> test? It seemed long to me, given that I was starting with a single
> URL as the seed.
>
How many crawl passes did you do ?
Three deep, as in: bin/nutch crawl seeds -depth 3
This was the same as Doug
Hi Stefan,
As I understand it, when you use 'nutch generate' to
generate the fetch list, it doesn't call the urlfilter. Only
'nutch updatedb' and 'nutch fetch' call the
urlfilter. So the page will be generated again after 30 days
even if you use a URL filter to filter it.
Best regards,
Keren
--- Stefan Gros
Not if you filter it in the URL filter.
There is a database-based URL filter somewhere in Jira, I think;
it can help with filtering larger lists of URLs.
On 03.02.2006 at 21:35, Keren Yu wrote:
Hi Stefan,
Thank you. You are right. I have to use a url filter
and remove it from the i
Hi Stefan,
Thank you. You are right. I have to use a URL filter
and remove it from the index. But 30 days later,
the page will be generated again in the fetch
list.
Thanks,
Keren
--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> And also it makes no sense, since it will come back
>
And also it makes no sense, since it will come back as soon as the link
is found on a page.
Use a url filter instead and remove it from the index.
Removing from webdb makes no sense.
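For the index side, a minimal sketch (assuming a Lucene index under
crawl/index, the Lucene 1.4-era delete(Term) API, and that the "url" field is
matchable as a single term; adjust to how your index was built):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;

  public class DeleteFromIndex {
    public static void main(String[] args) throws Exception {
      // Delete every document whose "url" term matches the given URL,
      // then close the reader to commit the deletions.
      IndexReader reader = IndexReader.open("crawl/index");
      reader.delete(new Term("url", args[0]));
      reader.close();
    }
  }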
On 03.02.2006 at 21:27, Keren Yu wrote:
Hi everyone,
It took about 10 minutes to remove a page from WEBDB
usin
Hi everyone,
It took about 10 minutes to remove a page from the WebDB
using WebDBWriter. Does anyone know a faster method to
remove a page?
Thanks,
Keren
OK, JavaScript seems to be one problem. Thank you Andrzej.
I activated the JavaScript parser and some more pages are being indexed. But
the entries of the left menu are missing.
Is there another solution besides building a 'sitemap'?
Andrzej Bialecki <[EMAIL PROTECTED]> wrote on 03.02.2006 16:15:
Erik J wrote:
I'm using Apache 2.0.55, but I don't think that the problem is in the
web server. As I mentioned previously, all characters (including åäö)
are displayed correctly. I think the problem is that Nutch simply
doesn't calculate a score for these words.
Just so that I understand you
Hi there,
Sure it will, you just have to configure it to do that. Pop over to
$NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there you'll find
an attribute called "pathSuffix". Change that to handle whatever type of RSS
file you want to crawl. That will work locally. For web-based craw
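For RSS 1.0 / rdf files that would look something like this (a hedged example;
the exact element layout in plugin.xml may differ between Nutch versions):

  <implementation id="org.apache.nutch.parse.rss.RSSParser"
                  class="org.apache.nutch.parse.rss.RSSParser"
                  contentType="application/rss+xml"
                  pathSuffix="rdf"/>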
Hi Chris,
RSS 1.0 files have a suffix of rdf. So will the parser recognize them
automatically as RSS files?
On 2006-2-3, Chris Mattmann <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
> parse-rss is based on commons-feedparser
> (http://jakarta.apache.org/commons/sandbox/feedparser). From the
> feedp
mos wrote:
The problem at www.gildemeister.com is the use of JavaScript for link
generation.
That's the reason why nutch can't find the other pages (the links are
invisible).
Two ideas:
- You need something like a sitemap that links to the other main pages.
If it's not available
right now, you sh
There is already a java script parser, you only need to switch it on.
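Concretely, add parse-js to plugin.includes in conf/nutch-site.xml (a hedged
example; the rest of the value shown is the Nutch 0.7 default and may differ in
your version):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
  </property>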
On 03.02.2006 at 15:55, mos wrote:
The problem at www.gildemeister.com is the use of JavaScript for link
generation.
That's the reason why nutch can't find the other pages (the links are
invisible).
Two ideas:
- You need som
Hi there,
parse-rss is based on commons-feedparser
(http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser
website:
"...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0,
and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension
and RSS 1.
The problem at www.gildemeister.com is the use of JavaScript for link
generation.
That's the reason why nutch can't find the other pages (the links are
invisible).
Two ideas:
- You need something like a sitemap that links to the other main pages.
If it's not available
right now, you should try to g
I see the test file is of version 0.91.
Does the plugin support higher versions like 1.0 or 2.0?
Check the regex URL filter!
Your page contains symbols that are filtered.
On 03.02.2006 at 14:46, [EMAIL PROTECTED] wrote:
Hello,
I have problems indexing a special internet site:
http://www.gildemeister.com
Nutch only fetches 14 pages but not the complete site.
I'm using the default param
Hello,
I have problems indexing a special internet site:
http://www.gildemeister.com
Nutch only fetches 14 pages but not the complete site.
I'm using the default parameters and the intranet crawl command.
I get no errors at all. Can someone try to index the site and send me a
hint?
Or an con
With all of the discussions of
killing/restarting/pooling the Nutch bean, has anyone
noticed that you push your luck in doing so?
I often get 'GC failed to collect', out-of-memory errors,
and such when trying to do anything but a clean
shutdown.
I'm moving to a 64-bit JVM and Java 1.5, so I'll let you
know i
Nutch always crawls from a parsed file to the URLs contained in the
file. However, if we want to crawl a specific type of file (e.g. RSS files),
there may be some difficulties. Since the links to real RSS files are always
contained in HTML/HTM entry files, there are no direct URLs from
Hello,
just one question regarding updating the content of a
crawled index.
Usually you set the "db.default.fetch.interval" property
to adjust the time after which a page should be refetched.
Then you do a generate/fetch/updatedb, and all pages
that are older than the specified interval are crawled a
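For reference, the property goes in conf/nutch-site.xml (a hedged example; 30
days is the shipped default):

  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
    <description>Default number of days between re-fetches of a page.</description>
  </property>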
With respect to updating, I had also suggested another method
where we control NutchBean instantiation,
but I introduced it in the form of object pooling.
This pool will take care of re-instantiating the Nutch bean and returning the
reference to it.
The pool can have a text file as an input which chan
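A minimal sketch of that pooling idea (a hypothetical class, not the code from
the earlier suggestion; it assumes the NutchBean(File) constructor of this era):

  import java.io.File;
  import java.io.IOException;
  import org.apache.nutch.searcher.NutchBean;

  public class NutchBeanPool {
    private final File indexDir;
    private NutchBean bean;

    public NutchBeanPool(File indexDir) throws IOException {
      this.indexDir = indexDir;
      this.bean = new NutchBean(indexDir);
    }

    // Hand out the current bean to searchers.
    public synchronized NutchBean get() {
      return bean;
    }

    // Re-instantiate the bean, e.g. when the watched input file changes.
    public synchronized void refresh() throws IOException {
      bean = new NutchBean(indexDir);
    }
  }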
Ok, thanks!
/Erik
From: Andrzej Bialecki <[EMAIL PROTECTED]>
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: No score explanation for non-english characters
Date: Fri, 03 Feb 2006 09:53:39 +0100
Erik J wrote:
I'm using Apache 2.0.55, but I don't think that
Erik J wrote:
I'm using Apache 2.0.55, but I don't think that the problem is in the
web server. As I mentioned previously, all characters (including åäö)
are displayed correctly. I think the problem is that Nutch simply
doesn't calculate a score for these words.
No. The problem is in the sear
Hello,
just a few days ago we started to use Nutch (0.7.1).
It's really nice and I would like to see it evolve.
Here's my issue/question:
While fetching our URLs, we got some errors like this:
60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html
failed with: java.lang.Exceptio
I'm using Apache 2.0.55, but I don't think that the problem is in the web
server. As I mentioned previously, all characters (including åäö) are
displayed correctly. I think the problem is that Nutch simply doesn't
calculate a score for these words.
Just so that I understand you correctly: you