For example:
<a href="javascript:customCss(6017162)" id="customCssMenu">test</a>
In fact, can nutch get content from such kinds of urls?
Oh. Thank you very much.
On 06-4-14, Raghavendra Prabhu [EMAIL PROTECTED] wrote:
Hi Elwin
Just switch it to protocol-http in the conf file. (nutch-default.xml file)
If you don't want the threaded behavior, change the number of threads in the
configuration file. Keep the number of threads limited.
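A hedged sketch of both suggestions as nutch-site.xml overrides (the plugin list is an assumed 0.7-era default of plugin.includes with protocol-httpclient swapped out for protocol-http, and fetcher.threads.fetch is the thread-count property; verify both names against your nutch-default.xml):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Use protocol-http instead of protocol-httpclient; start from
  your own plugin list rather than copying this one.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>Keep the number of fetcher threads small.</description>
</property>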
In fact I'm not using the nutch fetcher; I just call HttpResponse in my own
code, which is not multi-threaded.
2006/4/13, Doug Cutting [EMAIL PROTECTED]:
Elwin wrote:
When I use httpclient.HttpResponse to get http content in nutch, I often get
SocketTimeoutExceptions.
Can I solve this problem by enlarging the value of http.timeout in the conf file?
Using protocol-http instead of protocol-httpclient seems to be fixing the problem.
I am not sure, but the above seemed to fix it.
Rgds
Prabhu
On 4/13/06, Elwin [EMAIL PROTECTED] wrote:
In fact I'm not using the nutch fetcher; I just call HttpResponse in my own
code, which is not multi-threaded.
When I use httpclient.HttpResponse to get http content in nutch, I often get
SocketTimeoutExceptions.
Can I solve this problem by enlarging the value of http.timeout in the conf file?
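For reference, a hedged sketch of such an override in nutch-site.xml (the 10000 ms default mentioned in the comment is an assumption from the 0.7-era defaults; check your nutch-default.xml):

<property>
  <name>http.timeout</name>
  <value>30000</value>
  <description>Network timeout in milliseconds (raised from the assumed
  default of 10000).</description>
</property>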
WebDBInjector injector = new WebDBInjector(dbWriter);
I dynamically use the injector to inject urls into a temp empty webdb. Then I
use Enumeration e = webdb.pages() to dump urls from that webdb, but it seems
that I get nothing.
Do I need to update the webdb after I inject urls? If so, how?
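A minimal sketch of the likely fix, assuming the 0.7 behavior that WebDBWriter batches edits and only applies them when it is closed. The class names and packages follow my recollection of the 0.7 tree (WebDBInjector, WebDBReader, pages(), Page), so verify them; addPage is a hypothetical stand-in for whatever injection call your code already makes:

import java.io.File;
import java.util.Enumeration;
import org.apache.nutch.db.*;                 // IWebDBWriter, IWebDBReader, WebDBReader, Page
import org.apache.nutch.fs.NutchFileSystem;
import org.apache.nutch.tools.WebDBInjector;

public class InjectThenDump {
    public static void dump(NutchFileSystem nfs, File dbDir, IWebDBWriter dbWriter,
                            String url) throws Exception {
        WebDBInjector injector = new WebDBInjector(dbWriter);
        injector.addPage(url);   // hypothetical call; use your actual injection method
        dbWriter.close();        // the batched edits only reach the webdb here

        // Open the reader only after the writer is closed, or pages() sees nothing.
        IWebDBReader reader = new WebDBReader(nfs, dbDir);
        for (Enumeration e = reader.pages(); e.hasMoreElements(); ) {
            Page page = (Page) e.nextElement();
            System.out.println(page.getURL());
        }
        reader.close();
    }
}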
When I read pages out of a webdb and printed the url of each page, I found
that two urls were exactly the same.
Is it possible to have two pages with the same url?
--
"The Final Combat" drew rave reviews and kept TVB's ratings riding high, yet
TVB, pleased as it was, still gave Stephen Chow no major roles. He was no fish
to stay in a pond: once his comic talent had shown itself, he would not settle
for neglect, so he turned to film and displayed his flair on the big screen.
TVB had gained a thousand-li horse and lost it, to its lasting regret.
Oh, I have asked a silly question about regex, hehe.
2006/2/23, Jack Tang [EMAIL PROTECTED]:
Hi
I think the url-filter uses "contains" rather than "match".
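A hedged illustration of that distinction with Jakarta ORO, the Perl5 regex library the 0.7 regex url-filter is built on (the filter's internals may differ; this only shows contains() vs matches()):

import org.apache.oro.text.regex.*;

public class FilterDemo {
    public static void main(String[] args) throws MalformedPatternException {
        Perl5Compiler compiler = new Perl5Compiler();
        Perl5Matcher matcher = new Perl5Matcher();
        Pattern p = compiler.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME");

        String url = "http://www.MY.DOMAIN.NAME/some/page.html";
        System.out.println(matcher.contains(url, p)); // true: pattern found in the input
        System.out.println(matcher.matches(url, p));  // false: the whole input would have to match
    }
}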
/Jack
On 2/23/06, Elwin [EMAIL PROTECTED] wrote:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME
Why does the url filter of nutch use Perl5 regular expressions? Any benefits?
</property>
Regards
Piotr
Guenter, Matthias wrote:
Hi Elwin
Did you check the content limit?
Otherwise the truncation occurs naturally, I guess
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.</description>
</property>
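If the content limit is the culprit, a nutch-site.xml override like this sketch is the usual fix (it assumes the convention in the description above, that a negative value disables truncation; confirm against your nutch-default.xml):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Disable content truncation (assumes a negative value means
  no limit).</description>
</property>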
No, I didn't try that. I just use the default parser for the plugin. It seems
that it works well now.
Thx.
2006/2/20, Andrzej Bialecki [EMAIL PROTECTED]:
Elwin wrote:
Yes, it's true, although it's not the cause of my problem.
Did you try to use the alternative HTML parser (TagSoup)?
I will try it. Many thanks.
2006/2/20, Andrzej Bialecki [EMAIL PROTECTED]:
Elwin wrote:
No, I didn't try that. I just use the default parser for the plugin. It seems
that it works well now.
Thx.
I often find TagSoup performing better than NekoHTML. In case of some
grave HTML
I think maybe you could add a mapping between these letters.
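A minimal sketch of that mapping in plain Java, with an illustrative (deliberately incomplete) accent table; the folding would have to run at both index time and query time so both sides meet on the same term:

// Fold a few accented Latin letters to ASCII so e.g. "Portégé" and
// "Portege" index and search as the same word. The table is illustrative,
// not complete.
public static String foldAccents(String s) {
    StringBuffer out = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        switch (c) {
            case 'á': case 'à': case 'â': case 'ä': out.append('a'); break;
            case 'é': case 'è': case 'ê': case 'ë': out.append('e'); break;
            case 'í': case 'ì': case 'î': case 'ï': out.append('i'); break;
            case 'ó': case 'ò': case 'ô': case 'ö': out.append('o'); break;
            case 'ú': case 'ù': case 'û': case 'ü': out.append('u'); break;
            default: out.append(c);
        }
    }
    return out.toString();
}

In Lucene terms the natural place for this would be a TokenFilter applied by the analyzer to documents and queries alike.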
2006/2/20, Franz Werfel [EMAIL PROTECTED]:
Hello,
Sorry, this is probably in the documentation somewhere, but I couldn't find it.
How do I index and search accented words without accents?
For example: Portégé (a model for Toshiba)
Hi Howie,
Thank you for the valuable suggestion. I will consider it carefully.
As I'm going to parse non-English (actually Chinese) pages, I think regular
expressions may not be very useful to me. I've decided to integrate some
simple data mining techniques to achieve it.
2006/2/19, Howie
Nutch crawls web pages from link to link by extracting outlinks from each page.
For example, we can check whether the link text contains keywords from a
dictionary to decide whether or not to crawl it (see the sketch below).
Moreover, we can check whether the content of a page fetched via an outlink
contains some keywords
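A hedged sketch of that dictionary check on link text, in plain Java. The keyword list is hypothetical, and since Nutch's URLFilter extension point only sees the URL string, anchor-text filtering like this would have to hook in where outlinks are extracted:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Decide whether to follow an outlink based on its anchor text.
public class AnchorKeywordFilter {
    private final Set<String> keywords =
        new HashSet<String>(Arrays.asList("news", "sports")); // hypothetical dictionary

    public boolean accept(String anchorText) {
        if (anchorText == null) return false;
        String lower = anchorText.toLowerCase();
        for (String kw : keywords) {
            if (lower.indexOf(kw) >= 0) {
                return true;  // crawl links whose text mentions a dictionary word
            }
        }
        return false;
    }
}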
contents and just see the html elements.
2006/2/17, Guenter, Matthias [EMAIL PROTECTED]:
Hi Elwin
Can you provide samples of not working links and code? And put it into
JIRA?
Kind regards
Matthias
-----Original Message-----
From: Elwin [mailto:[EMAIL PROTECTED]
Sent: Fri 17.02.2006
Hi Guenter
I think you are right. Although I haven't re-run the code, I have checked the
last url I got from that page, which is just in the middle of the page, so it
seems that the page has been truncated.
Many thanks!
On 06-2-17, Guenter, Matthias [EMAIL PROTECTED] wrote:
Hi Elwin
Did you check the content limit?
fExtensionPoints is a HashMap. What happens when two plugins extend the same
extension point, given the code fExtensionPoints.put(xpId, point)?
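For what it's worth, a sketch of how the registry can avoid a clash here, assuming the Eclipse-style design the Nutch plugin system follows (one ExtensionPoint object per id, with extensions accumulating on that object rather than in the map; the classes below are simplified stand-ins, not the real ones):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One map entry per extension-point id; each extending plugin adds to that
// entry's list, so a second plugin extending the same point overwrites nothing.
class ExtensionPoint {
    private final List<String> extensions = new ArrayList<String>();
    void addExtension(String extensionId) { extensions.add(extensionId); }
    List<String> getExtensions() { return extensions; }
}

public class RegistrySketch {
    public static void main(String[] args) {
        Map<String, ExtensionPoint> fExtensionPoints = new HashMap<String, ExtensionPoint>();
        fExtensionPoints.put("org.apache.nutch.net.URLFilter", new ExtensionPoint());

        // Two plugins extend the same point: both land in one list.
        ExtensionPoint point = fExtensionPoints.get("org.apache.nutch.net.URLFilter");
        point.addExtension("urlfilter-regex");
        point.addExtension("urlfilter-prefix");
        System.out.println(point.getExtensions()); // [urlfilter-regex, urlfilter-prefix]
    }
}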
Did you achieve it by extending nutch with a plugin?
I think it's possible to achieve it with a URLFilter plugin that filters rss
feed links.
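A minimal sketch of such a filter, assuming the 0.7 URLFilter interface (a single filter(String) method; return the url to keep it, null to drop it). The feed-detection heuristic is an assumption:

import org.apache.nutch.net.URLFilter;

// Keep only URLs that look like RSS feed links; drop everything else.
public class RSSOnlyURLFilter implements URLFilter {
    public String filter(String url) {
        String lower = url.toLowerCase();
        if (lower.endsWith(".rss") || lower.endsWith(".xml")
                || lower.indexOf("feed") >= 0) {
            return url;   // accept
        }
        return null;      // reject
    }
}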
2006/2/16, Hasan Diwan [EMAIL PROTECTED]:
Elwin:
On 13/02/06, Elwin [EMAIL PROTECTED] wrote:
Do you use a fixed set of rss feeds for the crawl, or discover rss feeds dynamically?
Hi, Hasan
Do you use a fixed set of rss feeds for the crawl, or discover rss feeds
dynamically?
2006/2/14, Hasan Diwan [EMAIL PROTECTED]:
I've written a perl script to build up a urls file to crawl from RSS
feeds. Will nutch handle duplicate URLs in the crawl file, or would
that logic need to be in my script?
I have written some test code using the nutch api.
As nutch-default.xml and nutch-site.xml are included in nutch-0.7.jar, can I
debug my code with these files in a conf dir instead of the copies bundled in
the jar file?
Besides, how can I refer to other files like mime-types.xml in my code? Where
does NutchConf look for files other than nutch-default.xml and nutch-site.xml?
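A hedged sketch, assuming the 0.7 behavior that NutchConf resolves everything through the classpath, so a local conf directory placed ahead of nutch-0.7.jar wins over the bundled copies (getConfResourceAsInputStream is my best recollection of the resource accessor; verify it in your version):

import java.io.InputStream;
import org.apache.nutch.util.NutchConf;

public class ConfDemo {
    public static void main(String[] args) throws Exception {
        // Properties come from nutch-default.xml, overridden by nutch-site.xml,
        // wherever those files are first found on the classpath.
        NutchConf conf = NutchConf.get();
        int limit = conf.getInt("http.content.limit", 65536);
        System.out.println("http.content.limit = " + limit);

        // Bundled files such as mime-types.xml go through the same classpath lookup.
        InputStream mimeTypes = conf.getConfResourceAsInputStream("mime-types.xml");
        System.out.println("mime-types.xml found: " + (mimeTypes != null));
    }
}

Running with something like java -cp conf:nutch-0.7.jar ... should then pick up the local files first.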
In the process of crawling and indexing, some pages are just used as temporary
links to the pages I want to index, so how can I keep those kinds of pages
from being indexed? Or which part of nutch should I extend?
to see what your options are.
Jake.
-----Original Message-----
From: Elwin [mailto:[EMAIL PROTECTED]
Sent: Friday, February 10, 2006 4:38 AM
To: nutch-user@lucene.apache.org
Subject: How to control contents to be indexed?
In the process of crawling and indexing, some pages are just used
According to the code:
theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));
I can see that the item description is also included.
However, when I tried with this feed:
http://kgrimm.bravejournal.com/feed.rss
I can only get the title