How to deal with JavaScript URLs?

2006-04-19 Thread Elwin
For example: <a href="javascript:customCss(6017162)" id="customCssMenu">test</a>. In fact, can Nutch get content from such kinds of URLs?
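For reference, a javascript: href gives the fetcher nothing it can download, so such links are normally excluded rather than fetched. A minimal sketch of an exclusion rule for crawl-urlfilter.txt, assuming the regex URL filter is enabled:

    # skip javascript: pseudo-URLs
    -^javascript: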

Re: java.net.SocketTimeoutException: Read timed out

2006-04-14 Thread Elwin
Oh. Thank you very much. On 06-4-14, Raghavendra Prabhu [EMAIL PROTECTED] wrote: Hi Elwin Just switch it to protocol-http in the conf file (the nutch-default.xml file). If you don't want the threaded behaviour, change the number of threads in the configuration file. Have a limited number of threads
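Prabhu's suggestion corresponds to two overrides in nutch-site.xml; the sketch below is illustrative (the plugin list must match your install, and the thread count is an arbitrary example):

    <!-- use protocol-http instead of protocol-httpclient -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>
    <!-- cap the number of fetcher threads -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>10</value>
    </property>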

Re: java.net.SocketTimeoutException: Read timed out

2006-04-13 Thread Elwin
In fact I'm not using the fetcher of Nutch; I just call the HttpResponse in my own code, which is not multi-threaded. 2006/4/13, Doug Cutting [EMAIL PROTECTED]: Elwin wrote: When I use the httpclient.HttpResponse to get http content in nutch, I often get SocketTimeoutExceptions. Can I

Re: java.net.SocketTimeoutException: Read timed out

2006-04-13 Thread Elwin
instead of protocol-httpclient seems to be fixing the problem. I am not sure, but the above change seemed to fix the problem. Rgds Prabhu On 4/13/06, Elwin [EMAIL PROTECTED] wrote: In fact I'm not using the fetcher of nutch and I just call the HttpResponse in my own code, which

java.net.SocketTimeoutException: Read timed out

2006-04-12 Thread Elwin
When I use the httpclient.HttpResponse to get http content in Nutch, I often get SocketTimeoutExceptions. Can I solve this problem by enlarging the value of http.timeout in the conf file?
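For reference, http.timeout can be overridden in nutch-site.xml; a sketch (value in milliseconds; the 30000 figure is illustrative, the shipped default is smaller):

    <property>
      <name>http.timeout</name>
      <value>30000</value>
      <description>The default network timeout, in milliseconds.</description>
    </property>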

Inject url into a temp webdb

2006-03-18 Thread Elwin
WebDBInjector injector = new WebDBInjector(dbWriter); I dynamically use the injector to inject urls into a temporary empty webdb. Then I use Enumeration e = webdb.pages() to dump urls from that webdb, but it seems that I get nothing. Do I need to update the webdb after I inject urls? If so, how?
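A likely explanation is that injected pages are not visible until the writer is closed. A rough sketch of the inject-close-read sequence; the class and constructor names below are assumed from the Nutch 0.7 source tree and should be checked against its javadoc:

    import java.io.File;
    import java.util.Enumeration;
    import org.apache.nutch.db.WebDBReader;
    import org.apache.nutch.db.WebDBWriter;
    import org.apache.nutch.fs.NutchFileSystem;
    import org.apache.nutch.tools.WebDBInjector;

    // assumption: WebDBWriter/WebDBReader signatures as in Nutch 0.7
    NutchFileSystem nfs = NutchFileSystem.get();
    File dbDir = new File("tmpdb");

    WebDBWriter dbWriter = new WebDBWriter(nfs, dbDir);
    WebDBInjector injector = new WebDBInjector(dbWriter);
    // ... inject urls here ...
    dbWriter.close();   // pages are flushed to the db only on close()

    WebDBReader reader = new WebDBReader(nfs, dbDir);
    for (Enumeration e = reader.pages(); e.hasMoreElements();) {
        System.out.println(e.nextElement());   // dump each Page
    }
    reader.close();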

find duplicate urls in webdb

2006-03-05 Thread Elwin
When I read pages out of a webdb and printed out the url of each page, I found two urls that were exactly the same. Is it possible for two pages to have the same url?

Re: About regex in the crawl-urlfilter.txt config file

2006-02-23 Thread Elwin
Oh, I have asked a silly question about regex, hehe. 2006/2/23, Jack Tang [EMAIL PROTECTED]: Hi I think the url-filter uses "contains" rather than "match". /Jack On 2/23/06, Elwin [EMAIL PROTECTED] wrote: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME
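One real gotcha does hide in that example, though: the unescaped dots in MY.DOMAIN.NAME match any character, so the rule is looser than intended. A tightened version of the same line:

    # accept hosts in MY.DOMAIN.NAME (dots escaped)
    +^http://([a-z0-9]*\.)*MY\.DOMAIN\.NAME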

Why Perl5 regular expressions?

2006-02-22 Thread Elwin
Why does the url filter of Nutch use Perl5 regular expressions? Any benefits?

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
</property> Regards Piotr Guenter, Matthias wrote: Hi Elwin Did you check the content limit? Otherwise the truncation occurs naturally, I guess <property> <name>http.content.limit</name> <value>65536</value> <description>The length limit for downloaded content, in bytes. If this value
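To allow larger pages, the same property can be overridden in nutch-site.xml; a sketch (value in bytes; the 256 KB figure is illustrative, and if memory serves a negative value disables truncation entirely):

    <property>
      <name>http.content.limit</name>
      <value>262144</value>
    </property>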

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
No, I don't try to do that. I just use the default parser for the plugin. It seems that it works well now. Thx. 2006/2/20, Andrzej Bialecki [EMAIL PROTECTED]: Elwin wrote: Yes, it's true, although it's not the cause of my problem. Did you try to use the alternative HTML parser (TagSoup

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
I will try it. Many thanks. 2006/2/20, Andrzej Bialecki [EMAIL PROTECTED]: Elwin wrote: No, I don't try to do that. I just use the default parser for the plugin. It seems that it works well now. Thx. I often find TagSoup performing better than NekoHTML. In case of some grave HTML

Re: No Accents

2006-02-20 Thread Elwin
I think maybe you could add a mapping between these letters. 2006/2/20, Franz Werfel [EMAIL PROTECTED]: Hello, Sorry this is probably in the documentation somewhere, but I couldn't find it. How to index and search accented words without accents? For example: Portégé (a model for Toshiba
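One concrete way to implement such a mapping is to decompose the text and strip the combining accents; a sketch in plain Java (java.text.Normalizer, available since Java 6, so not what Nutch 0.7 itself would have used):

    import java.text.Normalizer;

    public class AccentFolder {
        // "Portégé" -> "Portege": decompose accented letters, then drop the accent marks
        public static String fold(String s) {
            String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
            return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        }

        public static void main(String[] args) {
            System.out.println(fold("Portégé"));   // prints Portege
        }
    }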

Re: Content-based Crawl vs Link-based Crawl?

2006-02-19 Thread Elwin
Hi Howie, Thank you for the valuable suggestion. I will consider it carefully. As I'm going to parse non-English (actually Chinese) pages, I think regular expressions may not be very useful to me. I've decided to integrate some simple data mining techniques to achieve it. 2006/2/19, Howie

Content-based Crawl vs Link-based Crawl?

2006-02-18 Thread Elwin
Nutch crawls web pages from link to link by extracting outlinks from each page. For example, we could check whether the link text contains keywords from a dictionary to decide whether or not to crawl it. Moreover, we could check whether the content of a page fetched via an outlink contains some keywords
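The link-text check is easy to sketch; the class below is purely illustrative (the dictionary words and the shouldCrawl helper are hypothetical, not part of Nutch):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class AnchorTextFilter {
        // hypothetical keyword dictionary; in practice loaded from a file
        private final Set<String> keywords =
            new HashSet<String>(Arrays.asList("news", "sport", "finance"));

        // decide whether an outlink is worth crawling from its anchor text
        public boolean shouldCrawl(String anchorText) {
            String text = anchorText.toLowerCase();
            for (String kw : keywords) {
                if (text.contains(kw)) return true;
            }
            return false;
        }
    }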

Re: extract links problem with parse-html plugin

2006-02-17 Thread Elwin
contents and just see the html elements. 2006/2/17, Guenter, Matthias [EMAIL PROTECTED]: Hi Elwin Can you provide samples of not-working links and code? And put it into JIRA? Kind regards Matthias -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Fri 17.02.2006

Re: extract links problem with parse-html plugin

2006-02-17 Thread Elwin
Hi Guenter, I think you are right. Although I haven't re-run the code, I have checked the last url I got from that page, which is just in the middle of the page, so it seems that the page has been truncated. Many thanks! On 06-2-17, Guenter, Matthias [EMAIL PROTECTED] wrote: Hi Elwin Did you

Question about fExtensionPoints in PluginRepository.java

2006-02-15 Thread Elwin
fExtensionPoints is a HashMap. What happens with two plugins that extend the same extension point, given the code fExtensionPoints.put(xpId, point)?
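For what it's worth, two plugins extending the same point do not need two map entries: the id maps to a single ExtensionPoint, and each plugin's extension is attached to that point. A schematic of the idea (simplified stand-in types, not the real org.apache.nutch.plugin classes):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class ExtensionPoint {
        private final List<String> extensions = new ArrayList<String>();
        void addExtension(String pluginId) { extensions.add(pluginId); }
        List<String> getExtensions() { return extensions; }
    }

    class Repository {
        private final Map<String, ExtensionPoint> fExtensionPoints =
            new HashMap<String, ExtensionPoint>();

        // the point is created once per id; later plugins reuse it,
        // so a second put(xpId, ...) never has to clobber the first
        void register(String xpId, String pluginId) {
            ExtensionPoint point = fExtensionPoints.get(xpId);
            if (point == null) {
                point = new ExtensionPoint();
                fExtensionPoints.put(xpId, point);
            }
            point.addExtension(pluginId);
        }
    }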

Re: Duplicate urls in urls file

2006-02-15 Thread Elwin
Did you achieve it by extending Nutch with a plugin? I think it's possible to achieve it with a URLFilter plugin that filters rss feed links. 2006/2/16, Hasan Diwan [EMAIL PROTECTED]: Elwin: On 13/02/06, Elwin [EMAIL PROTECTED] wrote: Do you use a fixed set of rss feeds for crawl or discover rss
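A sketch of such a filter, assuming the URLFilter interface of that era exposes a single filter(String) method that returns the URL to accept it or null to drop it (the feed-detection heuristics are illustrative):

    import org.apache.nutch.net.URLFilter;

    public class RssOnlyURLFilter implements URLFilter {
        // keep only URLs that look like RSS feeds; drop everything else
        public String filter(String urlString) {
            String u = urlString.toLowerCase();
            if (u.endsWith(".rss") || u.endsWith(".xml") || u.indexOf("/feed") >= 0) {
                return urlString;   // accept
            }
            return null;            // reject
        }
    }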

Re: Duplicate urls in urls file

2006-02-13 Thread Elwin
Hi Hasan, Do you use a fixed set of rss feeds for crawling, or do you discover rss feeds dynamically? 2006/2/14, Hasan Diwan [EMAIL PROTECTED]: I've written a perl script to build up a urls file to crawl from RSS feeds. Will nutch handle duplicate URLs in the crawl file or would that logic need to

Problem in debugging codes that using nutch api

2006-02-12 Thread Elwin
I have written some test code using the Nutch API. As nutch-default.xml and nutch-site.xml are included in nutch-0.7.jar, can I debug my code with these files in a conf dir instead of the copies bundled in the jar file? Besides, how can I refer to other files like mime-types.xml in my code? Where does NutchConf
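On the classpath question: NutchConf resolves nutch-default.xml, nutch-site.xml, and resources like mime-types.xml through the classloader, so a local conf directory placed ahead of the jar should take precedence. A hedged example invocation (dependency jars elided):

    java -cp ./conf:nutch-0.7.jar:<dependency jars> MyTest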

Why are other config files not included in nutch-0.7.jar

2006-02-12 Thread Elwin
That is, why are all the config files other than nutch-default.xml and nutch-site.xml left out?

How to control contents to be indexed?

2006-02-10 Thread Elwin
In the process of crawling and indexing, some pages are just used as temporary links to the pages I want to index, so how can I prevent those kinds of pages from being indexed? Or which part of Nutch should I extend?
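One plausible hook is the indexing-filter extension point; the sketch below only illustrates the drop-the-document pattern, and both the method signature and the null-to-skip convention are assumptions to verify against the IndexingFilter interface in your Nutch version:

    import org.apache.lucene.document.Document;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    // hedged sketch: signature assumed, check IndexingFilter in Nutch 0.7
    public class SkipHubPagesFilter implements IndexingFilter {
        public Document filter(Document doc, Parse parse, String url) {
            if (url.indexOf("/hub/") >= 0) {   // illustrative rule for pure link pages
                return null;                   // assumed convention: null drops the page
            }
            return doc;
        }
    }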

Re: How to control contents to be indexed?

2006-02-10 Thread Elwin
to see what your options are. Jake. -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Friday, February 10, 2006 4:38 AM To: nutch-user@lucene.apache.org Subject: How to control contents to be indexed? In the process of crawling and indexing, some pages are just used

Re: Which version of rss does parse-rss plugin support?

2006-02-10 Thread Elwin
According to the code: theOutlinks.add(new Outlink(r.getLink(), r.getDescription())); I can see that the item description is also included. However, when I tried with this feed: http://kgrimm.bravejournal.com/feed.rss I can only get the title