How to deal with javascript urls?

2006-04-19 Thread Elwin
For example:
<a href="javascript:customCss(6017162)" id="customCssMenu">test</a>
In fact, can Nutch get content from this kind of URL?
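Such javascript: pseudo-URLs are normally just excluded from the crawl, since no protocol plugin can fetch them; a hedged example of a crawl-urlfilter.txt rule that drops them explicitly (this rule is illustrative, not part of the stock filter file):

# skip javascript: pseudo-URLs (illustrative rule)
-^javascript: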


Re: java.net.SocketTimeoutException: Read timed out

2006-04-14 Thread Elwin
Oh. Thank you very much.

On 06-4-14, Raghavendra Prabhu [EMAIL PROTECTED] wrote:

 Hi Elwin

 Just switch it to protocol-http in the conf file (the nutch-default.xml file).

 If you don't want to use the threaded fetching, change the number of threads
 in the configuration file.

 Have a limited number of fetcher threads (as Doug said).
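 For reference, a minimal sketch of the two settings involved, assuming the
 standard plugin.includes and fetcher.threads.fetch properties; the values
 shown are only examples and other plugins you need should stay in the list:

 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
   <description>Use protocol-http instead of protocol-httpclient.</description>
 </property>
 <property>
   <name>fetcher.threads.fetch</name>
   <value>10</value>
   <description>Keep the number of fetcher threads within the available bandwidth.</description>
 </property>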

 Rgds
 Prabhu

 On 4/14/06, Elwin [EMAIL PROTECTED] wrote:
 
  Hi Raghavendra
 
  Then how do I use protocol-http instead of protocol-httpclient?
  Can I still use HttpResponse?
 
   On 06-4-13, Raghavendra Prabhu [EMAIL PROTECTED] wrote:
   Hi Doug
  
    I am not sure whether this problem is entirely due to bandwidth starvation.
   
    In some cases, having the protocol as protocol-http instead of
    protocol-httpclient seems to fix the problem.
   
    I am not sure why, but that change seemed to fix it.
  
   Rgds
   Prabhu
  
  
   On 4/13/06, Elwin [EMAIL PROTECTED] wrote:
   
In fact I'm not using the Nutch fetcher; I just call HttpResponse
in my own code, which is not multi-threaded.
   
2006/4/13, Doug Cutting [EMAIL PROTECTED]:

 Elwin wrote:
  When I use the httpclient.HttpResponse to get http content in
  nutch, I
 often
  get SocketTimeoutExceptions.
   Can I solve this problem by enlarging the value of http.timeout in the
   conf file?

 Perhaps, if you're working with slow sites.  But, more likely,
  you're
 using too many fetcher threads and exceeding your available
  bandwidth,
 causing threads to starve and timeout.

 Doug

   
   
   
   
  
 
 
 






Re: java.net.SocketTimeoutException: Read timed out

2006-04-13 Thread Elwin
In fact I'm not using the Nutch fetcher; I just call HttpResponse
in my own code, which is not multi-threaded.

2006/4/13, Doug Cutting [EMAIL PROTECTED]:

 Elwin wrote:
  When I use the httpclient.HttpResponse to get http content in nutch, I
 often
  get SocketTimeoutExceptions.
  Can I solve this problem by enlarging the value of http.timeout in conf
  file?

 Perhaps, if you're working with slow sites.  But, more likely, you're
 using too many fetcher threads and exceeding your available bandwidth,
 causing threads to starve and timeout.

 Doug






Re: java.net.SocketTimeoutException: Read timed out

2006-04-13 Thread Elwin
Hi Raghavendra

 Then how do I use protocol-http instead of protocol-httpclient?
Can I still use HttpResponse?

On 06-4-13, Raghavendra Prabhu [EMAIL PROTECTED] wrote:
 Hi Doug

 I am not sure whether this problem is entirely due to bandwidth starvation.

 In some cases, having the protocol as protocol-http instead of
 protocol-httpclient seems to fix the problem.

 I am not sure why, but that change seemed to fix it.

 Rgds
 Prabhu


 On 4/13/06, Elwin [EMAIL PROTECTED] wrote:
 
  In fact I'm not using the Nutch fetcher; I just call HttpResponse
  in my own code, which is not multi-threaded.
 
  2006/4/13, Doug Cutting [EMAIL PROTECTED]:
  
   Elwin wrote:
When I use the httpclient.HttpResponse to get http content in nutch, I
   often
get SocketTimeoutExceptions.
Can I solve this problem by enlarging the value of http.timeout in
  conf
file?
  
   Perhaps, if you're working with slow sites.  But, more likely, you're
   using too many fetcher threads and exceeding your available bandwidth,
   causing threads to starve and timeout.
  
   Doug
  
 
 
 
 





java.net.SocketTimeoutException: Read timed out

2006-04-12 Thread Elwin
When I use the httpclient.HttpResponse to get http content in nutch, I often
get SocketTimeoutExceptions.
Can I solve this problem by enlarging the value of http.timeout in conf
file?
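If slow sites really are the cause, the property to enlarge is http.timeout; a
sketch of a nutch-site.xml override (the 30000 ms value is only an example):

 <property>
   <name>http.timeout</name>
   <value>30000</value>
   <description>The default network timeout, in milliseconds.</description>
 </property>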


Inject url into a temp webdb

2006-03-18 Thread Elwin
WebDBInjector injector = new WebDBInjector(dbWriter);
I dynamically use the injector to inject urls into a temp empty webdb.
Then I use Enumeration e = webdb.pages() to dump the URLs from that webdb, but
it seems that I get nothing.
Do I need to update the webdb after I inject the URLs? If so, how?
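A guess at the missing step, sketched under the assumption that WebDB edits
only become visible once the writer is closed; the close() call below is that
assumption, not something confirmed in this thread:

   WebDBInjector injector = new WebDBInjector(dbWriter);
   // ... inject the URLs as above ...
   dbWriter.close();              // assumption: pending edits are applied when the writer closes
   // only after closing the writer, open the db and enumerate its pages
   Enumeration e = webdb.pages();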


find duplicate urls in webdb

2006-03-05 Thread Elwin
When I read pages out of a webdb and printed out the URL of each page, I
found that two URLs were exactly the same.
Is it possible to have two pages with the same URL?



Re: About regex in the crawl-urlfilter.txt config file

2006-02-23 Thread Elwin
Oh I have asked a silly question about regex, hehe.

2006/2/23, Jack Tang [EMAIL PROTECTED]:

 Hi

 I think the url-filter does a contains check rather than a full match.

 /Jack

 On 2/23/06, Elwin [EMAIL PROTECTED] wrote:
  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
 
  Will this pattern accept url like this
 http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/?
  I think not, but in fact Nutch can crawl and fetch URLs like that in an
  intranet crawl. Why?
 
 


 --
 Keep Discovering ... ...
 http://www.jroller.com/page/jmars
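A small illustration of the contains-versus-matches point, using
java.util.regex here purely for clarity (the Nutch filter itself is built on a
Perl5 regex library):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContainsVsMatches {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/");
        String url = "http://MY.DOMAIN.NAME/some/page.html";
        Matcher m = p.matcher(url);
        // find() succeeds: the group may repeat zero times, and the pattern only
        // has to occur as a prefix of the URL, not describe the whole string.
        System.out.println("find()    = " + m.find());     // true
        System.out.println("matches() = " + m.matches());  // false, pattern does not cover the whole URL
    }
}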






Why Perl5 regular expressions?

2006-02-22 Thread Elwin
Why does the URL filter of Nutch use Perl5 regular expressions? Any benefits?



Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
Yes, it's true, although it's not the cause of my problem.

On 06-2-20, Piotr Kosiorowski [EMAIL PROTECTED] wrote:

 Hello,
 One more thing to check:
 <property>
 <name>db.max.outlinks.per.page</name>
 <value>100</value>
 <description>The maximum number of outlinks that we'll process for a page.
 </description>
 </property>

 Regards
 Piotr
 Guenter, Matthias wrote:
  Hi Elwin
  Did you check the content limit?
  Otherwise the truncation occurs naturally, I guess
 
  <property>
    <name>http.content.limit</name>
    <value>65536</value>
    <description>The length limit for downloaded content, in bytes.
    If this value is nonnegative (>=0), content longer than it will be
    truncated; otherwise, no truncation at all.
    </description>
  </property>
 
  Kind regards
 
  Matthias
  -Original Message-
  From: Elwin [mailto:[EMAIL PROTECTED]
  Sent: Friday, 17 February 2006 09:36
  To: nutch-user@lucene.apache.org
  Subject: Re: extract links problem with parse-html plugin
 
  I have written a test class HtmlWrapper and here is some code:
 
    HtmlWrapper wrapper = new HtmlWrapper();
    Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
    String temp = new String(c.getContent());
    System.out.println(temp);
 
    wrapper.parseHttpContent(c); // get all outlinks into an ArrayList
    ArrayList links = wrapper.getBlogLinks();
    for (int i = 0; i < links.size(); i++) {
        String urlString = (String) links.get(i);
        System.out.println(urlString);
    }
 
  I can only get a few of the links from that page.
 
  The URL is from a Chinese site; however, you can just skip the non-English
  content and look at the HTML elements.
 
  2006/2/17, Guenter, Matthias [EMAIL PROTECTED]:
  Hi Elwin
  Can you provide samples of the links and code that are not working, and put
  them into JIRA?
  Kind regards
  Matthias
 
 
 
  -Original Message-
  From: Elwin [mailto:[EMAIL PROTECTED]
  Sent: Fri 17.02.2006 08:51
  To: nutch-user@lucene.apache.org
  Subject: extract links problem with parse-html plugin
 
  It seems that the parse-html plugin may not process many pages well,
  because I have found that the plugin can't extract all the valid links in a
  page when I test it in my code.
  I guess it may be caused by the style of an HTML page? When I view the
  source of an HTML page I tried to parse, I saw that some elements in the
  source are broken up by unneeded spaces. However, this situation is quite
  common on the pages of large portal sites or news sites.
 
 
 
 
 







Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
No, I didn't try that. I just use the default parser for the plugin. It
seems to work well now.
Thx.

2006/2/20, Andrzej Bialecki [EMAIL PROTECTED]:

 Elwin wrote:
  Yes, it's true, although it's not the cause of my problem.
 

 Did you try to use the alternative HTML parser (TagSoup) supported by
 the plugin? You need to set a property parser.html.impl to tagsoup.

 --
 Best regards,
 Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com







Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Elwin
I will try it. Many thanks.

2006/2/20, Andrzej Bialecki [EMAIL PROTECTED]:

 Elwin wrote:
  No, I didn't try that. I just use the default parser for the plugin. It
  seems to work well now.
  Thx.
 

  I often find TagSoup performing better than NekoHTML. In case of some
  grave HTML errors Neko tends to simply truncate the document, while
  TagSoup just keeps on truckin'. This is especially true for pages with
  multiple <html> elements, where Neko ignores all elements but the first
  one, while TagSoup just treats any <html> elements inside a document
  like any other nested element.

 --
 Best regards,
 Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com







Re: No Accents

2006-02-20 Thread Elwin
I think maybe you could add a mapping between these letters.
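One hedged way to build such a mapping is to decompose accented characters and
strip the combining marks before indexing; a minimal Java sketch (it relies on
java.text.Normalizer and is not part of any existing Nutch analyzer):

import java.text.Normalizer;

public class AccentFolder {
    /** Decompose accented characters and drop the combining marks, e.g. "Portégé" -> "Portege". */
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Portégé"));  // prints Portege
    }
}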

2006/2/20, Franz Werfel [EMAIL PROTECTED]:

 Hello,

 Sorry this is probably in the documentation somewhere, but I couldn't find
 it.

 How to index and search accented words without accents?

 For example: Portégé (a model for Toshiba laptops) would be indexed
  as portege; and the search for portégé would be equivalent to the
 search for portege and find either Portégé, Portegé, portége,
 portege, etc.

  This is how Google works; maybe Nutch does the same by default?

 Currently, by default (0.7.1), Portégé is indexed as portégé and
 found only if searched for portégé or Portégé (but not portege).

  This is all the more useful considering users in the US do not have
  easy access to accented letters on their keyboards...

 Thanks,
 Frank.






Re: Content-based Crawl vs Link-based Crawl?

2006-02-19 Thread Elwin
Hi Howie,

  Thank you for the valuable suggestion. I will consider it carefully.
  As I'm going to parse non-English (actually Chinese) pages, I think maybe
regular expressions are not very useful to me. I've decided to integrate
some simple data mining techniques to achieve it.


2006/2/19, Howie Wang [EMAIL PROTECTED]:

 I think doing this sort of thing works out very well for niche search
 engines.
 Analyzing the contents of the page takes up some time, but it's just
 milliseconds
 per page. If you contrast this with actually fetching a page that you
 don't
 want
 (several seconds * num pages), you can see that the time savings are very
 much
 in your favor.

 I'm not sure if you'd create a URLFilter since I don't think that gives
 you
 easy
 access to the page contents. You could do it in an HtmlParseFilter. Just
 copy the
 parse-html plugin, look for the bit of code where the Outlinks array is
 set.
 Then filter
 that Outlinks array as you see fit.

 One thing to be careful about is using regular expressions in Java to
 analyze the
 page contents. I've had lots of problems with hanging using
 java.util.regex.
 I get
 this with perfectly legal regex's, and it's only on certain pages that I
 get
 problems.
 It's not as big a problem for me since most of my regex stuff is during
 the
 indexing phase, and it's easy to re-index. If it happens during the fetch,
 it's a bigger
 pain, since you have to recover from an aborted fetch. So you might want
 to
 do lots of small crawls, instead of big full crawls.

 Howie


 I think this can be done with a plug-in like a URL filter, but it seems it
 would cause performance problems for the crawling process. So I'd like to
 hear your opinions. Is it possible or meaningful to crawl not just by links
 but by contents or terms?







Content-based Crawl vs Link-based Crawl?

2006-02-18 Thread Elwin
Nutch crawls web pages link by link, extracting outlinks from each page.
Could the crawl instead be guided by content?
For example, we could check whether the link text contains some keywords from a
dictionary to decide whether or not to crawl it.
Moreover, we could check whether the content of a page fetched via an outlink
contains some keywords from a dictionary.

I think this can be done with a plug-in like a URL filter, but it seems it would
cause performance problems for the crawling process. So I'd like to hear
your opinions. Is it possible or meaningful to crawl not just by links
but by contents or terms?
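A minimal sketch of the link-text keyword check described above, written as a
plain helper class rather than against any specific Nutch extension point; the
Outlink-style pair of URL and anchor text is assumed:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class KeywordOutlinkFilter {
    private final Set<String> dictionary;

    public KeywordOutlinkFilter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    /** Keep an outlink only if its anchor text contains at least one dictionary keyword. */
    public boolean accept(String url, String anchorText) {
        if (anchorText == null) return false;
        String text = anchorText.toLowerCase();
        for (String keyword : dictionary) {
            if (text.contains(keyword)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        KeywordOutlinkFilter f = new KeywordOutlinkFilter(
                new HashSet<String>(Arrays.asList("blog", "news")));
        System.out.println(f.accept("http://example.com/a", "hot blog entries")); // true
        System.out.println(f.accept("http://example.com/b", "contact us"));       // false
    }
}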


Re: extract links problem with parse-html plugin

2006-02-17 Thread Elwin
I have written a test class HtmlWrapper and here is some code:

  HtmlWrapper wrapper = new HtmlWrapper();
  Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
  String temp = new String(c.getContent());
  System.out.println(temp);

  wrapper.parseHttpContent(c); // get all outlinks into an ArrayList
  ArrayList links = wrapper.getBlogLinks();
  for (int i = 0; i < links.size(); i++) {
      String urlString = (String) links.get(i);
      System.out.println(urlString);
  }

I can only get a few of the links from that page.

The URL is from a Chinese site; however, you can just skip the non-English
content and look at the HTML elements.

2006/2/17, Guenter, Matthias [EMAIL PROTECTED]:

 Hi Elwin
 Can you provide samples of the links and code that are not working, and put
 them into JIRA?
 Kind regards
 Matthias



 -Original Message-
 From: Elwin [mailto:[EMAIL PROTECTED]
 Sent: Fri 17.02.2006 08:51
 To: nutch-user@lucene.apache.org
 Subject: extract links problem with parse-html plugin

 It seems that the parse-html plugin may not process many pages well,
 because I have found that the plugin can't extract all the valid links in a
 page when I test it in my code.
 I guess it may be caused by the style of an HTML page? When I view the
 source of an HTML page I tried to parse, I saw that some elements in the
 source are broken up by unneeded spaces. However, this situation is quite
 common on the pages of large portal sites or news sites.






Re: extract links problem with parse-html plugin

2006-02-17 Thread Elwin
Hi Guenter,

 I think you are right. Although I haven't re-run the code, I have
checked the last URL I got from that page, which is just in the middle of
the page, so it seems that the page has been truncated.
Many thanks!


On 06-2-17, Guenter, Matthias [EMAIL PROTECTED] wrote:

 Hi Elwin
 Did you check the content limit?
 Otherwise the truncation occurs naturally, I guess

 <property>
 <name>http.content.limit</name>
 <value>65536</value>
 <description>The length limit for downloaded content, in bytes.
 If this value is nonnegative (>=0), content longer than it will be
 truncated; otherwise, no truncation at all.
 </description>
 </property>

 Kind regards

 Matthias
  -Original Message-
  From: Elwin [mailto:[EMAIL PROTECTED]
  Sent: Friday, 17 February 2006 09:36
  To: nutch-user@lucene.apache.org
  Subject: Re: extract links problem with parse-html plugin

 I have written a test class HtmlWrapper and here is some code:

 HtmlWrapper wrapper = new HtmlWrapper();
 Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
 String temp = new String(c.getContent());
 System.out.println(temp);

 wrapper.parseHttpContent(c); // get all outlinks into an ArrayList
 ArrayList links = wrapper.getBlogLinks();
 for (int i = 0; i < links.size(); i++) {
     String urlString = (String) links.get(i);
     System.out.println(urlString);
 }

 I can only get a few of the links from that page.

 The URL is from a Chinese site; however, you can just skip the non-English
 content and look at the HTML elements.

 2006/2/17, Guenter, Matthias [EMAIL PROTECTED]:
 
  Hi Elwin
  Can you provide samples of the links and code that are not working, and put
  them into JIRA?
  Kind regards
  Matthias
 
 
 
   -Original Message-
   From: Elwin [mailto:[EMAIL PROTECTED]
   Sent: Fri 17.02.2006 08:51
   To: nutch-user@lucene.apache.org
   Subject: extract links problem with parse-html plugin
 
  It seems that the parse-html plugin may not process many pages well,
  because I have found that the plugin can't extract all the valid links in a
  page when I test it in my code.
  I guess it may be caused by the style of an HTML page? When I view the
  source of an HTML page I tried to parse, I saw that some elements in the
  source are broken up by unneeded spaces. However, this situation is quite
  common on the pages of large portal sites or news sites.
 
 








Question about fExtensionPoints in PluginRepository.java

2006-02-15 Thread Elwin
fExtensionPoints is a HashMap.
What happens when two plugins extend the same extension point, given the code
fExtensionPoints.put(xpId, point)?
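For what the question comes down to, plain HashMap semantics apply: a second
put() with the same key silently replaces the earlier value. A small
illustration with hypothetical ids and values, not the actual PluginRepository
code:

import java.util.HashMap;
import java.util.Map;

public class PutOverwriteDemo {
    public static void main(String[] args) {
        Map<String, String> fExtensionPoints = new HashMap<String, String>();
        // hypothetical extension-point id and values, purely for illustration
        fExtensionPoints.put("org.apache.nutch.parse.Parser", "registered by plugin A");
        // the second put() for the same id overwrites the first entry
        fExtensionPoints.put("org.apache.nutch.parse.Parser", "registered by plugin B");
        System.out.println(fExtensionPoints.size());                               // 1
        System.out.println(fExtensionPoints.get("org.apache.nutch.parse.Parser")); // registered by plugin B
    }
}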


Re: Duplicate urls in urls file

2006-02-15 Thread Elwin
Did you achieve it by extending Nutch with a plugin?
I think it's possible to do it in a URLFilter plugin that filters RSS feed
links.
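A hedged sketch of that idea, assuming the URLFilter-style contract of
returning the URL to keep it and null to drop it; the feed heuristics below are
illustrative, not an existing plugin:

import java.util.HashSet;
import java.util.Set;

public class RssFeedUrlFilter {
    private final Set<String> seen = new HashSet<String>();

    /** Return the URL to keep it, or null to drop non-feed or already-seen URLs. */
    public String filter(String urlString) {
        if (urlString == null) return null;
        String url = urlString.toLowerCase();
        boolean looksLikeFeed = url.endsWith(".rss") || url.endsWith(".xml") || url.contains("/feed");
        if (!looksLikeFeed) return null;
        // add() returns false if the URL was already seen, which drops duplicates
        return seen.add(urlString) ? urlString : null;
    }
}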


2006/2/16, Hasan Diwan [EMAIL PROTECTED]:

 Elwin:
 On 13/02/06, Elwin [EMAIL PROTECTED] wrote:
   Do you use a fixed set of RSS feeds for the crawl or discover RSS feeds
  dynamically?

 Before I broke the script, it would take the URL, grab the feeds
 specified from the link tags, then parse them. I suspect this is
 similar to what the parse-rss plugin does, but I have not had the
 chance to look at it as yet.
 --
 Cheers,
 Hasan Diwan [EMAIL PROTECTED]






Re: Duplicate urls in urls file

2006-02-13 Thread Elwin
Hi Hasan,

   Do you use a fixed set of RSS feeds for the crawl or discover RSS feeds
dynamically?


2006/2/14, Hasan Diwan [EMAIL PROTECTED]:

 I've written a perl script to build up a urls file to crawl from RSS
 feeds. Will nutch handle duplicate URLs in the crawl file or would
 that logic need to be in my perl script?
 --
 Cheers,
 Hasan Diwan [EMAIL PROTECTED]






Problem in debugging code that uses the nutch api

2006-02-12 Thread Elwin
I have written some test code using the nutch api.
As nutch-default.xml and nutch-site.xml are included in nutch-0.7.jar, can I
debug my code with these files in a conf dir instead of the copies bundled in
the jar file?
Besides, how can I refer to other files like mime-types.xml in my code?
Where does NutchConf load them from?
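NutchConf appears to load nutch-default.xml, nutch-site.xml and resources such
as mime-types.xml through the classpath, so one hedged way to debug with local
copies is to put a conf directory ahead of the jar on the classpath (MyTest is
a hypothetical test class; add whatever other jars your code needs):

java -cp ./conf:nutch-0.7.jar MyTest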


Why are other config files not included in nutch-0.7.jar

2006-02-12 Thread Elwin
other than nutch-default.xml and nutch-site.xml.


How to control contents to be indexed?

2006-02-10 Thread Elwin
In the process of crawling and indexing, some pages are just used as
temporary links to the pages I want to index, so how can I keep those
kinds of pages from being indexed? Or which part of Nutch should I extend?


Re: How to control contents to be indexed?

2006-02-10 Thread Elwin
Thank you.
But the pages I want to crawl are just from the internet, and certainly I can't
control them.


2006/2/10, Vanderdray, Jacob [EMAIL PROTECTED]:

If you control the temporary links pages, then just add a
 robots meta tag.  Take a look at
 http://www.robotstxt.org/wc/meta-user.html to see what your options are.
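 For pages one does control, the meta tag in question would look like this
 (noindex keeps the page out of the index while follow still lets its links
 be followed):

 <meta name="robots" content="noindex,follow">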

 Jake.

 -Original Message-
 From: Elwin [mailto:[EMAIL PROTECTED]
 Sent: Friday, February 10, 2006 4:38 AM
 To: nutch-user@lucene.apache.org
 Subject: How to control contents to be indexed?

  In the process of crawling and indexing, some pages are just used as
  temporary links to the pages I want to index, so how can I keep those
  kinds of pages from being indexed? Or which part of Nutch should I
  extend?






Re: Which version of rss does parse-rss plugin support?

2006-02-10 Thread Elwin
According to the code:
theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));
I can see that item description is also included.

However, when I tried with this feed:
http://kgrimm.bravejournal.com/feed.rss
I can only get the title and description for the channel, and I failed to
search for the words in the item descriptions.

From the above code, the item description is combined with the outlink URL; is
it used as the contentTitle for that URL? When the outlink is fetched and
parsed, I think new data about that URL will be generated.


On 06-2-11, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi,


the contentTitle will be a concatenation of the titles of the RSS
 Channels
  that we've parsed.
So the titles of the RSS Channels are what is delivered for indexing,
  right?

 They're certainly part of it, but not the only part. The concatenation of
 the titles of the RSS Channels are what is delivered for the title
 portion
 of indexing.

If I want the indexer to include more information about a rss file
 (such
  as item descriptions), can I just concatenate them to the contentTitle?

 They're already there. There is a variable called index text: ultimately
 that variable includes the item descriptions, along with the channel
 descriptions. That, along with the title portion of indexing is the full
 set of textual data delivered by the parser for indexing. So, it already
 includes that information. Check out lines 137, and 161 in the parser to
 see
 what I mean. Also, check out lines 204-207, which are:

 ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
     contentTitle.toString(), outlinks, content.getMetadata());

 parseData.setConf(this.conf);

 return new ParseImpl(indexText.toString(), parseData);

 You can see that the return from the Parser, i.e., the ParseImpl, includes
 both the indexText, along with the parse data (that contains the title
 text).

 Now, if you wanted to add any other metadata gleaned from the RSS to the
 title text, or the content text, you can always modify the code to do that
 in your own environment. The RSS Parser plugin returns a full channel
 model
 and item model that can be extended and used for those purposes.

 Hope that helps!

 Cheers,
 Chris


 
 
  On 06-2-6, Chris Mattmann [EMAIL PROTECTED] wrote:
 
  Hi there,
 
That should work: however, the biggest problem will be making sure
 that
  text/xml is actually the content type of the RSS that you are
 parsing,
  which you'll have little or no control over.
 
  Check out this previous post of mine on the list to get a better idea
 of
  what the real issue is:
 
  http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
 
  G'luck!
 
  Cheers,
  Chris
 
 
  __
  Chris A. Mattmann
  [EMAIL PROTECTED]
  Staff Member
  Modeling and Data Management Systems Section (387)
  Data Management Systems and Technologies Group
 
  _
  Jet Propulsion LaboratoryPasadena, CA
  Office: 171-266BMailstop:  171-246
  Phone:  818-354-8810
  ___
 
  Disclaimer:  The opinions presented within are my own and do not
 reflect
  those of either NASA, JPL, or the California Institute of Technology.
 
  -Original Message-
  From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
  Sent: Saturday, February 04, 2006 11:40 PM
  To: nutch-user@lucene.apache.org
  Subject: Re: Which version of rss does parse-rss plugin support?
 
  Hi Chris
 
 
  How do I change the plugin.xml? For example, if I want to crawl RSS files
  ending with xml, do I just add a new element?
 
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="application/rss+xml"
                    pathSuffix="rss"/>
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="application/rss+xml"
                    pathSuffix="xml"/>
 
  Am I right?
 
 
 
  On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:
 
  Hi there,
  Sure it will, you just have to configure it to do that. Pop over to
  $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there
  there
  is
  an attribute called pathSuffix. Change that to handle whatever type
  of
  rss
  file you want to crawl. That will work locally. For web-based crawls,
  you
  need to make sure that the content type being returned for your RSS
  content
  matches the content type specified in the plugin.xml file that
  parse-rss
  claims to support.
 
  Note that you might not have * a lot * of success with being able to
  control the content type for rss files returned by web servers. I've
  seen
  a
  LOT of inconsistency among the way that they're configured by the
  administrators, etc. However, just to