refetch only

2006-02-05 Thread Raghavendra Prabhu
Hi

What does "refetch only" do?

Does it refetch only pages that have already been fetched (pages marked as
fetched in the WebDB), and skip other pages that have been recorded as failed
in the WebDB?

Rgds

Prabhu


fetchlist doubt

2006-02-05 Thread Raghavendra Prabhu
Hi

A fetchlist entry has a boolean called fetch (which decides whether it should
be fetched), and each Page structure has a date (the next fetch time).

Is there a one-to-one mapping between a Page and a fetchlist entry? That is,
if you set the "don't fetch" field on a fetchlist entry, which pages will it
control?

I think I am being ambiguous here, but if anyone can help me, please do. For
concreteness, here is roughly the structure I am asking about, as shown below.
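This is only a sketch from my reading of the code; the field names are my
assumptions, not the exact Nutch source:

  // Sketch of the two structures (field names are assumptions):
  class FetchListEntry {
    boolean fetch;  // if false, the fetcher skips this entry
    Page page;      // the WebDB page this entry was generated from
  }

  class Page {
    long nextFetchTime;  // when the page is next due to be fetched
  }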


Rgds
Prabhu


sockettimeout exception

2006-02-05 Thread Raghavendra Prabhu
Hi

I am running a crawl using protocol-httpclient, and I get:

java.io.IOException: java.net.SocketTimeoutException: Read timed out

Can someone tell me why I get this error?

After that, the crawl hangs and simply stays in the same state.

Rgds
Prabhu


Re: sockettimeout exception

2006-02-05 Thread Stefan Groschupf

Is the host reachable in your web browser?
Does this host block your IP because it interprets Nutch as a DoS attack?
Is your bandwidth limited?
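
Also, one guess: the default network timeout may simply be too short for a
slow host. You could try raising http.timeout in conf/nutch-site.xml, for
example:

  <property>
    <name>http.timeout</name>
    <!-- default network timeout, in milliseconds -->
    <value>30000</value>
  </property>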





Re: sockettimeout exception

2006-02-05 Thread Raghavendra Prabhu
Hi Stefan

My bandwidth is limited, but I am able to crawl other links on the same host
(so I guess it is not denying me).

Is it because of protocol-httpclient (should I use protocol-http)?

Rgds
Prabhu





Re: sockettimeout exception

2006-02-05 Thread Stefan Groschupf

I personally prefer protocol-http.




Re: sockettimeout exception

2006-02-05 Thread Raghavendra Prabhu
Hi Stefan

One more thing I am seeing is that some outlinks are not parsed properly.

I tried both HTML parsers (neko and tagsoup).

I know this may not be due to protocol-http, but is there a chance it is also
due to the same reason?

Thanks for the answer.

Rgds
Prabhu






Re: sockettimeout exception

2006-02-05 Thread Stefan Groschupf
Maybe your page size is bigger than the configured limit. See conf/nutch-*.xml.
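
The property I have in mind is http.content.limit (an assumption about which
limit is being hit); in conf/nutch-site.xml it can be raised like this:

  <property>
    <name>http.content.limit</name>
    <!-- maximum number of bytes to download per page;
         the default is 65536, and -1 disables the limit -->
    <value>262144</value>
  </property>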





How deep to go

2006-02-05 Thread Andy Morris
How deep should a good intranet crawl be: 10? 20?
I still can't get all of my site searchable.

Here is my situation: I want to crawl just a local site for our intranet. We
have just rolled out an ASP-only website, replacing a pure HTML site. I ran
Nutch on the old site and got great results. Since moving to this new site, I
am having a devil of a time retrieving good information, and I am missing a
ton of info altogether. I am not sure what settings I need to change to get
good results. One setting I changed does produce good results, but then Nutch
seems to crawl other websites, not just my domain: in the last line of the
crawl-urlfilter file, I just replaced the - with + so that it does not ignore
other URLs (see the sketch below). Our site is www.woodward.edu; I was
wondering if someone on this list could crawl this site, and only this
domain, and see what they come up with. Woodward.edu is the domain. I am just
stumped as to what to do next. I am running a nightly build from January
26th, 2006.
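
For reference, I believe the fix I am looking for is something like the stock
crawl-urlfilter.txt pattern with my domain substituted (a sketch; the regex is
my guess at the usual form):

  # accept URLs within the woodward.edu domain only
  +^http://([a-z0-9]*\.)*woodward.edu/

  # skip everything else (the last line, where I had swapped - for +)
  -.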

My criteria for our local search are to be able to search PDF, images, DOC
files, and web content. You can go to http://search.woodward.edu and see what
the search page pulls up.

Thanks for any help this list can provide.
Andy Morris 


How should I call to the class Injector from hadoop/trunk

2006-02-05 Thread Rafit Izhak_Ratzin

Hi,

I updated my environment to the newest Subversion trunk, and after starting my
datanodes and namenode I would like to start fetching.

So my question is: how should I call the class org.apache.nutch.crawl.Injector
if I am running under the path .../hadoop/trunk?
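
For context, the usage I know from the Nutch scripts (an assumption on my
part; I have not verified it against hadoop/trunk) is:

  # inject a directory of seed URL files into the crawl db
  bin/nutch inject crawl/crawldb urls

which ends up running org.apache.nutch.crawl.Injector with the same two
arguments, so a direct call should take the form:

  java -cp <nutch-and-hadoop-classpath> org.apache.nutch.crawl.Injector crawl/crawldb urls

where crawl/crawldb is the crawl database directory and urls is a directory
containing seed URL files (both names here are only illustrative).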

Thank you,
Rafit





RE: Which version of rss does parse-rss plugin support?

2006-02-05 Thread Chris Mattmann
Hi there,

   That should work; however, the biggest problem will be making sure that
text/xml is actually the content type of the RSS that you are parsing, which
you'll have little or no control over.

Check out this previous post of mine on the list to get a better idea of
what the real issue is:

http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html

G'luck!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
Phone:  818-354-8810
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 -Original Message-
 From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
 Sent: Saturday, February 04, 2006 11:40 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Which version of rss does parse-rss plugin support?
 
 Hi Chris
 
 
 How do I change the plugin.xml? For example, if I want to crawl RSS files
 that end with xml, do I just add a new element?
 
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="application/rss+xml"
                   pathSuffix="rss"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="application/rss+xml"
                   pathSuffix="xml"/>
 
 Am I right?
 
 
 
  On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:
 
  Hi there,
  Sure it will, you just have to configure it to do that. Pop over to
  $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there there is
  an attribute called pathSuffix. Change that to handle whatever type of rss
  file you want to crawl. That will work locally. For web-based crawls, you
  need to make sure that the content type being returned for your RSS content
  matches the content type specified in the plugin.xml file that parse-rss
  claims to support.

  Note that you might not have *a lot* of success with being able to control
  the content type for rss files returned by web servers. I've seen a LOT of
  inconsistency among the way that they're configured by the administrators,
  etc. However, just to let you know, there are some people in the group that
  are working on a solution to addressing this.

  Hope that helps.

  Cheers,
  Chris
 
 
 
  On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
 
   Hi Chris,

   The files of RSS 1.0 have a postfix of rdf. So will the parser recognize
   it automatically as an RSS file?
  
  
   On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:
  
   Hi there,

   parse-rss is based on commons-feedparser
   (http://jakarta.apache.org/commons/sandbox/feedparser). From the
   feedparser website:

   ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0,
   and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension
   and RSS 1.0 modules capability...

   Hope that helps.

   Thanks,
   Chris
  
  
   On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
  
   I see the test file is of version 0.91.
   Does the plugin support higher versions like 1.0 or 2.0?
  



Problem indexing Files

2006-02-05 Thread Saravanaraj Duraisamy
Hi, I am using Nutch to index files on the local FS and over FTP.

My filter file is:

-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
[EMAIL PROTECTED]
-.*(/.+?)/.*?\1/.*?\1/
+^file:/E:/Index Samples/
-^file:/E:/Index Samples/Index/

But Nutch crawls the forbidden folders as well. Is there a WebDB kind of thing
for files too? Is it possible to make Nutch index files based on the
last-modified date?

Can anybody suggest a data structure for a WebDB (filedb?) for files? It would
be good to group files and create separate segments for each group, so that if
some files change, only those segments need to be replaced.

Rgds,
D.Saravanaraj


Re: Which version of rss does parse-rss plugin support?

2006-02-05 Thread 盖世豪侠
Hi Chris,

Thank you for your post; I've read it through.
So, you mean I should also add these lines to the plugin.xml in most cases:

   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="application/rss+xml"
                   pathSuffix="rss"/>
 ...
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="text/xml"
                   pathSuffix="xml"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="text/xml"
                   pathSuffix="rss"/>




