is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hi,

I am crawling some sites using Nutch. My requirement is that when I run a Nutch
crawl, it should be able to reuse the data in the webdb populated by a previous
crawl.

In other words, my question is: suppose my crawl is running and I cancel it
somewhere in the middle; is there some way I can resume the crawl?


I don't even know whether this is possible at all; if there is some way, please
throw some light on this.

TIA

Regards,
Pushpesh


Re: is nutch recrawl possible?

2005-12-19 Thread Stefan Groschupf
It is difficult to answer your question, since the vocabulary you are using may
be wrong.
You can refetch pages, no problem, but you cannot continue a crashed fetch
process.
Nutch provides a tool (the crawl command) that runs a set of steps: segment
generation, fetching, db updating, and so on.
So first try running these steps manually instead of using the crawl command.
Then you will already get an idea of where you can jump in to grab the data
you need.


Stefan
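
As an illustration of running the steps manually, here is a minimal sketch of a
driver that performs the generate/fetch/updatedb round against an existing db
directory instead of using the all-in-one crawl command. The bin/nutch
subcommand names and flags (admin, inject -urlfile, generate, fetch, updatedb)
are assumptions taken from the 0.7-era whole-web tutorial, so check them against
your installation before relying on this.

// Sketch only: drives the individual whole-web steps so that an existing webdb
// in db/ is reused on later runs, instead of the crawl command, which insists
// on a fresh directory. Subcommand names and flags are assumptions taken from
// the 0.7-era whole-web tutorial; verify them with "bin/nutch" on your install.
import java.io.File;
import java.io.IOException;
import java.util.Arrays;

public class ManualCrawlSteps {

    /** Run one bin/nutch subcommand and fail loudly if it does not exit cleanly. */
    static void nutch(String... args) throws IOException, InterruptedException {
        String[] cmd = new String[args.length + 1];
        cmd[0] = "bin/nutch";
        System.arraycopy(args, 0, cmd, 1, args.length);
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("step failed: " + Arrays.toString(cmd));
        }
    }

    public static void main(String[] args) throws Exception {
        // Create and seed the webdb only if it does not exist yet; otherwise reuse it.
        if (!new File("db").exists()) {
            nutch("admin", "db", "-create");
            nutch("inject", "db", "-urlfile", "urls.txt");   // urls.txt: one seed URL per line
        }

        // One generate/fetch/updatedb round; loop this block for more depth.
        nutch("generate", "db", "segments");                 // write a new fetchlist segment
        File[] segments = new File("segments").listFiles();
        Arrays.sort(segments);                               // segment names are timestamps
        File latest = segments[segments.length - 1];
        nutch("fetch", latest.getPath());                    // fetch the listed pages
        nutch("updatedb", "db", latest.getPath());           // fold the results back into the webdb
    }
}

Because the db directory is only created when it is missing, a second run picks
up the webdb left by the previous successful run, which is the kind of reuse
being asked about; a fetch that crashed in the middle of a segment still cannot
be resumed.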





Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hi Stefan,

Thanks for the lightning-fast reply. I was amazed to see such a quick response;
I really appreciate it.

Actually, what I am really looking for is this: suppose I run a crawl for some
sites, say 5, to some depth, say 2. What I want is that the next time I run a
crawl, it reuses the webdb contents populated the first time. (Assuming a
successful crawl. Yes, you are right that a crawl that suddenly broke down
won't work, as its data has lost integrity.)

As you said, we can run the tools provided by Nutch to do the step-by-step
commands needed to crawl, but isn't there some way I can reuse the existing
crawl data? Maybe it involves changing code, but that's OK. Just one more quick
question: why does every crawl need a new directory, and why isn't there an
option to at least reuse the webdb? Maybe I am asking something silly, but I am
clueless :-(

Or, as you said, maybe what I can do is explore the steps you mentioned and get
what I need.

Thanks again,
Pushpesh





Re: is nutch recrawl possible?

2005-12-19 Thread Stefan Groschupf
I still do not clearly understand your plans, sorry. However, pages in the
webdb are recrawled every 30 days (the interval is configurable in
nutch-default.xml).
The new folders are so-called segments, and you can put a segment in the trash
after 30 days.
So what you can do is either never update your webdb with the fetched segments,
which means no new URLs are added, or alternatively use a URL filter.

You will find a lot of posts in the mail archive regarding these issues.
Stefan
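
For reference, here is a small self-contained sketch for looking up that
refetch interval in the config file, relying only on the JDK. The property name
differs between Nutch versions (something like db.default.fetch.interval in the
0.7 line), so the name used below is an assumption; the usual convention is to
override it in nutch-site.xml rather than editing nutch-default.xml itself.

// Sketch only: prints a property from a Nutch config file using just the JDK,
// relying on the <property><name/><value/></property> layout those files use.
// The default property name below is an assumption (it varies across Nutch
// versions), so pass the real name as the first argument if it differs.
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ShowFetchInterval {
    public static void main(String[] args) throws Exception {
        String wanted = args.length > 0 ? args[0] : "db.default.fetch.interval";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("conf/nutch-default.xml"));
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            String name = prop.getElementsByTagName("name").item(0).getTextContent();
            if (wanted.equals(name)) {
                System.out.println(name + " = "
                        + prop.getElementsByTagName("value").item(0).getTextContent());
                return;
            }
        }
        System.out.println("property not found: " + wanted);
    }
}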







Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Actually, I wanted to reuse the processing I do in a particular crawl for
future crawls, so as to avoid downloading pages that are not of interest to me.


Here is an example:

1. Suppose I am crawling the http://www.abc.com website.
2. The site gets injected into the webdb, and the fetchlist tool populates a
fetchlist in the segment directory from the webdb.
3. The Fetcher then creates FetcherThreads, which download the content of each
page.
4. Once I have downloaded a page, I analyse it and may want to mark it as
blocked (because I find it useless for me) and store this information
persistently, so that when I crawl the same site next time, Nutch remembers
that I blocked the page and skips downloading that URL.

So basically it is like this:

I run a crawl, and suppose that out of 100 pages in total I mark 60 as blocked.
After this crawl finishes, I run the same crawl again, but this time I want
those 60 URLs not to be downloaded, since I marked them as blocked. Can I do
this somewhere in Nutch? Maybe I could assign a very low or zero score to these
URLs and set my cut-off score above that. But the problem with the crawl
command is that every time I run it, it requires that the directory does not
already exist, so my previous data cannot be used.

But the steps you suggested seem valuable, and maybe I will have to write my
own CrawlTool to make it behave the way I really need, so I think I have the
clue and just need to work it out.

Thanks for the valuable info and your precious time. Hope I am clearer this
time :-)

Regards
Pushpesh




For example, suppose I crawl the website www.abc.com; I then find some links in
it and assign my own score (I have already made code changes for this) to the
URLs found on the www.abc.com site before fetching their contents.
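
One way to get this "remember what I blocked" behaviour is a custom URL-filter
plugin that rejects exact URLs recorded in a blocklist file written by the
post-fetch analysis step. The sketch below assumes the URLFilter extension
point (the same interface the bundled filters implement, where filter(url)
returns the URL to keep it and null to drop it); the package name, the
blocklist file name, and the plugin wiring (plugin.xml, plugin.includes) are
assumptions or omitted here.

// Sketch only: a URL-filter plugin that drops URLs listed in a plain-text
// blocklist, so pages marked as useless after one crawl are never generated
// or fetched again. The URLFilter interface shape (filter(url) returning the
// URL to accept or null to reject) and the conf/blocked-urls.txt file name are
// assumptions; adapt both to your Nutch version and plugin setup.
package org.example.nutch;                       // hypothetical package

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.nutch.net.URLFilter;           // the extension point the bundled filters implement

public class BlocklistURLFilter implements URLFilter {

    private final Set<String> blocked = new HashSet<String>();

    public BlocklistURLFilter() {
        // One URL per line; lines starting with # are comments.
        try (BufferedReader in = new BufferedReader(new FileReader("conf/blocked-urls.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith("#")) {
                    blocked.add(line);
                }
            }
        } catch (IOException e) {
            // No blocklist present yet: block nothing.
        }
    }

    /** Return the URL unchanged to accept it, or null to drop it from the fetchlist. */
    public String filter(String urlString) {
        return blocked.contains(urlString) ? null : urlString;
    }
}

The code that appends URLs to the blocklist after analysing fetched pages would
be your own post-fetch step; once the plugin is enabled, the generate step
simply stops emitting those URLs, whether or not they follow any regex-friendly
pattern.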





Re: is nutch recrawl possible?

2005-12-19 Thread Håvard W. Kongsgård
About this blocking: you can try using the URL filters and changing the filter
between each fetch/generate cycle, for example:


+^http://www.abc.com

-^http://www.bbc.co.uk
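
For anyone unsure how such +/- lines behave, here is a small self-contained
sketch of the usual semantics: rules are tried in order, the first regex that
matches decides accept or reject, and a URL that matches nothing is dropped.
Those semantics (including the drop-by-default) are assumptions about the regex
filter, so verify them against the plugin shipped with your Nutch version.

// Sketch only: evaluates +/- regex rules the way the regex URL filter is
// usually described: first matching rule wins, '+' accepts, '-' rejects, and a
// URL that matches no rule is dropped. Treat these semantics as assumptions to
// check against the real plugin and its crawl-urlfilter/regex-urlfilter file.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterRuleDemo {

    static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, Pattern pattern) { this.accept = accept; this.pattern = pattern; }
    }

    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<Rule>();
        for (String line : lines) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;   // skip blanks and comments
            rules.add(new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) return r.accept;     // first match decides
        }
        return false;                                               // nothing matched: drop (assumed default)
    }

    public static void main(String[] args) {
        List<Rule> rules = parse("+^http://www.abc.com", "-^http://www.bbc.co.uk");
        System.out.println(accepts(rules, "http://www.abc.com/products.html")); // true
        System.out.println(accepts(rules, "http://www.bbc.co.uk/news"));         // false
    }
}

Running the demo prints true for the www.abc.com URL and false for the
www.bbc.co.uk one, mirroring the two rules above.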


Pushpesh Kr. Rajwanshi wrote:


Oh, this is pretty good and quite helpful material, just what I wanted. Thanks,
Håvard; it seems this will help me write the code for the stuff I need :-)

Thanks and Regards,
Pushpesh



On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:
 


Try using the whole-web fetching method instead of the crawl method.

http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling

http://wiki.media-style.com/display/nutchDocu/quick+tutorial







Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hmmm... actually my requirement is a bit more complex than it seems, so URL
filters alone probably won't do. I am not filtering URLs based only on a domain
name; within a domain I want to discard some URLs, and since those URLs don't
follow a pattern, I can't use the URL filters; otherwise the URL filters would
have done a great job.

Thanks anyway,
Pushpesh





Re: is nutch recrawl possible?

2005-12-19 Thread Florent Gluck
Pushpesh,

We extended Nutch with a whitelist filter that you might find useful.
Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87;jsessionid=6F6AD5423357184CF57B51B003201C49?page=all

--Flo
