injecting URLs with '?'

2005-12-19 Thread Miguel A Paraz
Hi,
I'm indexing blog permalinks taken from a Roller Weblogger aggregator,
much like Technorati does. I noticed that 'inject' omits URLs containing
'?' - blog URLs like ?p=100 (WordPress) and ?m=100 (FeedBurner).

 How can I include these?


Re: injecting URLs with '?'

2005-12-19 Thread Stefan Groschupf

change:
NUTCH/conf/regex-urlfilter.txt
from:
[EMAIL PROTECTED]
to:
[EMAIL PROTECTED]
That's it.
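
For reference, the rule in the stock 0.7 conf/regex-urlfilter.txt that drops '?' URLs
is the character-class filter, so the change is presumably from

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

to

-[*!@=]

i.e. remove '?' from the set of characters that cause a URL to be skipped.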

Stefan


Am 19.12.2005 um 11:56 schrieb Miguel A Paraz:


Hi,
I'm indexing blog permalinks taken from a Roller Weblogger aggregator
- like how Technorati does it. I noticed that 'inject' omits URLs with
'?' - blog URLs like ?p=100 (Wordpress) and ?m=100 (Feedburner).

 How can I include these?





Re: nutch crawl fails with: org.apache.nutch.indexer.IndexingFilter does not exist.

2005-12-19 Thread Stephen Fitch
Hi Jérôme,

Many thanks for this email. I had found that I needed 'nutch-extensionpoints',
but with your explanation below I have a better understanding of why it is
needed.

Thanks once again.

Stephen

On 12/19/05, Jérôme Charron [EMAIL PROTECTED] wrote:

 nutch-extensionpoints is the plugin that defines all the Nutch standard
 extension points, i.e. all the other plugins have a dependency on it.
 So it is mandatory to include it in the list of activated plugins, or you
 must set the plugin.auto-activation property to true, so that when a
 plugin is activated, all its dependencies are automatically loaded.

 Jérôme
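
For reference, both settings Jérôme mentions live in conf/nutch-site.xml (overriding
nutch-default.xml). A minimal sketch - the plugin list here is only an illustrative
value, not the full default:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming the plugins to activate; nutch-extensionpoints
  must be matched unless auto-activation is enabled.</description>
</property>

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>If true, plugins that an activated plugin depends on are activated
  automatically.</description>
</property>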





is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hi,

I am crawling some sites using Nutch. My requirement is that when I run a
Nutch crawl, it should somehow be able to reuse the data in the webdb
populated by a previous crawl.

In other words, suppose my crawl is running and I cancel it somewhere in
the middle; is there some way I can resume the crawl?

I don't even know if I can do this at all; if there is some way, please
throw some light on this.

TIA

Regards,
Pushpesh


Re: is nutch recrawl possible?

2005-12-19 Thread Stefan Groschupf
It is difficult to answer your question since the vocabulary used may be
wrong. You can refetch pages, no problem, but you cannot continue a crashed
fetch process.
Nutch provides a tool that runs a set of steps: segment generation,
fetching, db updating, and so on.
So first try to run these steps manually instead of using the crawl
command. Then you will already get an idea of where you can jump in to
grab the data you need.


Stefan
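
The step-by-step tool chain Stefan refers to is the whole-web command set. A rough
sketch of one 0.7-era cycle (the seed file name and paths are examples; check each
tool's usage output for the exact flags in your version):

bin/nutch admin db -create           # create an empty webdb
bin/nutch inject db -urlfile urls.txt
bin/nutch generate db segments       # write a fetchlist into a new segment
s=`ls -d segments/2* | tail -1`      # newest segment directory
bin/nutch fetch $s
bin/nutch updatedb db $s             # fold fetched pages and new links back into the webdb
bin/nutch index $s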

Am 19.12.2005 um 14:46 schrieb Pushpesh Kr. Rajwanshi:


[ quoted message snipped ]




Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hi Stefan,

Thanks for the lightning-fast reply. I was amazed to see such a quick response -
I really appreciate it.

Actually, what I am really looking for is this: suppose I run a crawl for some
sites, say 5, to some depth, say 2. What I want is that the next time I run a
crawl, it should reuse the webdb contents it populated the first time.
(Assuming a successful crawl - yes, you are right that a suddenly broken crawl
won't work, as it has lost its data integrity.)

As you said, we can run the tools provided by Nutch to do the step-by-step
commands needed to crawl, but isn't there some way I can reuse the existing
crawl data? Maybe it involves changing code, but that's OK. Just one more quick
question: why does every crawl need a new directory, and why isn't there an
option to at least reuse the webdb? Maybe I am asking something silly, but I am
clueless :-(

Or, as you said, maybe what I can do is explore the steps you mentioned and
get what I need.

Thanks again,
Pushpesh


On 12/19/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

 [ quoted messages snipped ]




Re: is nutch recrawl possible?

2005-12-19 Thread Stefan Groschupf
Sorry, I still do not clearly understand your plans. However, pages from
the webdb are recrawled every 30 days (configurable in nutch-default.xml).
The new folders are so-called segments, and you can put them in the trash
after 30 days.
So what you can do is either never update your webdb with the fetched
segments, which will not add new URLs, or alternatively use a URL filter.

You will find a lot of posts in the mail archive regarding these issues.
Stefan
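
The 30-day interval mentioned above is db.default.fetch.interval in nutch-default.xml;
to change it, override it in nutch-site.xml (the value is in days), roughly:

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.</description>
</property>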
Am 19.12.2005 um 15:18 schrieb Pushpesh Kr. Rajwanshi:


[ quoted messages snipped ]







Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Actually, I wanted to reuse the processing I do in a particular crawl for
future crawls, so as to avoid downloading pages that are not of interest to me.

Here is an example:

1. Suppose I am crawling the http://www.abc.com website.
2. This gets injected into the webdb, and FetchListTool populates the fetchlist
in the segment directory from the webdb.
3. Then the Fetcher creates FetcherThreads, which download the content of each
page.
4. Once I download a page, I analyse it and may want to mark it as blocked
(because I find it useless for me), storing that information persistently, so
that when I crawl the same site next time it remembers that I blocked the page
and skips downloading that URL.

So basically it is like this:

I run a crawl and, out of a total of 100 pages, mark 60 pages as blocked.
After this crawl finishes, I run the same crawl again, but this time I want
those 60 URLs not to be downloaded, since I marked them as blocked. Can I do
this somewhere in Nutch? Maybe I could assign a very low or zero score to
these URLs and set my cut-off score above that. But the problem with the crawl
command is that every time I run it, it requires that the directory not already
exist, and hence my previous data can't be used.

But the steps you suggested seem valuable, and maybe I will have to write my
own CrawlTool to make it behave as I really need, so I think I have the clue
and just need to work it out.

Thanks for the valuable info and your precious time. I hope I am clearer this
time :-)

Regards
Pushpesh




For example, suppose I crawl the website www.abc.com, find some links in
it, and then assign my own score (I've already made code changes for this)
to the URLs found on the www.abc.com site before fetching their contents.


On 12/19/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

 [ quoted messages snipped ]




Re: is nutch recrawl possible?

2005-12-19 Thread Håvard W. Kongsgård
About this blocking: you can try the URL filters - change the filter
between each fetch/generate cycle, e.g.:


+^http://www.abc.com

-^http://www.bbc.co.uk
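
A complete per-round filter file along these lines would normally end with a
catch-all reject, so only what you explicitly allow gets fetched (a sketch, reusing
Håvard's example hosts):

# accept the site being crawled this round
+^http://www.abc.com
# skip this site for now
-^http://www.bbc.co.uk
# reject everything else
-.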


Pushpesh Kr. Rajwanshi wrote:


Oh this is pretty good and quite helpful material i wanted. Thanks Havard
for this. Seems like this will help me writing code for stuff i need :-)

Thanks and Regards,
Pushpesh


On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:

Try using the whole-web fetching method instead of the crawl method.

http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling

http://wiki.media-style.com/display/nutchDocu/quick+tutorial


Pushpesh Kr. Rajwanshi wrote:

[ earlier messages in this thread snipped ]


Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hmmm... actually my requirement is a bit more complex than it seems, so URL
filters alone probably won't do. I am not filtering URLs based only on a
domain name; within a domain I want to discard some URLs, and since they
don't actually follow a pattern I can't use URL filters - otherwise URL
filters would have done a great job.

Thanks anyway
Pushpesh


On 12/19/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:

 [ quoted messages snipped ]




build instructions?

2005-12-19 Thread Teruhiko Kurosaka
Where can I find the build instructions for Nutch?

Just typing ant ended with an error complaining that
there is no such directory as 
...\src\plugin\nutch-extensionpoints\src\java

This is Nutch 0.7.1 download and I'm trying to build
on Windows XP Professional with Cygwin and JDK 1.5.
(I tried JDK 1.4.1 but I saw the same failure.)

-Kuro


RE: build instructions?

2005-12-19 Thread Goldschmidt, Dave
Hello, I ran into the same problem (which I think is fixed in future
releases).  For Nutch 0.7.1, just create the missing directories and run
the ant script again.

HTH,
DaveG


-Original Message-
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 19, 2005 2:38 PM
To: nutch-user@lucene.apache.org
Subject: build instructions?

Where can I find the build instructions for Nutch?

Just typing ant ended with an error complaining that
there is no such directory as 
...\src\plugin\nutch-extensionpoints\src\java

This is Nutch 0.7.1 download and I'm trying to build
on Windows XP Professional with Cygwin and JDK 1.5.
(I tried JDK 1.4.1 but I saw the same failure.)

-Kuro


Re: build instructions?

2005-12-19 Thread Stefan Groschupf

This is a known bug. Just create an empty folder
...\src\plugin\nutch-extensionpoints\src\java
and it will work. This is fixed in the latest trunk, which you can check out
from Apache's Subversion server.


Stefan

Am 19.12.2005 um 20:38 schrieb Teruhiko Kurosaka:


Where can I find the build instructions for Nutch?

Just typing ant ended with an error complaining that
there is no such directory as
...\src\plugin\nutch-extensionpoints\src\java

This is Nutch 0.7.1 download and I'm trying to build
on Windows XP Professional with Cygwin and JDK 1.5.
(I tried JDK 1.4.1 but I saw the same failure.)

-Kuro



---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: build instructions?

2005-12-19 Thread Jed Reynolds
Teruhiko Kurosaka wrote:
 Where can I find the build instructions for Nutch?
 
 Just typing ant ended with an error complaining that
 there is no such directory as 
 ...\src\plugin\nutch-extensionpoints\src\java

mkdir -p  that directory and try again.
If you're tracking your build in a local CVS, it's handy to
add those dirs to your local CVS.
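
Concretely, using the path from the error message, the fix from an unpacked
nutch-0.7.1 directory is just:

mkdir -p src/plugin/nutch-extensionpoints/src/java
ant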

 
 This is Nutch 0.7.1 download and I'm trying to build
 on Windows XP Professional with Cygwin and JDK 1.5.
 (I tried JDK 1.4.1 but I saw the same failure.)
 
 -Kuro
 


-- 
Jed Reynolds
System Administrator, PRWeb International, Inc. 360-312-0892


Re: build instructions?

2005-12-19 Thread Piotr Kosiorowski
It is a known bug in the 0.7.1 distribution. You can get the sources
directly from svn and they build fine. It is also fixed in the upcoming
0.7.2 release and in trunk. Or you can fix it locally by creating the empty
src/java folder. I am not sure if it is the only empty folder missing in the
nutch-extensionpoints folder, but there should not be many of them.

Regards
Piotr
Teruhiko Kurosaka wrote:

Where can I find the build instructions for Nutch?

Just typing ant ended with an error complaining that
there is no such directory as 
...\src\plugin\nutch-extensionpoints\src\java


This is Nutch 0.7.1 download and I'm trying to build
on Windows XP Professional with Cygwin and JDK 1.5.
(I tried JDK 1.4.1 but I saw the same failure.)

-Kuro





RE: build instructions?

2005-12-19 Thread Teruhiko Kurosaka
Thank you, everybody.  I can build now! 

 -Original Message-
 From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] 
 Sent: December 19, 2005 11:42
 To: nutch-user@lucene.apache.org
 Subject: RE: build instructions?
 
 Hello, I ran into the same problem (which I think is fixed in future
 releases).  For Nutch 0.7.1, just create the missing 
 directories and run
 the ant script again.
 
 HTH,
 DaveG
 
 
 -Original Message-
 From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] 
 Sent: Monday, December 19, 2005 2:38 PM
 To: nutch-user@lucene.apache.org
 Subject: build instructions?
 
 Where can I find the build instructions for Nutch?
 
 Just typing ant ended with an error complaining that
 there is no such directory as 
 ...\src\plugin\nutch-extensionpoints\src\java
 
 This is Nutch 0.7.1 download and I'm trying to build
 on Windows XP Professional with Cygwin and JDK 1.5.
 (I tried JDK 1.4.1 but I saw the same failure.)
 
 -Kuro
 


Re: is nutch recrawl possible?

2005-12-19 Thread Florent Gluck
Pushpesh,

We extended nutch with a whitelist filter and you might find it useful. 
Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87;jsessionid=6F6AD5423357184CF57B51B003201C49?page=all

--Flo
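
For anyone who ends up writing such a filter, the 0.7-era extension point is
org.apache.nutch.net.URLFilter - as far as I recall, a single method that returns
the URL to keep it or null to drop it. A minimal sketch (the class name, package,
and block-list file are made up, and the plugin.xml wiring is omitted):

// Drops any URL listed, one per line, in blocked-urls.txt (illustrative only).
package org.example.nutch;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.nutch.net.URLFilter;

public class BlockedUrlFilter implements URLFilter {

  private final Set blocked = new HashSet();

  public BlockedUrlFilter() throws IOException {
    // load the block list written persistently by your page-analysis step
    BufferedReader in = new BufferedReader(new FileReader("blocked-urls.txt"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        blocked.add(line.trim());
      }
    } finally {
      in.close();
    }
  }

  // Return the URL to let it through, or null to filter it out.
  public String filter(String urlString) {
    return blocked.contains(urlString) ? null : urlString;
  }
}

Once wired in as a plugin, Nutch consults it wherever URL filters are applied.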

Pushpesh Kr. Rajwanshi wrote:

[ quoted messages snipped ]

Appropriate steps for mapred

2005-12-19 Thread Michael Taggart
I have followed the tutorial at media-style.com and actually have a
mapred installation of nutch working. Thanks Stefan :)
My question now is about the correct steps to continuously fetch and index. I
have read some people talking about mergesegs and updatedb; however, Stefan's
tutorial doesn't list these as steps. If you want to continually fetch more
and more levels from your crawldb and appropriately update your index, what
is the correct method for doing so?
Currently I am doing this:
generate
fetch
invertlinks
index

The only problem I am having is that I can't seem to get any pages past the
index pages of the root domains I injected. I feel like I am missing some
important steps. Any input is appreciated.
Mike


Re: Appropriate steps for mapred

2005-12-19 Thread Stefan Groschupf

Stefan's tutorial doesn't list these as steps.

I will hopefully add these steps before the end of the year.

If you want to
continually fetch more and more levels from your crawldb and
appropriately update your index what is the correct method for  
doing so?

Currently I am doing this:
generate
fetch
invertlinks
index


Looks like you missed updating the crawldb after fetching, but in
general that is the way to go.
You can run this cycle 10 times or more :). I suggest using big enough
segments and later merging some of the indexes together.

Just play around and try it out.

The size of the segments and how many segment indexes you should merge
depend very much on your hardware.
Also note that searching an index stored on NDFS is slow, but there
should be a solution for that in the next few weeks or so.


HTH
Stefan
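
For reference, one full cycle with the mapred-era tools looks roughly like this
(paths are examples and the argument order follows the 0.8 tutorial, so check each
tool's usage output against your checkout):

bin/nutch generate crawl/crawldb crawl/segments
s=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s              # the step missing from the list above
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*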




Multiple anchors on same site - what's better than making these unique?

2005-12-19 Thread David Wallace
Hi all,
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code.  I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.  
 
I'm using Nutch for an intranet style search engine, on a single
site, so I don't really care about the uniqueness by domain.  However, I
can't help thinking that the uniqueness by anchor text probably isn't
what I want.
 
Suppose my site has 3 pages with links to page X, and the same anchor
text.  I'd kind of like to score page X higher than a page where there's
only one incoming link with that anchor text.  But I don't want to have
this effect swamping the other calculations of page score.  In other
words, if my site has 1000 pages with links to page X, this page should
score a wee bit higher than a similar page with just one incoming link,
but not 1000 times higher.
 
I'm thinking of doing some maths with the number of repetitions of an
anchor, then including the result in the page score.  Something like
log(10+n), or maybe n/(n+2); where n is the number of incoming links
with the same anchor text.  Either of these formulas would make 1000
incoming links score roughly 3 times higher than a single incoming link,
which seems about right to me.
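
(A quick check of that claim, with base-10 logs: log(10+1) is about 1.04 and
log(10+1000) is about 3.00, a ratio of roughly 2.9; n/(n+2) gives 1/3 = 0.33
versus 1000/1002 = 0.998, a ratio of roughly 3.0. So 1000 identical anchors would
indeed score about three times a single one under either formula.)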
 
It looks to me like I'm going to have to make changes deep within the
Lucene page scoring stuff to do this, which I'm not really looking
forward to.  I'd really welcome hearing if anybody has a better solution
to this general problem.  The exact maths isn't too critical.  What's
important is that for small values of n, the page score must increase as
n increases, but the overall effect must diminish as n gets really
large.
 
Thanks in advance,
David.






Re: Multiple anchors on same site - what's better than making these unique?

2005-12-19 Thread Stefan Groschupf

Hi,
did you try setting

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

to false?

Stefan

Am 20.12.2005 um 00:49 schrieb David Wallace:


[ quoted message snipped ]


---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: Multiple anchors on same site - what's better than making these unique?

2005-12-19 Thread David Wallace
Thank you Stefan, for your speedy response.
 
I have indeed changed that setting to false.  However, that doesn't
deal with my problem.  The offending method is getAnchors in
org.apache.nutch.db.WebDBAnchors, which is called from
org.apache.nutch.tools.FetchListTool.  This method makes the array of
anchors unique, for the FetchListEntry (unless of course, the incoming
links are from different domains); and does so regardless of any
NutchConf setting.
 
If I changed the WebDBAnchors class, in order to disable this
uniqueness; I'd then need to incorporate some kind of numerical fudging
into the scoring.  This is to prevent the scores being badly skewed in
the cases where I have a page with a large number of incoming links, all
with the same anchor text.  This is likely to occur for pages that have
links in my site's navigation chrome, for example.
 
I suspect I shall have to bite the bullet, and start studying Lucene's
internal mathematics.
 
Regards,
David.
 
Stefan Groschupf wrote:

Hi,
did you tried...

<property>
   <name>db.ignore.internal.links</name>
   <value>true</value>
   <description>If true, when adding new links to a page, links from
   the same host are ignored.  This is an effective way to limit the
   size of the link database, keeping the only the highest quality
   links.
   </description>
</property>

... setting to false?

Stefan

Am 20.12.2005 um 00:49 schrieb David Wallace:

 Hi all,
 I've been grubbing around with Nutch for a while now, although I'm
 still working with 0.7 code.  I notice that when anchors are
collected
 for a document, they're made unique by domain and by anchor text.

[ some snipped ]







Re: How to recrawl urls

2005-12-19 Thread Kumar Limbu
Hi Nguyen,

Thank you for your information, but I would like to confirm it. I do see a
variable that defines the next fetch interval, but I am not sure about it.
If anyone has more information in this regard, please let me know.

Thank you in advance,




On 12/19/05, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote:

 As I understand, by default, all links in Nutch are recrawled after 30
 days, as long as your Nutch process is still running. FetchListTool takes
 care of this setting. So maybe you can write a script (and put it in
 cron?)
 to reactivate the crawler.

 Regards,
   Giang


 On 12/19/05, Kumar Limbu [EMAIL PROTECTED] wrote:
 
  Hi everyone,
 
  I have browsed through the nutch documentation but I have not found
 enough
  information on how to recrawl the urls that I have already crawled. Do
 we
  have to do a recrawling ourselves or the nutch application will do it?
 
  More information on this regard will be highly appreciated. Thank you
 very
  much.
 
  --
  Keep on smiling :) Kumar
 
 




--
Keep on smiling :) Kumar


Re: How to recrawl urls

2005-12-19 Thread Nguyen Ngoc Giang
  The scheme of intranet crawling is like this: first, you create a webdb
using WebDBAdminTool. After that, you inject a seed URL using WebDBInjector.
The seed URL is inserted into your webdb, marked with the current date and time.
Then you create a fetchlist using FetchListTool. FetchListTool reads all URLs
in the webdb that are due to be crawled and puts them in the fetchlist.
Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is
finished, UpdateDatabaseTool extracts all outlinks and puts them into the webdb.
Newly extracted outlinks have their date and time set to the current date and
time, while all just-crawled URLs have their date and time set 30 days ahead
(this actually happens in FetchListTool). So the extracted links will be
crawled the next time around, but not the just-crawled URLs, and so on and
so forth.

  Therefore, as long as the crawler is still alive after 30 days (or whatever
threshold you set), all the just-crawled URLs will be taken out to recrawl
again. That's why we need to keep a live crawler around at that time. This
could be done using a cron job, I think.

  Regards,
   Giang
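
A sketch of how that recurring part is often automated (script name, paths, and
schedule are made up; the cycle itself is the same generate/fetch/updatedb sequence
discussed in the recrawl thread above):

# crontab entry: re-run the fetch cycle nightly so pages that are due get picked up
0 3 * * * cd /opt/nutch && ./recrawl.sh >> logs/recrawl.log 2>&1

# recrawl.sh: generate only emits pages whose fetch time has come due
bin/nutch generate db segments
s=`ls -d segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb db $s
bin/nutch index $s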



On 12/20/05, Kumar Limbu [EMAIL PROTECTED] wrote:

 [ quoted messages snipped ]




Does Search Result Show Similar Pages Like Google?

2005-12-19 Thread Victor Lee
 Hi,
Do Nutch's search results show similar pages, like Google does? I went to
Modzex.com, which is using Nutch, but I don't see similar pages in its
search results.
 
 Many thanks.
 

__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com