injecting URLs with '?'
Hi, I'm indexing blog permalinks taken from a Roller Weblogger aggregator, much like Technorati does. I noticed that 'inject' omits URLs containing '?' - blog URLs like ?p=100 (WordPress) and ?m=100 (FeedBurner). How can I include these?
Re: injecting URLs with '?'
Change the rule in NUTCH/conf/regex-urlfilter.txt that skips URLs containing '?'. That's it.

Stefan

On 19.12.2005 at 11:56, Miguel A Paraz wrote: [...]
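For reference, a minimal sketch of the kind of edit Stefan is describing, assuming the stock conf/regex-urlfilter.txt that ships with Nutch 0.7 (the exact before/after lines were not preserved in this message):

  # stock rule: skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # relaxed rule: keep skipping *, ! and @, but let '?' and '=' through so
  # query-style permalinks such as ?p=100 pass the filter
  -[*!@]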
Re: nutch crawl fails with: org.apache.nutch.indexer.IndexingFilter does not exist.
Hi Jérôme,

Many thanks for this email. I had found that I needed 'nutch-extensionpoints', but with your explanation below I have a better understanding of why it is needed. Thanks once again.

Stephen

On 12/19/05, Jérôme Charron wrote:
nutch-extensionpoints is the plugin that defines all the standard Nutch extension points, i.e. all the other plugins have a dependency on it. So it is mandatory to include it in the list of activated plugins, or you must set the plugin.auto-activation property to true, so that when a plugin is activated all of its dependencies are loaded automatically.

Jérôme
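As a concrete illustration of Jérôme's point, a sketch of the two settings involved, as they could be overridden in conf/nutch-site.xml (the plugin list shown is only an example, not the full default value):

  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    <description>Regular expression naming the plugins to activate;
    nutch-extensionpoints should stay in this list.</description>
  </property>

  <property>
    <name>plugin.auto-activation</name>
    <value>true</value>
    <description>If true, plugins that an activated plugin depends on are
    loaded automatically even when not listed in plugin.includes.</description>
  </property>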
is nutch recrawl possible?
Hi,

I am crawling some sites using Nutch. My requirement is that when I run a Nutch crawl, it should somehow be able to reuse the data in the webdb populated by a previous crawl. In other words: suppose my crawl is running and I cancel it somewhere in the middle - is there some way I can resume the crawl? I don't know whether I can do this at all; if there is some way, please throw some light on this.

TIA
Regards,
Pushpesh
Re: is nutch recrawl possible?
It is difficult to answer your question since the vocabulary used may be wrong. You can refetch pages, no problem, but you cannot continue a crashed fetch process.

Nutch's crawl command is a tool that runs a set of steps: segment generation, fetching, db updating, etc. So maybe first try to run these steps manually instead of using the crawl command. Then you will already get an idea of where you can jump in to grab the data you need.

Stefan

On 19.12.2005 at 14:46, Pushpesh Kr. Rajwanshi wrote: [...]
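For anyone wanting to try the step-by-step route Stefan describes, a rough sketch of the 0.7-era command sequence (the db/ and segments/ directory names and the urls seed file are placeholders; see the whole-web tutorial for the exact options):

  bin/nutch admin db -create          # create an empty webdb
  bin/nutch inject db -urlfile urls   # seed it with start URLs
  bin/nutch generate db segments      # write a fetchlist into a new segment
  s=`ls -d segments/2* | tail -1`     # pick the newest segment
  bin/nutch fetch $s                  # fetch the pages in that fetchlist
  bin/nutch updatedb db $s            # add newly discovered links to the webdb
  bin/nutch index $s                  # index the fetched segment
  # repeat generate / fetch / updatedb / index for each additional level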
Re: is nutch recrawl possible?
Hi Stefan,

Thanks for the lightning-fast reply - I was amazed to see such a quick response and really appreciate it.

What I am really looking for is this: suppose I run a crawl for some sites, say 5, and for some depth, say 2. The next time I run a crawl, I want it to reuse the webdb contents it populated the first time. (Assuming a successful crawl - you are right that a crawl that broke down suddenly won't work, as it has lost its data integrity.)

As you said, we can run the tools provided by Nutch to do the crawl step by step, but isn't there some way I can reuse the existing crawl data? Maybe it involves changing code, but that's OK. Just one more quick question: why does every crawl need a new directory, with no option to at least reuse the webdb? Maybe I am asking something silly, but I am clueless :-( Or, as you said, maybe what I can do is explore the steps you mentioned and get what I need.

Thanks again,
Pushpesh

On 12/19/05, Stefan Groschupf wrote: [...]
Re: is nutch recrawl possible?
Sorry, I still don't clearly understand your plans. However, pages in the webdb are recrawled every 30 days (configurable in nutch-default.xml). The new folders are the so-called segments, and you can put a segment in the trash once it is older than 30 days.

So what you can do is either never update your webdb with the fetched segments - that way no new URLs are added - or, alternatively, use a URL filter. You will find a lot of posts in the mail archive about these issues.

Stefan

On 19.12.2005 at 15:18, Pushpesh Kr. Rajwanshi wrote: [...]
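The 30-day interval Stefan mentions is controlled by a property you can override in conf/nutch-site.xml; a sketch, assuming the 0.7-era property name:

  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
    <description>Default number of days between re-fetches of a page.</description>
  </property>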
Re: is nutch recrawl possible?
What I actually want is to reuse the processing I do in a particular crawl for future crawls, so as to avoid downloading pages that are not of interest to me. Here is an example:

1. Suppose I am crawling the http://www.abc.com website.
2. This gets injected into the webdb, and the FetchListTool populates a fetchlist in the segment dir from the webdb.
3. The Fetcher then creates FetcherThreads which download the content of each page.
4. Once I download a page, I analyse it and may want to mark it as blocked (because I find it useless for me) and store this information persistently, so that when I crawl the same site next time it remembers that I blocked it and skips downloading that URL.

So basically it's like this: I run a crawl, and suppose that out of 100 pages I mark 60 as blocked. After this crawl finishes, I run the same crawl again, but this time I want those 60 URLs not to be downloaded, since I marked them as blocked. Can I do this somewhere in Nutch? Maybe I could assign a very low or zero score to these URLs and set my cut-off score above that. (I have already made code changes to assign my own score to the URLs found on a site before fetching the contents.) But the problem with the crawl command is that every time I run it, it requires that the directory not already exist, so my previous data can't be used.

I think the steps you suggested are valuable, and maybe I will have to write my own CrawlTool to make it behave the way I need, so I think I have the clue and just need to work it out. Thanks for the valuable info and your precious time. Hope I am clearer this time :-)

Regards,
Pushpesh

On 12/19/05, Stefan Groschupf wrote: [...]
Re: is nutch recrawl possible?
For this blocking you can try to use the URL filters - change the filter between each generate/fetch round, e.g.:

+^http://www.abc.com
-^http://www.bbc.co.uk

Pushpesh Kr. Rajwanshi wrote:
Oh, this is pretty good and quite helpful material - just what I wanted. Thanks, Havard. It seems like this will help me write the code for the stuff I need :-)

Thanks and Regards,
Pushpesh

On 12/19/05, Håvard W. Kongsgård wrote:
Try using the whole-web fetching method instead of the crawl method.
http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling
http://wiki.media-style.com/display/nutchDocu/quick+tutorial

Pushpesh Kr. Rajwanshi wrote: [...]
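A small sketch of what Håvard means by swapping the filter between rounds, assuming the regex-urlfilter plugin is active (rules are applied top to bottom and the first matching pattern decides; '+' keeps a URL, '-' drops it):

  # conf/regex-urlfilter.txt - version used for one generate/fetch round
  -^http://www.bbc.co.uk
  +^http://www.abc.com
  -.

  # before the next round, edit this file to change what is blocked, then
  # run generate again so the new fetchlist respects the new rules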
Re: is nutch recrawl possible?
Hmmm... actually my requirement is a bit more complex than it seems, so URL filters alone probably won't do. I am not filtering URLs based only on the domain name; within a domain I want to discard some URLs, and since they don't follow a pattern I can't use URL filters - otherwise URL filters would have done a great job. Thanks anyway.

Pushpesh

On 12/19/05, Håvard W. Kongsgård wrote: [...]
build instructions?
Where can I find the build instructions for Nutch? Just typing 'ant' ended with an error complaining that there is no such directory as ...\src\plugin\nutch-extensionpoints\src\java

This is the Nutch 0.7.1 download, and I'm trying to build on Windows XP Professional with Cygwin and JDK 1.5. (I tried JDK 1.4.1 but saw the same failure.)

-Kuro
RE: build instructions?
Hello,

I ran into the same problem (which I think is fixed in future releases). For Nutch 0.7.1, just create the missing directories and run the ant script again.

HTH,
DaveG

-----Original Message-----
From: Teruhiko Kurosaka
Sent: Monday, December 19, 2005 2:38 PM
To: nutch-user@lucene.apache.org
Subject: build instructions?
[...]
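For the record, the workaround amounts to this, run from the top of the unpacked 0.7.1 source tree (Cygwin's mkdir supports -p):

  cd nutch-0.7.1
  mkdir -p src/plugin/nutch-extensionpoints/src/java
  ant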
Re: build instructions?
This is a known bug. Just create an empty folder ...\src\plugin\nutch-extensionpoints\src\java and it will work. This is fixed in the latest trunk, which you can check out from Apache's Subversion server.

Stefan

On 19.12.2005 at 20:38, Teruhiko Kurosaka wrote: [...]
Re: build instructions?
Teruhiko Kurosaka wrote:
Where can I find the build instructions for Nutch? Just typing 'ant' ended with an error complaining that there is no such directory as ...\src\plugin\nutch-extensionpoints\src\java

mkdir -p that directory and try again. If you're tracking your build in a local CVS, it's handy to add those dirs to your local CVS.

--
Jed Reynolds
System Administrator, PRWeb International, Inc.
360-312-0892
Re: build instructions?
It is a known bug in the 0.7.1 distribution. You can get the sources directly from svn, and they build fine. It is also fixed in preparation for the 0.7.2 release and in trunk. Or you can fix it locally by creating an empty src/java folder. I am not sure whether that is the only empty folder missing under nutch-extensionpoints, but there should not be many of them.

Regards,
Piotr

Teruhiko Kurosaka wrote: [...]
RE: build instructions?
Thank you, everybody. I can build now!

-----Original Message-----
From: Goldschmidt, Dave
Sent: December 19, 2005 11:42
To: nutch-user@lucene.apache.org
Subject: RE: build instructions?
[...]
Re: is nutch recrawl possible?
Pushpesh,

We extended Nutch with a whitelist filter, and you might find it useful. Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87;jsessionid=6F6AD5423357184CF57B51B003201C49?page=all

--Flo

Pushpesh Kr. Rajwanshi wrote: [...]
Appropriate steps for mapred
I have followed the tutorial at media-style.com and actually have a mapred installation of Nutch working. Thanks, Stefan :)

My question now is about the correct steps to continuously fetch and index. I have read some people talking about mergesegs and updatedb; however, Stefan's tutorial doesn't list these as steps. If you want to continually fetch more and more levels from your crawldb and update your index appropriately, what is the correct method for doing so? Currently I am doing this: generate, fetch, invertlinks, index. The only problem I am having is that I don't seem to be able to get any pages beyond the index pages of the root domains I injected. I feel like I am missing some important steps. Any input is appreciated.

Mike
Re: Appropriate steps for mapred
Stefan's tutorial doesn't list these as steps.

I will hopefully add these steps before the end of the year.

If you want to continually fetch more and more levels from your crawldb and appropriately update your index, what is the correct method for doing so? Currently I am doing this: generate, fetch, invertlinks, index.

It looks like you missed updating the crawldb after fetching, but in general that is the way to go. You can run this cycle 10 times or more :). I suggest using a big enough segment size and later merging some indexes together. Just play around and try it out. The segment size, and how many segment indexes you should merge, depend very much on your hardware. Also note that searching an index stored on NDFS is slow, but there should be a solution for that within the next few weeks or so.

HTH,
Stefan
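Putting Stefan's correction together with Mike's list, a rough sketch of one full cycle on the mapred branch (crawldb, segments, linkdb and indexes are whatever directory names you are using; check the command help for exact arguments, since the branch was still moving at this time):

  bin/nutch generate crawldb segments         # build a fetchlist
  s=`ls -d segments/2* | tail -1`             # pick the newest segment
  bin/nutch fetch $s                          # fetch it
  bin/nutch updatedb crawldb $s               # the missing step: feed new links back into the crawldb
  bin/nutch invertlinks linkdb $s             # update the link database
  bin/nutch index indexes crawldb linkdb $s   # index the segment
  # repeat from generate to go another level deeper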
Multiple anchors on same site - what's better than making these unique?
Hi all,

I've been grubbing around with Nutch for a while now, although I'm still working with 0.7 code. I notice that when anchors are collected for a document, they're made unique by domain and by anchor text. I'm using Nutch for an intranet-style search engine, on a single site, so I don't really care about the uniqueness by domain. However, I can't help thinking that the uniqueness by anchor text probably isn't what I want.

Suppose my site has 3 pages with links to page X, and the same anchor text. I'd kind of like to score page X higher than a page where there's only one incoming link with that anchor text. But I don't want this effect swamping the other calculations of page score. In other words, if my site has 1000 pages with links to page X, this page should score a wee bit higher than a similar page with just one incoming link, but not 1000 times higher.

I'm thinking of doing some maths with the number of repetitions of an anchor, then including the result in the page score. Something like log(10+n), or maybe n/(n+2), where n is the number of incoming links with the same anchor text. Either of these formulas would make 1000 incoming links score roughly 3 times higher than a single incoming link, which seems about right to me.

It looks to me like I'm going to have to make changes deep within the Lucene page scoring stuff to do this, which I'm not really looking forward to. I'd really welcome hearing if anybody has a better solution to this general problem. The exact maths isn't too critical. What's important is that for small values of n, the page score must increase as n increases, but the overall effect must diminish as n gets really large.

Thanks in advance,
David.
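To make the "roughly 3 times" figure concrete, here is the arithmetic for the two candidate formulas, using base-10 logarithms, with n the number of incoming links sharing the same anchor text:

  log(10 + n):  n = 1    -> log(11)    ~ 1.04
                n = 1000 -> log(1010)  ~ 3.00   (about 2.9x the n = 1 value)

  n / (n + 2):  n = 1    -> 1/3        ~ 0.33
                n = 1000 -> 1000/1002  ~ 1.00   (about 3x the n = 1 value)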
Re: Multiple anchors on same site - what's better than making these unique?
Hi,

did you try setting this to false?

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links.</description>
</property>

Stefan

On 20.12.2005 at 00:49, David Wallace wrote: [...]
Re: Multiple anchors on same site - what's better than making these unique?
Thank you, Stefan, for your speedy response. I have indeed changed that setting to false. However, that doesn't deal with my problem.

The offending method is getAnchors in org.apache.nutch.db.WebDBAnchors, which is called from org.apache.nutch.tools.FetchListTool. This method makes the array of anchors unique for the FetchListEntry (unless, of course, the incoming links are from different domains), and does so regardless of any NutchConf setting.

If I changed the WebDBAnchors class to disable this uniqueness, I'd then need to incorporate some kind of numerical fudging into the scoring. This is to prevent the scores being badly skewed in the cases where I have a page with a large number of incoming links, all with the same anchor text. This is likely to occur for pages that have links in my site's navigation chrome, for example. I suspect I shall have to bite the bullet and start studying Lucene's internal mathematics.

Regards,
David.

Stefan Groschupf wrote: [...]
Re: How to recrawl urls
Hi Nguyen,

Thank you for your information, but I would like to confirm it. I do see a variable that defines the next fetch interval, but I am not sure about it. If anyone has more information in this regard, please let me know.

Thank you in advance,

On 12/19/05, Nguyen Ngoc Giang wrote:
As I understand it, by default all links in Nutch are recrawled after 30 days, as long as your Nutch process is still running. FetchListTool takes care of this setting. So maybe you can write a script (and put it in cron?) to reactivate the crawler.

Regards,
Giang

On 12/19/05, Kumar Limbu wrote:
Hi everyone, I have browsed through the Nutch documentation but have not found enough information on how to recrawl the URLs that I have already crawled. Do we have to do the recrawling ourselves, or will the Nutch application do it? More information in this regard would be highly appreciated. Thank you very much.

--
Keep on smiling :)
Kumar
Re: How to recrawl urls
The scheme of intranet crawling is like this: First, you create a webdb using WebDBAdminTool. After that, you inject a seed URL using WebDBInjector. The seed URL is inserted into your webdb, marked with the current date and time. Then you create a fetchlist using FetchListTool. The FetchListTool reads all URLs in the webdb which are due to be crawled and puts them into the fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is finished, UpdateDatabaseTool extracts all outlinks and puts them into the webdb. Newly extracted outlinks get the current date and time, while the just-crawled URLs have their date and time pushed 30 days into the future (this actually happens in FetchListTool). So the newly extracted links will be crawled next time, but not the just-crawled URLs - and so on and so forth.

Therefore, as long as the crawler is still alive after 30 days (or whatever threshold you set), all previously crawled URLs will be taken out and recrawled. That's why we need to keep a live crawler running at that time. This could be done using a cron job, I think.

Regards,
Giang

On 12/20/05, Kumar Limbu wrote: [...]
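A sketch of what keeping a live crawler running via cron could look like with the 0.7-era tools (the paths, the recrawl.sh name and the schedule are all hypothetical; FetchListTool only puts URLs into the fetchlist once their next-fetch time has passed):

  # recrawl.sh
  cd /opt/nutch
  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s
  bin/nutch index $s

  # example crontab entry: run every night at 02:00
  # 0 2 * * * /opt/nutch/recrawl.sh >> /opt/nutch/logs/recrawl.log 2>&1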
Does Search Result Show Similar Pages Like Google?
Hi, does Nutch's search result show similar pages, like Google does? I went to Modzex.com, which is using Nutch, but I don't see similar pages in its search results. Many thanks.