injecting URLs with '?'

2005-12-19 Thread Miguel A Paraz
Hi, I'm indexing blog permalinks taken from a Roller Weblogger aggregator - like how Technorati does it. I noticed that 'inject' omits URLs containing '?' - blog URLs like ?p=100 (WordPress) and ?m=100 (Feedburner). How can I include these?

Re: injecting URLs with '?'

2005-12-19 Thread Stefan Groschupf
Change NUTCH/conf/regex-urlfilter.txt from: [EMAIL PROTECTED] to: [EMAIL PROTECTED] That's it. Stefan On 19.12.2005 at 11:56, Miguel A Paraz wrote: Hi, I'm indexing blog permalinks taken from a Roller Weblogger aggregator - like how Technorati does it. I noticed that 'inject' omits URLs
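The archive has masked the actual filter lines in Stefan's reply, so they must stay redacted here. For reference only: the stock regex-urlfilter.txt in Nutch 0.7 contains a rule along the following lines that drops any URL containing a '?'; the second line is a hypothetical replacement that would let query-string permalinks through. Verify against your own copy of the file before editing.

```
# default rule: skip URLs containing these characters (incl. '?')
-[?*!@=]
# hypothetical replacement that accepts '?' permalinks such as ?p=100
-[*!@=]
```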

Re: nutch crawl fails with: org.apache.nutch.indexer.IndexingFilter does not exist.

2005-12-19 Thread Stephen Fitch
Hi Jérôme, Many thanks for this email. I had found I needed 'nutch-extensionpoints', but with your explanation below I have a better understanding of the reason it is needed. Thanks once again. Stephen On 12/19/05, Jérôme Charron [EMAIL PROTECTED] wrote: nutch-extensionpoints is the plugin

is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hi, I am crawling some sites using nutch. My requirement is that when I run a nutch crawl, it should somehow be able to reuse the data in the webdb populated by a previous crawl. In other words, my question is: suppose my crawl is running and I cancel it somewhere in the middle; is there some way I can

Re: is nutch recrawl possible?

2005-12-19 Thread Stefan Groschupf
It is difficult to answer your question since the vocabulary used may be wrong. You can refetch pages, no problem, but you cannot resume a crashed fetch process. Nutch provides a tool that runs a set of steps: segment generation, fetching, db updating, etc. So maybe first try to run

Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hi Stefan, Thanks for the lightning-fast reply. I was amazed to see such a quick response; really appreciate it. Actually, what I am really looking for is this: suppose I run a crawl for some sites, say 5, and for some depth, say 2. Then what I want is that the next time I run a crawl it should reuse the webdb

Re: is nutch recrawl possible?

2005-12-19 Thread Stefan Groschupf
Still do not clearly understand your plans, sorry. However, pages from the webdb are recrawled every 30 days (configurable in nutch-default.xml). The new folders are so-called segments and you can put them in the trash after 30 days. So what you can do is first never update your webdb
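The 30-day interval Stefan mentions comes from nutch-default.xml and can be overridden in nutch-site.xml. A sketch of the relevant property (description wording approximate; check your own nutch-default.xml):

```xml
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.</description>
</property>
```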

Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Actually I wanted to reuse the processing I do in a particular crawl for future crawls, so as to avoid downloading pages which are not of interest to me. Here is an example: 1. Suppose I am crawling the http://www.abc.com website. 2. Then this gets injected into the webdb and the FetchListTool populates

Re: is nutch recrawl possible?

2005-12-19 Thread Håvard W. Kongsgård
About this blocking: you can try to use the urlfilters, changing the filter between each fetch/generate:
+^http://www.abc.com
-^http://www.bbc.co.uk
Pushpesh Kr. Rajwanshi wrote: Oh, this is pretty good and quite helpful material, exactly what I wanted. Thanks Havard for this. Seems like this will help me
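For readers unfamiliar with Nutch's regex filters: rules are applied top to bottom and the first match decides, with '+' keeping the URL and '-' dropping it. A minimal Python sketch of that behaviour, using the two rules from Håvard's example (the reject-on-no-match fallback is a simplification of what the stock filter file achieves with its final catch-all rule):

```python
import re

# Illustrative rule list echoing the thread's example: ("+", pattern)
# keeps a matching URL, ("-", pattern) drops it; first match wins.
RULES = [
    ("+", re.compile(r"^http://www\.abc\.com")),
    ("-", re.compile(r"^http://www\.bbc\.co\.uk")),
]

def url_filter(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject (simplified fallback)

print(url_filter("http://www.abc.com/news"))  # True
print(url_filter("http://www.bbc.co.uk/"))    # False
```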

Re: is nutch recrawl possible?

2005-12-19 Thread Pushpesh Kr. Rajwanshi
Hmmm... actually my requirement is a bit more complex than it seems, so URL filters alone probably won't do. I am not filtering URLs based only on some domain name; within a domain I want to discard some URLs, and since they don't actually follow a pattern I can't use URL filters

build instructions?

2005-12-19 Thread Teruhiko Kurosaka
Where can I find the build instructions for Nutch? Just typing ant ended with an error complaining that there is no such directory as ...\src\plugin\nutch-extensionpoints\src\java This is Nutch 0.7.1 download and I'm trying to build on Windows XP Professional with Cygwin and JDK 1.5. (I tried

RE: build instructions?

2005-12-19 Thread Goldschmidt, Dave
Hello, I ran into the same problem (which I think is fixed in future releases). For Nutch 0.7.1, just create the missing directories and run the ant script again. HTH, DaveG -Original Message- From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] Sent: Monday, December 19, 2005 2:38 PM

Re: build instructions?

2005-12-19 Thread Stefan Groschupf
This is a known bug. Just create an empty folder ...\src\plugin\nutch-extensionpoints\src\java and it will work. This is fixed in the latest trunk, which you can check out from Apache's Subversion server. Stefan On 19.12.2005 at 20:38, Teruhiko Kurosaka wrote: Where can I find the build

Re: build instructions?

2005-12-19 Thread Jed Reynolds
Teruhiko Kurosaka wrote: Where can I find the build instructions for Nutch? Just typing ant ended with an error complaining that there is no such directory as ...\src\plugin\nutch-extensionpoints\src\java mkdir -p that directory and try again. If you're tracking your build in a local CVS,

Re: build instructions?

2005-12-19 Thread Piotr Kosiorowski
It is a known bug in the 0.7.1 distribution. You can get the sources directly from svn and they build fine. It is also fixed in preparation for the 0.7.2 release and in trunk. Or you can fix it locally by creating an empty src/java folder. I am not sure if it is the only empty folder missing in
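The local workaround suggested in this thread, as a short script: create the empty directory the 0.7.1 ant build expects, then re-run ant. NUTCH_HOME is an assumption here; point it at your unpacked 0.7.1 tree.

```shell
#!/bin/sh
# Create the folder the 0.7.1 tarball is missing, so ant can proceed.
NUTCH_HOME="${NUTCH_HOME:-.}"
mkdir -p "$NUTCH_HOME/src/plugin/nutch-extensionpoints/src/java"
# then: cd "$NUTCH_HOME" && ant
```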

RE: build instructions?

2005-12-19 Thread Teruhiko Kurosaka
Thank you, everybody. I can build now! -Original Message- From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED] Sent: December 19, 2005 11:42 To: nutch-user@lucene.apache.org Subject: RE: build instructions? Hello, I ran into the same problem (which I think is fixed in future releases).

Re: is nutch recrawl possible?

2005-12-19 Thread Florent Gluck
Pushpesh, We extended nutch with a whitelist filter and you might find it useful. Check the comments from Matt Kangas here: http://issues.apache.org/jira/browse/NUTCH-87?page=all --Flo Pushpesh Kr. Rajwanshi wrote: hmmm... actually my requirement is

Appropriate steps for mapred

2005-12-19 Thread Michael Taggart
I have followed the tutorial at media-style.com and actually have a mapred installation of nutch working. Thanks Stefan :) My question now is about the correct steps to continuously fetch and index. I have read some people talking about mergesegs and updatedb; however, Stefan's tutorial doesn't list these

Re: Appropriate steps for mapred

2005-12-19 Thread Stefan Groschupf
Stefan's tutorial doesn't list these as steps. I will add these steps, hopefully before the end of this year. If you want to continually fetch more and more levels from your crawldb and update your index appropriately, what is the correct method for doing so? Currently I am doing this: generate fetch
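A dry-run sketch of one fetch cycle on the mapred branch, assembled from the tools named in this thread (generate, fetch, updatedb, link inversion, index). The paths, the `<newest>` placeholder, and the exact argument order are assumptions; check `bin/nutch`'s usage output for your build before running anything.

```shell
#!/bin/sh
# Print (not execute) one illustrative crawl cycle.
NUTCH="bin/nutch"
CRAWL="crawl"
echo "$NUTCH generate $CRAWL/crawldb $CRAWL/segments"
echo "$NUTCH fetch $CRAWL/segments/<newest>"
echo "$NUTCH updatedb $CRAWL/crawldb $CRAWL/segments/<newest>"
echo "$NUTCH invertlinks $CRAWL/linkdb $CRAWL/segments"
echo "$NUTCH index $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/<newest>"
```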

Multiple anchors on same site - what's better than making these unique?

2005-12-19 Thread David Wallace
Hi all, I've been grubbing around with Nutch for a while now, although I'm still working with 0.7 code. I notice that when anchors are collected for a document, they're made unique by domain and by anchor text. I'm using Nutch for an intranet style search engine, on a single site, so I don't

Re: Multiple anchors on same site - what's better than making these unique?

2005-12-19 Thread Stefan Groschupf
Hi, did you try...
<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links.</description>
</property>

Re: Multiple anchors on same site - what's better than making these unique?

2005-12-19 Thread David Wallace
Thank you Stefan, for your speedy response. I have indeed changed that setting to false. However, that doesn't deal with my problem. The offending method is getAnchors in org.apache.nutch.db.WebDBAnchors, which is called from org.apache.nutch.tools.FetchListTool. This method makes the array
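For readers following along, the de-duplication David describes (anchors collapsed to one entry per source domain and anchor text) can be sketched as follows; the function and data are illustrative, not Nutch's actual API:

```python
# Collapse incoming anchors to one entry per (domain, text) pair,
# mirroring the behaviour described in the thread.
def unique_anchors(anchors):
    seen = set()
    result = []
    for domain, text in anchors:
        if (domain, text) not in seen:
            seen.add((domain, text))
            result.append(text)
    return result

anchors = [
    ("intranet.example.com", "Home"),
    ("intranet.example.com", "Home"),   # dropped: same domain and text
    ("intranet.example.com", "Start"),
]
print(unique_anchors(anchors))  # ['Home', 'Start']
```

On a single-site intranet, every anchor shares one domain, so repeated anchor texts are always collapsed; that is exactly the behaviour David wants to avoid.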

Re: How to recrawl urls

2005-12-19 Thread Kumar Limbu
Hi Nguyen, Thank you for your information, but I would like to confirm it. I do see a variable that defines the next fetch interval, but I am not sure about it. If anyone has more information in this regard, please let me know. Thank you in advance. On 12/19/05, Nguyen Ngoc Giang [EMAIL

Re: How to recrawl urls

2005-12-19 Thread Nguyen Ngoc Giang
The scheme of intranet crawling is like this: First, you create a webdb using WebDBAdminTool. After that, you inject a seed URL using WebDBInjector. The seed URL is inserted into your webdb, marked with the current date and time. Then you create a fetch list using FetchListTool. The FetchListTool
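The date-and-time marking is what drives recrawling: a page becomes due for refetching once its last fetch is older than the configured interval (db.default.fetch.interval, 30 days by default). A minimal sketch of that decision, with illustrative dates:

```python
from datetime import datetime, timedelta

# Default refetch interval, per nutch-default.xml (30 days).
FETCH_INTERVAL = timedelta(days=30)

def is_due(last_fetch, now):
    """A page is due again once its last fetch is older than the interval."""
    return now - last_fetch >= FETCH_INTERVAL

now = datetime(2005, 12, 19)
print(is_due(datetime(2005, 11, 1), now))   # True: older than 30 days
print(is_due(datetime(2005, 12, 10), now))  # False: fetched recently
```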

Does Search Result Show Similar Pages Like Google?

2005-12-19 Thread Victor Lee
Hi, does Nutch's search result show similar pages like Google's does? I went to Modzex.com, which is using Nutch, but I don't see similar pages in its search results. Many thanks.