Re: Crawling a file but not indexing it

2006-04-05 Thread TDLN
Benjamin, you could add this to the HtmlParser and BasicIndexingFilter, but maybe it is best to create your own plugin, along the lines of the WritingPluginExample: 1) add a meta tag to your seed pages: <meta name="indexed" content="no" /> 2) create a ParseFilter that extends HtmlParseFilter and
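A rough sketch of the detection step TDLN describes: the real thing would implement Nutch's HtmlParseFilter extension point and read the already-parsed metadata, but the core test is just spotting the opt-out meta tag. The class and method names below are illustrative assumptions, not Nutch API:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoIndexMetaCheck {
    // Matches <meta name="indexed" content="no"/> with relaxed quoting and
    // case (attribute order assumed fixed). A real parse filter would read
    // the parsed metadata instead of re-scanning raw HTML like this.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=[\"']?indexed[\"']?\\s+content=[\"']?([^\"'\\s>]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns false when the page carries <meta name="indexed" content="no">. */
    public static boolean shouldIndex(String html) {
        Matcher m = META.matcher(html);
        if (m.find()) {
            return !"no".equalsIgnoreCase(m.group(1));
        }
        return true; // no tag present: index by default
    }
}
```

An indexing filter could then consult this flag and skip adding the page to the index while still letting its outlinks be crawled.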

Crawl status

2006-04-05 Thread Fabrice Estiévenart
Hello, One week ago, I launched a crawl on a list of domains with a depth of 10. My crawler is still running now. How can I have the status of the crawl process? (number of fetched/indexed pages, current depth of crawl, percentage of tasks completed... and any other useful information).

Re: Crawl status

2006-04-05 Thread TDLN
How can I have the status of the crawl process? In general this should be apparent from the crawl log. - number of fetched pages is printed to the logs at certain intervals (also number of pages/sec etc.) - number of indexed pages: if you use the crawl tool, indexing is done after all pages

RE: Crawl status

2006-04-05 Thread Dan Morrill
Fabrice, Personally I am tailing the crawl log to find that out. About every 100 pages it gives out the total number of pages, pages per second, and line speed. Hope that helps. r/d -Original Message- From: Fabrice Estiévenart [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 05,
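The status line the posters are tailing can be turned into numbers with a small parser, handy for feeding a progress dashboard. The exact log format differs between Nutch versions, so the line shape below is an assumption; adjust the pattern to whatever your fetcher actually prints:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetcherStatusParser {
    // Assumed line shape (varies by Nutch version):
    //   "status: segment 20060405080019, 1200 pages, 3 errors, 9.8 pages/s, ..."
    private static final Pattern STATUS = Pattern.compile(
        "(\\d+)\\s+pages,\\s+(\\d+)\\s+errors,\\s+([\\d.]+)\\s+pages/s");

    /** Returns {pages, errors, pagesPerSec}, or null if the line doesn't match. */
    public static double[] parse(String line) {
        Matcher m = STATUS.matcher(line);
        if (!m.find()) return null;
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2)),
            Double.parseDouble(m.group(3))
        };
    }
}
```

Piped from `tail -f` over the crawl log, this gives a rough live view of fetch throughput without waiting for the crawl to finish.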

generate fails - class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper

2006-04-05 Thread Byron Miller
Got the following dump at 100% of generate cycle (0.8 svn release) 060405 080019 parsing file:/home/mozdex/trunk/conf/nutch-site.xml 060405 080019 parsing file:/home/mozdex/trunk/conf/hadoop-site.xml Exception in thread "main" java.lang.RuntimeException: class

Re: generate fails - class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper

2006-04-05 Thread Jérôme Charron
Andrzej fixed it 2 hours ago. http://svn.apache.org/viewcvs.cgi?rev=391577&view=rev Thanks Jérôme On 4/5/06, Byron Miller [EMAIL PROTECTED] wrote: Got the following dump at 100% of generate cycle (0.8 svn release) 060405 080019 parsing file:/home/mozdex/trunk/conf/nutch-site.xml 060405

Re: generate fails - class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper

2006-04-05 Thread Byron Miller
hehe, just pulled it down and trying again :) thanks! --- Jérôme Charron [EMAIL PROTECTED] wrote: Andrzej fixed it 2 hours ago. http://svn.apache.org/viewcvs.cgi?rev=391577&view=rev Thanks Jérôme On 4/5/06, Byron Miller [EMAIL PROTECTED] wrote: Got the following dump at

Re: generate fails - class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper

2006-04-05 Thread Andrzej Bialecki
Byron Miller wrote: Got the following dump at 100% of generate cycle (0.8 svn release) Just fixed this. Sorry. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-04-05 Thread Mehmet Tan
Andrzej, Thanks for your response and patch. But I have a few more questions about adaptive refetch. As far as I understood, the solution below is 'not to overwrite some fields of the entries' in the db. Assume we applied the adaptive refetch idea in your patch to the 0.7 version. We have the

Re: Adaptive Refetch

2006-04-05 Thread Andrzej Bialecki
Mehmet Tan wrote: Andrzej, Thanks for your response and patch. But I have a few more questions about adaptive refetch. As far as I understood the solution below is 'not to overwrite some fields of the entries' in the db. Assume we applied the adaptive refetch idea in your patch to the 0.7

Re: Adaptive Refetch

2006-04-05 Thread Mehmet Tan
Sorry but I am not sure I could explain the problem properly. What I am trying to ask is this: You have pages A,B,C,D in webdb and then you come to a page E during the crawl and page E redirects you to page A for example. Then you create a new Page object in the fetcher with url A and write

details: stackoverflow error

2006-04-05 Thread Rajesh Munavalli
I had earlier posted this message to the list but haven't gotten any response. Here are more details. Nutch version: 0.7.2 URL file: contains a single URL. File name: urls Crawl URL filter: set to grab all URLs Command: bin/nutch crawl urls -dir crawl.test -depth 3 Error:
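One possibility worth checking (an assumption, since the actual stack trace is truncated above): a StackOverflowError during a crawl is often caused by deep recursion in java.util.regex when a URL filter pattern hits a very long URL. If that is the cause, raising the JVM thread stack size is a common workaround. This is a config sketch only; if your bin/nutch script does not honor a NUTCH_OPTS environment variable, add the flag to the script's java invocation instead:

```shell
# Assumption: the overflow comes from deep regex recursion in the URL
# filter on very long URLs. Raise the per-thread stack size for the JVM
# that bin/nutch launches, then rerun the crawl.
export NUTCH_OPTS="-Xss1m"
bin/nutch crawl urls -dir crawl.test -depth 3
```

Tightening the crawl-urlfilter patterns to reject pathological URLs (e.g. those containing repeated query strings) can also avoid triggering the recursion in the first place.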

Re: Adaptive Refetch

2006-04-05 Thread Andrzej Bialecki
Mehmet Tan wrote: Sorry but I am not sure I could explain the problem properly. What I am trying to ask is this: You have pages A,B,C,D in webdb and then you come to a page E during the crawl and page E redirects you to page A for example. Then you create a new Page object in the fetcher
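The fix being discussed in this thread — not letting a redirect-created Page record clobber the adaptive-refetch state of the entry already in the webdb — amounts to a merge policy at db-update time. A minimal sketch with a hypothetical Page class (field and method names are illustrative, not Nutch's actual API):

```java
public class PageMerge {
    // Hypothetical stand-in for a webdb entry; field names are illustrative.
    static class Page {
        String url;
        float fetchInterval;   // adaptive refetch interval, tuned over time
        long lastModified;     // 0 means "unknown"
        Page(String url, float fetchInterval, long lastModified) {
            this.url = url;
            this.fetchInterval = fetchInterval;
            this.lastModified = lastModified;
        }
    }

    /**
     * Merge a freshly-written record (e.g. created by the fetcher after a
     * redirect) into an existing db entry, preserving the adaptive-refetch
     * fields instead of letting fresh defaults overwrite them.
     */
    public static Page merge(Page existing, Page fetched) {
        if (existing == null) return fetched;             // genuinely new page
        fetched.fetchInterval = existing.fetchInterval;   // keep tuned interval
        if (fetched.lastModified == 0) {
            fetched.lastModified = existing.lastModified; // keep known value
        }
        return fetched;
    }
}
```

In Mehmet's scenario, page E redirecting to A would produce a fresh record for A; merging it this way keeps A's tuned refetch interval rather than resetting it to the default.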

Info on scoring/indexing and pagerank

2006-04-05 Thread Insurance Squared Inc.
Hi All, Two general questions: - I'm wondering if there are any good sources of written information on actually writing a search engine script. Things like scoring, indexing, that kind of stuff. I bought the Lucene book, but that's Lucene-specific technical info. Looking for something at

Nutch 500 Error

2006-04-05 Thread Paul Stewart
Hi there... I was having a number of problems with my install, mainly because I'm not used to Tomcat and/or Nutch etc... Anyways, I am running Fedora 4 and was told that the packages are a bad idea to use, so I uninstalled all of my Java/Tomcat RPMs and installed new binaries today from the source