spam detect
Hello! Does Nutch have any modules for spam detection? Does anyone know where I can find any information (blogs, articles, FAQs) about it?
RE: How to get score in search.jsp
I have found a solution: I added a score variable to Hit.
-Original Message- From: Anton Potekhin [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 14, 2007 10:48 AM To: nutch-dev@lucene.apache.org Subject: How to get score in search.jsp
How to get score in search.jsp
Hi Nutch gurus! I have a small problem. I need to make some changes to search.jsp: I need to get the first 50 results and sort them differently. To sort, I will change the score of each result with the formula new_score = nutch_score + domain_score_from_my_db. But I don't understand how to get nutch_score in search.jsp. For now I use a workaround: I get the nutch_score using the getValue() method of the org.apache.lucene.search.Explanation class, but I think this is a very slow way. Can anybody help me find a better solution to this problem? P.S. I hope I described my problem clearly. Thanks in advance. Sorry for the duplicated mail; I think I had some problems with my mail account.
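For readers with the same question, here is a minimal sketch of the re-ranking step itself, independent of how the nutch_score is obtained (the patched Hit field from the reply above, or the slow Explanation.getValue() workaround). ScoredHit and getDomainScore() are hypothetical names, not part of Nutch:

  import java.util.Arrays;
  import java.util.Comparator;

  public class ReRanker {
    public static class ScoredHit {
      public final String url;
      public final float nutchScore;   // score as returned by Nutch/Lucene
      public float newScore;           // nutch_score + domain_score_from_my_db
      public ScoredHit(String url, float nutchScore) {
        this.url = url;
        this.nutchScore = nutchScore;
      }
    }

    // domain score lookup against your own database; hypothetical placeholder
    static float getDomainScore(String url) { return 0.0f; }

    // apply the formula to the top 50 hits and sort by the combined score
    public static ScoredHit[] rerank(ScoredHit[] top50) {
      for (ScoredHit h : top50) {
        h.newScore = h.nutchScore + getDomainScore(h.url);
      }
      Arrays.sort(top50, new Comparator<ScoredHit>() {
        public int compare(ScoredHit a, ScoredHit b) {
          return Float.compare(b.newScore, a.newScore);   // descending
        }
      });
      return top50;
    }
  }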
deep limitation
Does Nutch 0.7.2 have any depth limitation? I added a few pages. I need to process these pages and all pages located 3 (for example) clicks away from the added pages. I hope I explained that clearly ;-)
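For what it's worth, in Nutch 0.7 the number of "clicks away from the seed pages" is exactly the crawl depth, so the one-step crawl tool already covers this case (file and directory names below are placeholders):

  bin/nutch crawl urls.txt -dir crawl.demo -depth 3
  # or, with the step-by-step tools, run the generate / fetch / updatedb cycle 3 times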
RE: indexing problem
Nutch is not compatible with the latest Hadoop from svn. Nutch works correctly with the latest Hadoop from svn after some small tuning ;-)
indexing problem
I've got the latest versions of Nutch (0.9-dev) and Hadoop (trunk) from svn. When I try to index I get the following error:
java.lang.ClassCastException: org.apache.nutch.parse.ParseData
at org.apache.nutch.indexer.Indexer$InputFormat$1.next(Indexer.java:92)
at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:184)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:196)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
This exception is raised from the next(Writable key, Writable value) method of the SequenceFileRecordReader class. The 'next' method is called with a 'value' parameter whose class differs on each call (CrawlDatum, ParseData or Inlinks), and when these classes are cast I get the ClassCastException. Why do I get this exception? I looked at the old sources but didn't find any difference in the algorithm. What am I missing?
limitation
How do I limit the number of pages processed from each domain? And how do I set up Nutch to crawl only the domains I added (i.e. make Nutch ignore external links)? If Nutch doesn't support this, what would be the best algorithm for it? P.S. Nutch version 0.7.
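A sketch of the usual way to keep a 0.7 crawl inside your own domains: the URL filter file used by the crawl tool (conf/crawl-urlfilter.txt), with example.com/example.org standing in for your domains. As far as I know 0.7 has no built-in per-domain page cap (later versions add a generate.max.per.host property), and -topN on generate only limits the total per round:

  # conf/crawl-urlfilter.txt - accept only my own domains
  +^http://([a-z0-9]*\.)*example\.com/
  +^http://([a-z0-9]*\.)*example\.org/
  # reject anything that did not match above
  -.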
Fetch error
I updated Hadoop, but now I get the following error on the fetch step (reduce):
06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_00_3 0.3334% reduce copy (6 of 6 at 11.77 MB/s)
06/08/29 08:31:20 WARN /: /getMapOutput.jsp?map=task_0003_m_02_0reduce=1: java.lang.IllegalStateException
at org.mortbay.jetty.servlet.ServletHttpResponse.getWriter(ServletHttpResponse.java:561)
at org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:122)
at org.apache.jasper.runtime.JspWriterImpl.flushBuffer(JspWriterImpl.java:115)
at org.apache.jasper.runtime.PageContextImpl.release(PageContextImpl.java:190)
at org.apache.jasper.runtime.JspFactoryImpl.internalReleasePageContext(JspFactoryImpl.java:115)
at org.apache.jasper.runtime.JspFactoryImpl.releasePageContext(JspFactoryImpl.java:75)
at org.apache.jsp.getMapOutput_jsp._jspService(getMapOutput_jsp.java:100)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
How can I fix this? The generate step works fine, but on the fetch reduce step I get this error and the task fails.
RE: Fetch error
The previous error was from the tasktracker log. In the jobtracker log I now see the following error:
06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from task_0001_r_00_1: java.lang.AbstractMethodError: org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileSystem;Lorg/apache/hadoop/mapred/JobConf;Ljava/lang/String;Lorg/apache/hadoop/util/Progressable;)Lorg/apache/hadoop/mapred/RecordWriter;
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:297)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 30, 2006 12:17 PM To: nutch-dev@lucene.apache.org Subject: Fetch error
RE: problem with nutch
I tried starting the jobtracker without Tomcat.
-Original Message- From: Chris Stephens [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 23, 2006 6:16 PM To: nutch-dev@lucene.apache.org Subject: Re: problem with nutch
This is probably a better question for the user list, nutch-user@lucene.apache.org. It looks like you're trying to bind Tomcat to a port that is already being used. Edit your configuration file and change the default port (usually 8080) to something that is available on that server.
[EMAIL PROTECTED] wrote: When I try to start Nutch 0.8 I get errors. How can I solve this problem? JobTracker log: ...skipped...
RE: problem with nutch
To be exact: when I started the jobtracker, only the namenode was running on that server. None of the ports from hadoop-default.xml were in use.
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Friday, August 25, 2006 10:48 AM To: nutch-dev@lucene.apache.org Subject: RE: problem with nutch
I tried starting the jobtracker without Tomcat.
-Original Message- From: Chris Stephens [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 23, 2006 6:16 PM To: nutch-dev@lucene.apache.org Subject: Re: problem with nutch
RE: problem with nutch
In addition, please note the following part of the log:
06/08/25 05:07:59 WARN servlet.WebApplicationContext: Web application not found /spider_kakle_mapred/spider/conf:/spider_
06/08/25 05:07:59 WARN servlet.WebApplicationContext: Configuration error on /spider_kakle_mapred/spider/conf:/spider_kak
java.io.FileNotFoundException: /spider_kakle_mapred/spider/conf:/spider_kakle_mapred/jdk1.5.0_06/lib/tools.jar:/spider_ka
at org.mortbay.jetty.servlet.WebApplicationContext.resolveWebApp(WebApplicationContext.java:266)
at org.mortbay.jetty.servlet.WebApplicationContext.doStart(WebApplicationContext.java:449)
at org.mortbay.util.Container.start(Container.java:72)
at org.mortbay.http.HttpServer.doStart(HttpServer.java:753)
at org.mortbay.util.Container.start(Container.java:72)
at org.apache.hadoop.mapred.StatusHttpServer.start(StatusHttpServer.java:172)
at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:461)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1143)
06/08/25 05:07:59 INFO util.Container: Started HttpContext[/logs,/logs]
06/08/25 05:07:59 INFO util.Container: Started HttpContext[/static,/static]
06/08/25 05:07:59 INFO http.SocketListener: Started SocketListener on 0.0.0.0:8010
06/08/25 05:07:59 WARN mapred.JobTracker: Starting tracker java.io.IOException: Problem starting http server
at org.apache.hadoop.mapred.StatusHttpServer.start(StatusHttpServer.java:195)
at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:461)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1143)
Caused by: org.mortbay.util.MultiException[java.io.FileNotFoundException: /spider_kakle_mapred/spider/conf:/spider_kakle_
at org.mortbay.http.HttpServer.doStart(HttpServer.java:731)
Here I see an exception (java.io.FileNotFoundException). What file was not found? What can I do?
problem with nutch
When I try start nutch 0.8 I get errors. How I can solve this problem? JobTracker log: ...Skiped... 06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.endian' is little 06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.isalist' is 06/08/23 05:19:40 INFO util.Credential: Checking Resource aliases 06/08/23 05:19:40 INFO http.HttpServer: Version Jetty/5.1.4 06/08/23 05:19:41 INFO util.Container: Started [EMAIL PROTECTED] 06/08/23 05:19:41 INFO util.Container: Started WebApplicationContext[/,/] 06/08/23 05:19:41 WARN servlet.WebApplicationContext: Web application not found /spider_kakle_mapred/spider/conf:/spider_kakle_mapred/jdk1.5.0_06/lib/tools. jar:/spider_kakle_mapred/spider:/spider_kakle_mapred/spider/hadoop-*.jar:/sp ider_kakle_mapred/spider/lib/commons-cli-2.0-SNAPSHOT.jar:/spider_kakle_mapr ed/spider/lib/commons-lang-2.1.jar:/spider_kakle_mapred/spider/lib/commons-l ogging-1.0.4.jar:/spider_kakle_mapred/spider/lib/commons-logging-api-1.0.4.j ar:/spider_kakle_mapred/spider/lib/concurrent-1.3.4.jar:/spider_kakle_mapred /spider/lib/hadoop.jar:/spider_kakle_mapred/spider/lib/jakarta-oro-2.0.7.jar :/spider_kakle_mapred/spider/lib/jetty-5.1.4.jar:/spider_kakle_mapred/spider /lib/junit-3.8.1.jar:/spider_kakle_mapred/spider/lib/log4j-1.2.13.jar:/spide r_kakle_mapred/spider/lib/lucene.jar:/spider_kakle_mapred/spider/lib/nutch.j ar:/spider_kakle_mapred/spider/lib/servlet-api.jar:/spider_kakle_mapred/spid er/lib/taglibs-i18n.jar:/spider_kakle_mapred/spider/lib/xerces-2_6_2-apis.ja r:/spider_kakle_mapred/spider/lib/xerces-2_6_2.jar:/spider_kakle_mapred/spid er/lib/jetty-ext/ant.jar:/spider_kakle_mapred/spider/lib/jetty-ext/commons-e l.jar:/spider_kakle_mapred/spider/lib/jetty-ext/jasper-compiler.jar:/spider_ kakle_mapred/spider/lib/jetty-ext/jasper-runtime.jar:/spider_kakle_mapred/sp ider/lib/jetty-ext/jsp-api.jar 06/08/23 05:19:41 WARN servlet.WebApplicationContext: Configuration error on /spider_kakle_mapred/spider/conf:/spider_kakle_mapred/jdk1.5.0_06/lib/tools. jar:/spider_kakle_mapred/spider:/spider_kakle_mapred/spider/hadoop-*.jar:/sp ider_kakle_mapred/spider/lib/commons-cli-2.0-SNAPSHOT.jar:/spider_kakle_mapr ed/spider/lib/commons-lang-2.1.jar:/spider_kakle_mapred/spider/lib/commons-l ogging-1.0.4.jar:/spider_kakle_mapred/spider/lib/commons-logging-api-1.0.4.j ar:/spider_kakle_mapred/spider/lib/concurrent-1.3.4.jar:/spider_kakle_mapred /spider/lib/hadoop.jar:/spider_kakle_mapred/spider/lib/jakarta-oro-2.0.7.jar :/spider_kakle_mapred/spider/lib/jetty-5.1.4.jar:/spider_kakle_mapred/spider /lib/junit-3.8.1.jar:/spider_kakle_mapred/spider/lib/log4j-1.2.13.jar:/spide r_kakle_mapred/spider/lib/lucene.jar:/spider_kakle_mapred/spider/lib/nutch.j ar:/spider_kakle_mapred/spider/lib/servlet-api.jar:/spider_kakle_mapred/spid er/lib/taglibs-i18n.jar:/spider_kakle_mapred/spider/lib/xerces-2_6_2-apis.ja r:/spider_kakle_mapred/spider/lib/xerces-2_6_2.jar:/spider_kakle_mapred/spid er/lib/jetty-ext/ant.jar:/spider_kakle_mapred/spider/lib/jetty-ext/commons-e l.jar:/spider_kakle_mapred/spider/lib/jetty-ext/jasper-compiler.jar:/spider_ kakle_mapred/spider/lib/jetty-ext/jasper-runtime.jar:/spider_kakle_mapred/sp ider/lib/jetty-ext/jsp-api.jar java.io.FileNotFoundException: /spider_kakle_mapred/spider/conf:/spider_kakle_mapred/jdk1.5.0_06/lib/tools. 
jar:/spider_kakle_mapred/spider:/spider_kakle_mapred/spider/hadoop-*.jar:/sp ider_kakle_mapred/spider/lib/commons-cli-2.0-SNAPSHOT.jar:/spider_kakle_mapr ed/spider/lib/commons-lang-2.1.jar:/spider_kakle_mapred/spider/lib/commons-l ogging-1.0.4.jar:/spider_kakle_mapred/spider/lib/commons-logging-api-1.0.4.j ar:/spider_kakle_mapred/spider/lib/concurrent-1.3.4.jar:/spider_kakle_mapred /spider/lib/hadoop.jar:/spider_kakle_mapred/spider/lib/jakarta-oro-2.0.7.jar :/spider_kakle_mapred/spider/lib/jetty-5.1.4.jar:/spider_kakle_mapred/spider /lib/junit-3.8.1.jar:/spider_kakle_mapred/spider/lib/log4j-1.2.13.jar:/spide r_kakle_mapred/spider/lib/lucene.jar:/spider_kakle_mapred/spider/lib/nutch.j ar:/spider_kakle_mapred/spider/lib/servlet-api.jar:/spider_kakle_mapred/spid er/lib/taglibs-i18n.jar:/spider_kakle_mapred/spider/lib/xerces-2_6_2-apis.ja r:/spider_kakle_mapred/spider/lib/xerces-2_6_2.jar:/spider_kakle_mapred/spid er/lib/jetty-ext/ant.jar:/spider_kakle_mapred/spider/lib/jetty-ext/commons-e l.jar:/spider_kakle_mapred/spider/lib/jetty-ext/jasper-compiler.jar:/spider_ kakle_mapred/spider/lib/jetty-ext/jasper-runtime.jar:/spider_kakle_mapred/sp ider/lib/jetty-ext/jsp-api.jar at org.mortbay.jetty.servlet.WebApplicationContext.resolveWebApp(WebApplication Context.java:266) at org.mortbay.jetty.servlet.WebApplicationContext.doStart(WebApplicationContex t.java:449) at org.mortbay.util.Container.start(Container.java:72) at org.mortbay.http.HttpServer.doStart(HttpServer.java:753) at org.mortbay.util.Container.start(Container.java:72) at org.apache.hadoop.mapred.StatusHttpServer.start(StatusHttpServer.java:154) at
some questions
I plan to use Nutch 0.8 on several computers with DFS, but I'm worried about Nutch's requirements for free HDD space. For example, suppose I have: 1) a server with the jobtracker and namenode, 2) 5 servers with tasktrackers and 20 GB HDDs, 3) 5 servers with datanodes, also with 20 GB HDDs (DFS replication set to 1). Some questions: 1) Is this HDD space enough to run the tasktrackers? 2) How do I calculate the approximate free HDD space needed for the tasktracker servers and for the jobtracker/namenode server? 3) Will I be able to increase the data storage space by increasing the number of datanode servers, or is adding datanodes not enough?
RE: nutch
My settings:
<property>
  <name>mapred.local.dir</name>
  <value>/hadoop/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o.</description>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.</description>
</property>
The device mounted on / has 115G of free space:
[EMAIL PROTECTED] /]# df -h
Filesystem   Size  Used  Avail  Use%  Mounted on
/dev/sda2    133G  13G   113G   11%   /
Does anybody have other ideas?
-Original Message- From: Sami Siren [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 02, 2006 6:01 PM To: nutch-dev@lucene.apache.org Subject: Re: nutch
Most probably you have run out of space in the tmp (local) filesystem. Use properties like
<property>
  <name>mapred.system.dir</name>
  <value><!-- path to fs that contains a lot of space --></value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value><!-- path to fs that contains a lot of space --></value>
</property>
in hadoop-site.xml to get over this problem.
[EMAIL PROTECTED] wrote: I forgot ;-) One more question: is this a problem in Nutch or in Hadoop? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 02, 2006 11:38 AM To: nutch-dev@lucene.apache.org Subject: nutch
I use Nutch 0.8 (mapred). Nutch is started on 3 servers. When Nutch tries to index a segment I get an error on a tasktracker: skipped
Problem opening checksum file
I create a file on DFS (for example, a file named done). Then I try to copy this file from DFS to the local filesystem. As a result I get the file in the local filesystem, plus this error:
Problem opening checksum file: /user/root/crawl/done. Ignoring with exception org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename /user/root/crawl/done.crc
at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218)
To create the file I use this code:
FileSystem fs = ...
fs.createNewFile(new Path(segments[i], already_indexed));
To copy the file to the local filesystem I use:
fs.copyToLocalFile(..., ...);
How do I create the .crc file? Why isn't the .crc file created automatically when a file is created on DFS? What is the correct way to create a file on DFS?
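In case it helps, here is a minimal sketch of creating the marker file through the stream API instead of createNewFile(); closing the stream gives the checksummed write path a chance to produce its side data. The already_indexed name is just the example from the message above, and whether this avoids the warning depends on the Hadoop revision:

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class MarkerFile {
    // creates an empty marker file on DFS via the stream API and copies it
    // to the local filesystem; fs, segment and localDir are supplied by the caller
    public static void markIndexed(FileSystem fs, Path segment, Path localDir) throws IOException {
      Path marker = new Path(segment, "already_indexed");
      FSDataOutputStream out = fs.create(marker);  // stream-based create
      out.close();                                 // close so data (and checksum info) are written out
      fs.copyToLocalFile(marker, new Path(localDir, "already_indexed"));
    }
  }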
search speed
I am using DFS. My index contains 3,706,249 documents. Currently a search takes from 2 to 4 seconds (I tested with a query of 3 search terms). Tomcat runs on a box with dual Opteron 2.4 GHz CPUs and 16 GB RAM. I think search is very slow now. Can we make search faster? What factors influence search speed?
free disk space
I'm using Nutch 0.8 and have 3 computers. Two of them run a datanode and a tasktracker; the other one runs the namenode and jobtracker. Do I need more disk space on the machines running tasktrackers and the jobtracker as the number of processed pages and the size of the database grow? Would I be able to add a 3rd datanode when I run out of free disk space on the computers that currently have datanodes? How much free disk space do I need for the task- and jobtrackers to work properly?
No space left on device
I'm using Nutch 0.8 and have 3 computers. One of my tasktrackers always goes down. This happens during indexing (index crawl/indexes). On the server with the crashed tasktracker, 53G of disk space is now free and only 11G is used. How can I solve this problem? Why does the tasktracker require so much free space on the HDD? Piece of the log with the error:
060613 151840 task_0083_r_01_0 0.5% reduce > sort
060613 151841 task_0083_r_01_0 0.5% reduce > sort
060613 151842 task_0083_r_01_0 0.5% reduce > sort
060613 151843 task_0083_r_01_0 0.5% reduce > sort
060613 151844 task_0083_r_01_0 0.5% reduce > sort
060613 151845 task_0083_r_01_0 0.5% reduce > sort
060613 151846 task_0083_r_01_0 0.5% reduce > sort
060613 151847 task_0083_r_01_0 0.5% reduce > sort
060613 151847 SEVERE FSError, exiting: java.io.IOException: No space left on device
060613 151847 task_0083_r_01_0 SEVERE FSError from child
060613 151847 task_0083_r_01_0 org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
060613 151847 task_0083_r_01_0 at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSyst
060613 151847 task_0083_r_01_0 at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:69)
060613 151847 task_0083_r_01_0 at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.j
060613 151847 task_0083_r_01_0 at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
060613 151847 task_0083_r_01_0 at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
060613 151847 task_0083_r_01_0 at java.io.DataOutputStream.flush(DataOutputStream.java:106)
060613 151847 task_0083_r_01_0 at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
060613 151847 task_0083_r_01_0 at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.close(SequenceFile.java:598)
060613 151847 task_0083_r_01_0 at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:533)
060613 151847 task_0083_r_01_0 at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:519)
060613 151847 task_0083_r_01_0 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
060613 151847 task_0083_r_01_0
060613 151847 task_0083_r_01_0 at org.apache.hadoop.mapred.TaskTracker$Chi
060613 151847 task_0083_r_01_0 Caused by: java.io.IOException: No space left on device
060613 151847 task_0083_r_01_0 at java.io.FileOutputStream.writeBytes(Native Method)
060613 151847 task_0083_r_01_0 at java.io.FileOutputStream.write(FileOutputStream.java:260)
060613 151848 task_0083_r_01_0 at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSyst
060613 151848 task_0083_r_01_0 ... 11 more
060613 151849 Server connection on port 50050 from 10.0.0.3: exiting
060613 151854 task_0083_m_01_0 done; removing files.
060613 151855 task_0083_m_03_0 done; removing files.
RE: No space left on device
Yes, I use DFS. How do I configure Nutch to work around the disk space problem? How do I control the number of smaller files?
-Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 14, 2006 5:46 PM To: nutch-dev@lucene.apache.org Subject: Re: No space left on device
The tasktrackers require intermediate space while performing the map and reduce functions. Many smaller files are produced during the map and reduce processes; they are deleted when the processes finish. If you are using the DFS then more disk space is required than is actually used, since disk space is grabbed in blocks. Dennis
[EMAIL PROTECTED] wrote: I'm using Nutch 0.8 and have 3 computers. One of my tasktrackers always goes down. This happens during indexing (index crawl/indexes).
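A hedged hadoop-site.xml sketch of the usual mitigation mentioned elsewhere on this list: point the intermediate and DFS data directories at a partition with plenty of room (the /big_disk paths are placeholders). As far as I know there is no direct knob for the number of intermediate files; it follows from the data volume and the task counts.

  <property>
    <name>mapred.local.dir</name>
    <value>/big_disk/mapred/local</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/big_disk/dfs/data</value>
  </property>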
RE: resolving IP in...
Does anyone know where I can download Nutch version 0.8? I can't find it :( http://svn.apache.org/repos/asf/lucene/nutch/trunk/
summary
My Nutch crawl processed the pages http://www.abc-internet.net/lavinia-lingerie/Lingerie.htm and http://www.abc-internet.net/pamperedpassions-pampered_passions/Lingerie.htm. When I search for the term lingerie, Nutch brings up results with a bad summary (... Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie ...). Please help me solve this problem.
RE: summary
It's not a problem with Nutch! Are you seeing spamdexing? Yes, I understand this... But how do I fight this spam?
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Monday, June 5, 2006 11:43 AM To: nutch-dev@lucene.apache.org Subject: summary
error
I updated some plugins... And now I get errors in the Tomcat log: May 22, 2006 3:28:50 AM org.apache.nutch.plugin.PluginRepository init SEVERE: org.apache.nutch.plugin.PluginRuntimeException: Plugin (summary-basic), extension point: org.apache.nutch.searcher.Summarizer does not exist. How do I fix this problem?
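Not an authoritative fix, but this error usually means the searcher webapp cannot see the plugin that provides the extension point. Things worth checking are that the deployed webapp contains the rebuilt plugins directory and that the plugin lists in nutch-site.xml still include summary-basic; a sketch (the plugin.includes value is only an illustrative subset of the stock default):

  <property>
    <name>plugin.folders</name>
    <value>plugins</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic</value>
  </property>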
to count the number of pages from each domain
We tried to develop a solution to count the number of pages from each domain. We planned to do it like this:
- map had the following input: k - UTF8 (url of page), v - CrawlDatum; and the following output: k - UTF8 (domain of page), v - UrlAndPage implementing Writable (a structure containing the url of the page and its CrawlDatum)
- reduce had the following input: k - UTF8 (domain of page), v - an iterator over a list of UrlAndPage; the output was k - UTF8 (url of page), v - CrawlDatum
- in the map function we parsed the domain from the url, created the UrlAndPage structure and put them into the OutputCollector
- in reduce we counted how many elements were in the list behind the iterator, put that count into each CrawlDatum, then formed new (url, CrawlDatum) pairs and put them into the OutputCollector
The following problem arose: as far as we can see, the input and output types of map and reduce should be the same, but in our case they were different, and it caused an error like this:
060505 183200 task_0104_m_00_3 java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage
060505 183200 task_0104_m_00_3 at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:366)
060505 183200 task_0104_m_00_3 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
060505 183200 task_0104_m_00_3 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)
060505 183200 task_0104_m_00_3 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:755)
060505 183200 task_0104_m_00_3 Caused by: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage
060505 183200 task_0104_m_00_3 at java.lang.Class.newInstance0(Class.java:335)
060505 183200 task_0104_m_00_3 at java.lang.Class.newInstance(Class.java:303)
060505 183200 task_0104_m_00_3 at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:364)
We decided that it is impossible in Hadoop to have different input/output types for map and reduce, so we switched to another scheme. This scheme runs two jobs: the first job has the map function, the second job has the reduce task, and the jobs have different classes for their input and output parameters. The new map and reduce do the same as described above. We'd like to ask your advice on which way is best for tasks like these. Is the second way good? Are there other ways to do this better?
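For what it's worth, the InstantiationException on PostUpdateFilter$UrlAndPage is the error Hadoop gives when it cannot create the value class reflectively; the usual causes are a Writable declared as a non-static inner class or one without a public no-argument constructor, independent of the map/reduce type question. A minimal sketch of a Writable that Hadoop can instantiate (the field layout is an assumption, not your actual code):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.UTF8;
  import org.apache.hadoop.io.Writable;
  import org.apache.nutch.crawl.CrawlDatum;

  // must be a top-level class or a *static* nested class
  public class UrlAndPage implements Writable {
    private UTF8 url = new UTF8();
    private CrawlDatum datum = new CrawlDatum();

    public UrlAndPage() {}                        // public no-arg constructor is required

    public UrlAndPage(UTF8 url, CrawlDatum datum) {
      this.url = url;
      this.datum = datum;
    }

    public void write(DataOutput out) throws IOException {
      url.write(out);                             // serialize both fields in a fixed order
      datum.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      url.readFields(in);                         // deserialize in the same order
      datum.readFields(in);
    }

    public UTF8 getUrl() { return url; }
    public CrawlDatum getDatum() { return datum; }
  }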
JobTrackerInfoServer and nutch*.jar
Why don't JSP scripts launched under the JobTrackerInfoServer see classes from nutch*.jar? How do I point the JobTrackerInfoServer at nutch*.jar?
new parameters
We see new parameters in hadoop-default.xml: dfs.replication.max and dfs.replication.min. What do these parameters mean?
RE: exception
We updated Hadoop from the trunk branch, but now we get new errors.
On the tasktracker side: skipped
java.io.IOException: timed out waiting for response
at org.apache.hadoop.ipc.Client.call(Client.java:305)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:149)
at org.apache.hadoop.mapred.$Proxy0.pollForTaskWithClosedJob(Unknown Source)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:310)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:374)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:813)
060427 062708 Client connection to 10.0.0.10:9001 caught: java.lang.RuntimeException: java.lang.ClassNotFoundException: java.lang.RuntimeException: java.lang.ClassNotFoundException:
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:152)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:139)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:186)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:60)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170)
060427 062708 Client connection to 10.0.0.10:9001: closing
On the jobtracker side: skipped
060427 061713 Server handler 3 on 9001 caught: java.lang.IllegalArgumentException: Argument is not an array
java.lang.IllegalArgumentException: Argument is not an array
at java.lang.reflect.Array.getLength(Native Method)
at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:92)
at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:250)
skipped
-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, April 27, 2006 12:48 AM To: nutch-dev@lucene.apache.org Subject: Re: exception
This is a Hadoop DFS error. It could mean that you don't have any datanodes running, or that all your datanodes are full. Or, it could be a bug in dfs. You might try a recent nightly build of Hadoop to see if it works any better. Doug
Anton Potehin wrote: What does an error of the following type mean: java.rmi.RemoteException: java.io.IOException: Cannot obtain additional block for file /user/root/crawl/indexes/index/_0.prx
update crawldb
How do we update info about links already added to the db? In particular, we need to update the status of some of the links. What classes should we use to read the info about each link stored in the db and then update its status? We use the trunk branch of Nutch.
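Not the official API, but a minimal sketch of reading crawldb entries directly and changing a CrawlDatum status. The part-00000 path, the selection logic, and the chosen status are assumptions; UTF8 keys match the trunk of that era (later trunks use Text), and in practice the new db would normally be produced through the regular crawldb update job rather than a single-process rewrite like this:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.UTF8;
  import org.apache.nutch.crawl.CrawlDatum;

  public class CrawlDbStatusUpdater {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // crawldb layout assumed: crawl/crawldb/current/part-00000/data (a MapFile data file)
      Path in = new Path("crawl/crawldb/current/part-00000/data");
      Path out = new Path("crawl/crawldb_new/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out, UTF8.class, CrawlDatum.class);
      UTF8 url = new UTF8();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        if (url.toString().startsWith("http://example.com/")) {   // your own selection logic
          datum.setStatus(CrawlDatum.STATUS_DB_FETCHED);          // example status change
        }
        writer.append(url, datum);                                // copy every entry to the new db
      }
      writer.close();
      reader.close();
    }
  }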
mapred.map.tasks
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job. Typically set to a prime several times greater than number of available hosts. Ignored when mapred.job.tracker is local.</description>
</property>
We have a question about this property. Is it really preferable to set this parameter to several times the number of available hosts? We do not understand why it should be so. Our spider is distributed across 3 machines. What value is preferable for this parameter in our case? Which other factors may affect the best value for this parameter?
RE: question about crawldb
-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Tuesday, April 18, 2006 8:02 PM To: nutch-dev@lucene.apache.org Subject: Re: question about crawldb
Anton Potehin wrote: 1. We have found these flags in the CrawlDatum class (STATUS_SIGNATURE, STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_DB_GONE, STATUS_LINKED, STATUS_FETCH_SUCCESS, STATUS_FETCH_RETRY, STATUS_FETCH_GONE). Though the names of these flags describe their purpose, it is not completely clear what they mean or what the difference is between, for example, STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS.
The STATUS_DB_* codes are used in entries in the crawldb. STATUS_FETCH_* codes are used in fetcher output. STATUS_LINKED is used in parser output for urls that are linked to. A crawldb update combines all of these (the old version of the db, plus fetcher and parser output) to generate a new version of the db, containing only STATUS_DB_* entries. This logic is in CrawlDbReducer. Does that help?
Yes ;-) tnx...
question about crawldb
1. We have found these flags in the CrawlDatum class:
public static final byte STATUS_SIGNATURE = 0;
public static final byte STATUS_DB_UNFETCHED = 1;
public static final byte STATUS_DB_FETCHED = 2;
public static final byte STATUS_DB_GONE = 3;
public static final byte STATUS_LINKED = 4;
public static final byte STATUS_FETCH_SUCCESS = 5;
public static final byte STATUS_FETCH_RETRY = 6;
public static final byte STATUS_FETCH_GONE = 7;
Though the names of these flags describe their purpose, it is not completely clear what they mean or what the difference is between, for example, STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS.
2. Where are new links added into the CrawlDB?
mapred branch
Where is the mapred branch of Nutch now located?
image search
Has anybody tried to create an image search based on Nutch?
Killing lines
Here is a snippet from a TaskTracker log file:
051206 090643 Task task_r_qegmsh timed out. Killing.
051206 090646 Task task_r_qegmsh timed out. Killing.
051206 090649 Task task_r_qegmsh timed out. Killing.
051206 090652 Task task_r_qegmsh timed out. Killing.
051206 090655 Task task_r_qegmsh timed out. Killing.
051206 090658 Task task_r_qegmsh timed out. Killing.
051206 090701 Task task_r_qegmsh timed out. Killing.
051206 090704 Task task_r_qegmsh timed out. Killing.
051206 090707 Task task_r_qegmsh timed out. Killing.
051206 090710 Task task_r_qegmsh timed out. Killing.
051206 090712 task_r_qegmsh 0.04168% reduce copy
051206 091022 task_r_qegmsh 0.14583334% reduce copy
051206 091024 task_r_qegmsh 0.1667% reduce copy
The 'Killing' lines repeat every 3 seconds; there are hundreds of them. What are they?
mapred crawl
We used Nutch for whole-web crawling. In an infinite loop we ran these tasks:
1) bin/nutch generate db segmentsPath -topN 1
2) bin/nutch fetch <segment name>
3) bin/nutch updatedb db <segment name>
4) bin/nutch analyze db <segment name>
5) bin/nutch index <segment name>
6) bin/nutch dedup segments dedup.tmp
After each iteration we produce a new segment and can use it for search. Now we are trying mapred. How can we use crawl in a similar way? We need results during the process, not only at the end of crawling (since it is a very long process - weeks).
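A hedged sketch of one iteration with the 0.8 (mapred) tool set, which keeps the same "one new searchable segment per iteration" shape. Directory names follow the 0.8 tutorial conventions, and the exact invertlinks/index arguments can differ slightly between revisions, so check bin/nutch <command> usage on your build:

  # one whole-web iteration; repeat in a loop
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=crawl/segments/`ls crawl/segments | tail -1`      # newest segment
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch invertlinks crawl/linkdb $segment
  bin/nutch index crawl/indexes/`basename $segment` crawl/crawldb crawl/linkdb $segment
  bin/nutch dedup crawl/indexes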
About tomcat
We have come to the conclusion that we need to restart the webapp for new results to appear in search. How do we do this correctly without restarting Tomcat? After Tomcat has been running for a long time, we get a "too many open files" error. Maybe this is a result of restarting the webapp by touching web.xml? For now, before starting Tomcat, we set the maximum number of open files to 4096 (1024 by default), but we don't think this is the right solution.
jobdetails.jsp and jobtracker.jsp
How do I use jobtracker.jsp and jobdetails.jsp? Do they need Tomcat? When I try to open jobdetails.jsp under Tomcat, it returns an error:
java.lang.NullPointerException
at org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp:53)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:744)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:595)
RE: jobdetails.jsp and jobtracker.jsp
So they don't need Tomcat? But then what should we type into the browser address bar? http://host_jobtracker:port_jobtracker/jobtracker/jobtracker.jsp ?
-Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Monday, November 21, 2005 12:46 PM To: nutch-dev@lucene.apache.org Subject: Re: jobdetails.jsp and jobtracker.jsp
[EMAIL PROTECTED] wrote: How do I use jobtracker.jsp and jobdetails.jsp? Do they need Tomcat?
No, but jobdetails.jsp requires a parameter (job_id) - start with jobtracker.jsp, and then follow the links.
-- Best regards, Andrzej Bialecki, http://www.sigram.com, Contact: info at sigram dot com
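Concretely, the address is the jobtracker's own status HTTP port with the JSP name appended, something like the line below; the port is whatever mapred.job.tracker.info.port is set to in your configuration (50030 is the usual default in later releases, so your build may differ):

  http://host_jobtracker:50030/jobtracker.jsp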
RE: jobdetails.jsp and jobtracker.jsp
Why do we need the mapred.map.tasks parameter to be greater than the number of available hosts? If we set it equal to the number of hosts, we get the negative progress percentages problem.
RE: mapred.map.tasks
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111. In nutch-site.xml I specified these parameters on both machines:
<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system. Either the literal string local or a host:port for NDFS.</description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.250:9010</value>
  <description>The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task.</description>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job. Typically set to a prime several times greater than number of available hosts. Ignored when mapred.job.tracker is local.</description>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run simultaneously by a task tracker.</description>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is local.</description>
</property>
On 192.168.0.250 I started:
bin/nutch-daemon.sh start datanode
bin/nutch-daemon.sh start namenode
bin/nutch-daemon.sh start jobtracker
bin/nutch-daemon.sh start tasktracker
I created a directory seeds with a file urls in it; urls contained 2 links. Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds seeds), and the directory was added successfully. Then I launched the command: bin/nutch crawl seeds -depth 2
As a result I got this log written by the jobtracker:
051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.
Log written by the tasktracker on 192.168.0.111:
..
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.
Log written by the tasktracker on 192.168.0.250:
051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on, i.e. the log contains records with decreasing percentages.
I concluded that there was an attempt to split the inject across the 2 machines, i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'. 'task_m_z66npx' finished successfully, while 'task_m_xaynqo' caused problems (negative progress). But if I change the parameter mapred.reduce.tasks to 4, all tasks finish successfully and everything works right.
-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 22, 2005 2:10 AM To: nutch-dev@lucene.apache.org Subject: Re: mapred.map.tasks
[EMAIL PROTECTED] wrote: Why do we need the mapred.map.tasks parameter to be greater than the number of available hosts? If we set it equal to the number of hosts, we get the negative progress percentages problem.
Can you please post a simple example that demonstrates the negative progress problem? E.g., the minimal changes to your conf/ directory required to illustrate this, how you start your daemons, etc. Thanks, Doug
rank system
What about scoring in mapred? I have looked at crawl/Crawl.java but I did not find anything concerning the calculation of page scores. Does mapred use a ranking system somehow? Is it possible to use mapred for clustering and whole-web crawling, or does it work with intranet crawling only?
RE: rank system
Alright, I see that in crawl/Indexer.java, in the reduce method, there is a dbDatum object which contains a score. But where is this score calculated? What formula is used to calculate the score?
-Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 08, 2005 1:54 PM To: nutch-dev@lucene.apache.org Subject: Re: rank system
Pre-score calculation is done in the indexer. Yes, it works with complete web crawls as well, and it works very well for that. :-) Stefan
On 08.11.2005 at 11:22, Anton Potehin wrote: What about scoring in mapred?
questions
After looking through Crawl.java I split all the work into several phases: 1) Inject - here we add web links into the crawlDb 2) Generate segment - here we create a data segment 3) Fetching 4) Parse segment 5) Update crawlDb - here the information from the segment is added into the crawlDb 6) Phases 2-5 are repeated several times 7) Link db. I can't understand how the clusterization is performed. Which phases may be performed in parallel on several machines, and how may jobs be split across several machines? What is performed in the 7th phase?
RE: questions
Does this mean that every job in every phase may be split across several machines (for example, that generate or any of the other phases may be performed in parallel on several machines)? Could you give us the URL of the presentation on the wiki, please?
-Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 08, 2005 4:47 PM To: nutch-dev@lucene.apache.org Subject: Re: questions
Clustering is done at search time, and only the first 200 hits are clustered. Parsing is normally done during fetching. MapReduce splits all jobs into several tasks and reduces the results together. You will find some presentation slides in the wiki. HTH, Stefan
On 08.11.2005 at 14:31, Anton Potehin wrote: After looking through Crawl.java I split all the work into several phases...