spam detect

2007-07-09 Thread anton
Hello!

Does Nutch have any modules for spam detection?
Does anyone know where I can find any information (blogs, articles, FAQ)
about it?



RE: How to get score in search.jsp

2007-02-14 Thread Anton Potekhin
I have found a solution: I added a score variable to the Hit class.
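
Roughly, what I do now (a sketch only; getScore() is the accessor I added to Hit
myself, and lookupDomainScore() stands for the query against my own database):

  // Take the first 50 hits (guard against fewer hits in real code) and re-sort them
  // by new_score = nutch_score + domain_score_from_my_db.
  Hit[] top = hits.getHits(0, 50);
  final Map newScores = new HashMap();
  for (int i = 0; i < top.length; i++) {
    float rescored = top[i].getScore() + lookupDomainScore(top[i]);
    newScores.put(top[i], new Float(rescored));
  }
  Arrays.sort(top, new Comparator() {
    public int compare(Object a, Object b) {        // descending by the new score
      return ((Float) newScores.get(b)).compareTo((Float) newScores.get(a));
    }
  });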

-Original Message-
From: Anton Potekhin [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 14, 2007 10:48 AM
To: nutch-dev@lucene.apache.org
Subject: How to get score in search.jsp
Importance: High

Hi Nutch Gurus!

I have a small problem. I need to make some changes to search.jsp: I want to get the
first 50 results and sort them in a different way. To sort, I will change the score of
each result with the formula new_score = nutch_score + domain_score_from_my_db, but I
don't understand how to get nutch_score in search.jsp.

For now I use a workaround: I get the nutch_score from the getValue() method of the
org.apache.lucene.search.Explanation class, but I think this is a very slow way to
do it.

Can anybody help me to find a solution for this problem?

P.S. I hope that I described my problem clearly. Thanks in advance.

Sorry for the duplicated mail. I think I had some problems with my mail
account 






How to get score in search.jsp

2007-02-13 Thread Anton Potekhin
Hi Nutch Gurus!

I have a small problem. I need to make some changes to search.jsp: I want to get the
first 50 results and sort them in a different way. To sort, I will change the score of
each result with the formula new_score = nutch_score + domain_score_from_my_db, but I
don't understand how to get nutch_score in search.jsp.

For now I use a workaround: I get the nutch_score from the getValue() method of the
org.apache.lucene.search.Explanation class, but I think this is a very slow way to
do it.
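
For reference, the workaround boils down to something like this (a sketch; it assumes
direct access to the underlying Lucene Searcher and the rewritten Lucene query, which
search.jsp does not expose as such, and explain() re-scores the document, which is
exactly why it is slow):

  // Slow workaround: ask Lucene to explain one hit's score and read the value back.
  Explanation explanation = luceneSearcher.explain(luceneQuery, luceneDocId);
  float nutchScore = explanation.getValue();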

Can anybody help me to find a solution for this problem?

P.S. I hope that I described my problem clearly. Thanks in advance.

Sorry for the duplicated mail. I think I had some problems with my mail
account 




deep limitation

2006-11-06 Thread anton
Does Nutch 0.7.2 have any way to limit crawl depth?

I added a few pages. I need to process these pages and all pages located, for example,
3 clicks away from the added pages.
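
To make it concrete: for the one-shot crawl tool this is what I would expect the
-depth argument to express (from memory of the tutorial, so the exact form may differ):

  bin/nutch crawl urls -dir crawl.test -depth 3

With the separate whole-web tools I assume it is simply the number of
generate/fetch/updatedb rounds that are run.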

I hope I explained that clearly ;-)




RE: indexing problem

2006-09-07 Thread anton
Nutch is not compatible with latest hadoop from svn.

Nutch works correctly with the latest Hadoop from svn after some small tuning ;-)




indexing problem

2006-09-06 Thread anton
I've got the latest versions of Nutch (0.9-dev) and Hadoop (trunk) from svn.
When I try to index, I get the following error:

java.lang.ClassCastException: org.apache.nutch.parse.ParseData
 at org.apache.nutch.indexer.Indexer$InputFormat$1.next(Indexer.java:92)
 at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:184)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:196)
 at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)

 
This exception is raised from the next(Writable key, Writable value) method of the
SequenceFileRecordReader class.

The 'next' method is called with a 'value' parameter whose class differs from call to
call (CrawlDatum, ParseData or Inlinks), and when these values are cast I get the
ClassCastException.

Why do I get this exception? I looked at the old sources but didn't find any
difference in the algorithm. What am I missing?
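
For context, a sketch (not the actual Nutch source) of the kind of runtime-type
dispatch the reduce side relies on to cope with these mixed value classes:

  // Check the runtime class before casting instead of casting blindly.
  private void dispatch(Writable value) {
    if (value instanceof CrawlDatum) {
      CrawlDatum datum = (CrawlDatum) value;      // crawldb entry
      // ...
    } else if (value instanceof ParseData) {
      ParseData parseData = (ParseData) value;    // parse metadata
      // ...
    } else if (value instanceof Inlinks) {
      Inlinks inlinks = (Inlinks) value;          // linkdb entry
      // ...
    }
  }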




limitation

2006-09-04 Thread anton
How can I limit the number of pages processed from each domain? And how can I set up
Nutch to crawl only the domains I added (i.e., make Nutch ignore external links)? If
Nutch doesn't support this, what algorithm would be best for it?
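
For the ignore external links part, the only way I see in 0.7 is the URL filter,
i.e. entries like these in conf/crawl-urlfilter.txt (example.com / example.org are
placeholders for my own domains):

  # accept URLs only from my own domains
  +^http://([a-z0-9]*\.)*example.com/
  +^http://([a-z0-9]*\.)*example.org/
  # skip everything else
  -.

For limiting the number of pages per domain I have not found anything built in.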


p.s. nutch ver.0.7
  




Fetch error

2006-08-30 Thread anton
I updated Hadoop, but now I get the following error on the fetch step (reduce):

06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_00_3 0.3334%
reduce  copy (6 of 6 at 11.77 MB/s)
06/08/29 08:31:20 WARN /:
/getMapOutput.jsp?map=task_0003_m_02_0reduce=1:
java.lang.IllegalStateException
at
org.mortbay.jetty.servlet.ServletHttpResponse.getWriter(ServletHttpResponse.
java:561)
at
org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:122)
at
org.apache.jasper.runtime.JspWriterImpl.flushBuffer(JspWriterImpl.java:115)
at
org.apache.jasper.runtime.PageContextImpl.release(PageContextImpl.java:190)
at
org.apache.jasper.runtime.JspFactoryImpl.internalReleasePageContext(JspFacto
ryImpl.java:115)
at
org.apache.jasper.runtime.JspFactoryImpl.releasePageContext(JspFactoryImpl.j
ava:75)
at
org.apache.jsp.getMapOutput_jsp._jspService(getMapOutput_jsp.java:100)
at
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:3
24)
at
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
at
org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandl
er.java:475)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at
org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext
.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
at
org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
at
org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


How can I fix this? On the generate step everything works fine, but on the fetch
reduce step I get this error and the task fails.





RE: Fetch error

2006-08-30 Thread anton
The previous error came from the tasktracker log. In the jobtracker log I now see the
following error:

06/08/30 01:04:07 INFO mapred.TaskInProgress: Error from
task_0001_r_00_1: java.lang.AbstractMethodError: org.apache.n
utch.fetcher.FetcherOutputFormat.getRecordWriter(Lorg/apache/hadoop/fs/FileS
ystem;Lorg/apache/hadoop/mapred/JobConf;Ljava/
lang/String;Lorg/apache/hadoop/util/Progressable;)Lorg/apache/hadoop/mapred/
RecordWriter;
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:297)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 30, 2006 12:17 PM
To: nutch-dev@lucene.apache.org
Subject: Fetch error
Importance: High

I updated Hadoop, but now I get the following error on the fetch step (reduce):

06/08/29 08:31:20 INFO mapred.TaskTracker: task_0003_r_00_3 0.3334%
reduce  copy (6 of 6 at 11.77 MB/s)
06/08/29 08:31:20 WARN /:
/getMapOutput.jsp?map=task_0003_m_02_0reduce=1:
java.lang.IllegalStateException
at
org.mortbay.jetty.servlet.ServletHttpResponse.getWriter(ServletHttpResponse.
java:561)
at
org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:122)
at
org.apache.jasper.runtime.JspWriterImpl.flushBuffer(JspWriterImpl.java:115)
at
org.apache.jasper.runtime.PageContextImpl.release(PageContextImpl.java:190)
at
org.apache.jasper.runtime.JspFactoryImpl.internalReleasePageContext(JspFacto
ryImpl.java:115)
at
org.apache.jasper.runtime.JspFactoryImpl.releasePageContext(JspFactoryImpl.j
ava:75)
at
org.apache.jsp.getMapOutput_jsp._jspService(getMapOutput_jsp.java:100)
at
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:3
24)
at
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
at
org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandl
er.java:475)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at
org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext
.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
at
org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
at
org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


How can I fix this? On the generate step everything works fine, but on the fetch
reduce step I get this error and the task fails.







RE: problem with nutch

2006-08-25 Thread anton
I tried to start the job tracker without Tomcat.

-Original Message-
From: Chris Stephens [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 23, 2006 6:16 PM
To: nutch-dev@lucene.apache.org
Subject: Re: problem with nutch
Importance: High

This is probably a better question for the user list.  
nutch-user@lucene.apache.org

It looks like you're trying to bind Tomcat to a port that is already being 
used.  Edit your configuration file and change the default port (usually 
8080) to something that is available on that server.

[EMAIL PROTECTED] wrote:
 When I try to start Nutch 0.8 I get errors. How can I solve this problem?

 JobTracker log:

 ...Skiped...
 06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.endian' is
 little
 06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.isalist' is 
 06/08/23 05:19:40 INFO util.Credential: Checking Resource aliases
 06/08/23 05:19:40 INFO http.HttpServer: Version Jetty/5.1.4
 06/08/23 05:19:41 INFO util.Container: Started
 [EMAIL PROTECTED]
 06/08/23 05:19:41 INFO util.Container: Started WebApplicationContext[/,/]
 06/08/23 05:19:41 WARN servlet.WebApplicationContext: Web application not
 found
 



   





RE: problem with nutch

2006-08-25 Thread anton
To be exact: when I started the job tracker, only the namenode was running on that
server. None of the ports from hadoop-default.xml were in use.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 25, 2006 10:48 AM
To: nutch-dev@lucene.apache.org
Subject: RE: problem with nutch
Importance: High

I tried to start the job tracker without Tomcat.

-Original Message-
From: Chris Stephens [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 23, 2006 6:16 PM
To: nutch-dev@lucene.apache.org
Subject: Re: problem with nutch
Importance: High

This is probably a better question for the user list.  
nutch-user@lucene.apache.org

It looks like you're trying to bind Tomcat to a port that is already being 
used.  Edit your configuration file and change the default port (usually 
8080) to something that is available on that server.

[EMAIL PROTECTED] wrote:
 When I try to start Nutch 0.8 I get errors. How can I solve this problem?

 JobTracker log:

 ...Skiped...
 06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.endian' is
 little
 06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.isalist' is 
 06/08/23 05:19:40 INFO util.Credential: Checking Resource aliases
 06/08/23 05:19:40 INFO http.HttpServer: Version Jetty/5.1.4
 06/08/23 05:19:41 INFO util.Container: Started
 [EMAIL PROTECTED]
 06/08/23 05:19:41 INFO util.Container: Started WebApplicationContext[/,/]
 06/08/23 05:19:41 WARN servlet.WebApplicationContext: Web application not
 found
 



   







RE: problem with nutch

2006-08-25 Thread anton
In addition, please note the following part of the log:

06/08/25 05:07:59 WARN servlet.WebApplicationContext: Web application not
found /spider_kakle_mapred/spider/conf:/spider_
06/08/25 05:07:59 WARN servlet.WebApplicationContext: Configuration error on
/spider_kakle_mapred/spider/conf:/spider_kak
java.io.FileNotFoundException:
/spider_kakle_mapred/spider/conf:/spider_kakle_mapred/jdk1.5.0_06/lib/tools.
jar:/spider_ka
at
org.mortbay.jetty.servlet.WebApplicationContext.resolveWebApp(WebApplication
Context.java:266)
at
org.mortbay.jetty.servlet.WebApplicationContext.doStart(WebApplicationContex
t.java:449)
at org.mortbay.util.Container.start(Container.java:72)
at org.mortbay.http.HttpServer.doStart(HttpServer.java:753)
at org.mortbay.util.Container.start(Container.java:72)
at
org.apache.hadoop.mapred.StatusHttpServer.start(StatusHttpServer.java:172)
at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:461)
at
org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1143)
06/08/25 05:07:59 INFO util.Container: Started HttpContext[/logs,/logs]
06/08/25 05:07:59 INFO util.Container: Started HttpContext[/static,/static]
06/08/25 05:07:59 INFO http.SocketListener: Started SocketListener on
0.0.0.0:8010
06/08/25 05:07:59 WARN mapred.JobTracker: Starting tracker
java.io.IOException: Problem starting http server
at
org.apache.hadoop.mapred.StatusHttpServer.start(StatusHttpServer.java:195)
at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:461)
at
org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:68)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:1143)
Caused by: org.mortbay.util.MultiException[java.io.FileNotFoundException:
/spider_kakle_mapred/spider/conf:/spider_kakle_
at org.mortbay.http.HttpServer.doStart(HttpServer.java:731)




In this log I see an exception (java.io.FileNotFoundException). What file was not
found? What can I do?




problem with nutch

2006-08-23 Thread anton
When I try to start Nutch 0.8 I get errors. How can I solve this problem?

JobTracker log:

...Skiped...
06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.endian' is
little
06/08/23 05:19:40 INFO mapred.JobTracker: Property 'sun.cpu.isalist' is 
06/08/23 05:19:40 INFO util.Credential: Checking Resource aliases
06/08/23 05:19:40 INFO http.HttpServer: Version Jetty/5.1.4
06/08/23 05:19:41 INFO util.Container: Started
[EMAIL PROTECTED]
06/08/23 05:19:41 INFO util.Container: Started WebApplicationContext[/,/]
06/08/23 05:19:41 WARN servlet.WebApplicationContext: Web application not
found
/spider_kakle_mapred/spider/conf:/spider_kakle_mapred/jdk1.5.0_06/lib/tools.
jar:/spider_kakle_mapred/spider:/spider_kakle_mapred/spider/hadoop-*.jar:/sp
ider_kakle_mapred/spider/lib/commons-cli-2.0-SNAPSHOT.jar:/spider_kakle_mapr
ed/spider/lib/commons-lang-2.1.jar:/spider_kakle_mapred/spider/lib/commons-l
ogging-1.0.4.jar:/spider_kakle_mapred/spider/lib/commons-logging-api-1.0.4.j
ar:/spider_kakle_mapred/spider/lib/concurrent-1.3.4.jar:/spider_kakle_mapred
/spider/lib/hadoop.jar:/spider_kakle_mapred/spider/lib/jakarta-oro-2.0.7.jar
:/spider_kakle_mapred/spider/lib/jetty-5.1.4.jar:/spider_kakle_mapred/spider
/lib/junit-3.8.1.jar:/spider_kakle_mapred/spider/lib/log4j-1.2.13.jar:/spide
r_kakle_mapred/spider/lib/lucene.jar:/spider_kakle_mapred/spider/lib/nutch.j
ar:/spider_kakle_mapred/spider/lib/servlet-api.jar:/spider_kakle_mapred/spid
er/lib/taglibs-i18n.jar:/spider_kakle_mapred/spider/lib/xerces-2_6_2-apis.ja
r:/spider_kakle_mapred/spider/lib/xerces-2_6_2.jar:/spider_kakle_mapred/spid
er/lib/jetty-ext/ant.jar:/spider_kakle_mapred/spider/lib/jetty-ext/commons-e
l.jar:/spider_kakle_mapred/spider/lib/jetty-ext/jasper-compiler.jar:/spider_
kakle_mapred/spider/lib/jetty-ext/jasper-runtime.jar:/spider_kakle_mapred/sp
ider/lib/jetty-ext/jsp-api.jar
06/08/23 05:19:41 WARN servlet.WebApplicationContext: Configuration error on
/spider_kakle_mapred/spider/conf:/spider_kakle_mapred/jdk1.5.0_06/lib/tools.
jar:/spider_kakle_mapred/spider:/spider_kakle_mapred/spider/hadoop-*.jar:/sp
ider_kakle_mapred/spider/lib/commons-cli-2.0-SNAPSHOT.jar:/spider_kakle_mapr
ed/spider/lib/commons-lang-2.1.jar:/spider_kakle_mapred/spider/lib/commons-l
ogging-1.0.4.jar:/spider_kakle_mapred/spider/lib/commons-logging-api-1.0.4.j
ar:/spider_kakle_mapred/spider/lib/concurrent-1.3.4.jar:/spider_kakle_mapred
/spider/lib/hadoop.jar:/spider_kakle_mapred/spider/lib/jakarta-oro-2.0.7.jar
:/spider_kakle_mapred/spider/lib/jetty-5.1.4.jar:/spider_kakle_mapred/spider
/lib/junit-3.8.1.jar:/spider_kakle_mapred/spider/lib/log4j-1.2.13.jar:/spide
r_kakle_mapred/spider/lib/lucene.jar:/spider_kakle_mapred/spider/lib/nutch.j
ar:/spider_kakle_mapred/spider/lib/servlet-api.jar:/spider_kakle_mapred/spid
er/lib/taglibs-i18n.jar:/spider_kakle_mapred/spider/lib/xerces-2_6_2-apis.ja
r:/spider_kakle_mapred/spider/lib/xerces-2_6_2.jar:/spider_kakle_mapred/spid
er/lib/jetty-ext/ant.jar:/spider_kakle_mapred/spider/lib/jetty-ext/commons-e
l.jar:/spider_kakle_mapred/spider/lib/jetty-ext/jasper-compiler.jar:/spider_
kakle_mapred/spider/lib/jetty-ext/jasper-runtime.jar:/spider_kakle_mapred/sp
ider/lib/jetty-ext/jsp-api.jar
java.io.FileNotFoundException:
/spider_kakle_mapred/spider/conf:/spider_kakle_mapred/jdk1.5.0_06/lib/tools.
jar:/spider_kakle_mapred/spider:/spider_kakle_mapred/spider/hadoop-*.jar:/sp
ider_kakle_mapred/spider/lib/commons-cli-2.0-SNAPSHOT.jar:/spider_kakle_mapr
ed/spider/lib/commons-lang-2.1.jar:/spider_kakle_mapred/spider/lib/commons-l
ogging-1.0.4.jar:/spider_kakle_mapred/spider/lib/commons-logging-api-1.0.4.j
ar:/spider_kakle_mapred/spider/lib/concurrent-1.3.4.jar:/spider_kakle_mapred
/spider/lib/hadoop.jar:/spider_kakle_mapred/spider/lib/jakarta-oro-2.0.7.jar
:/spider_kakle_mapred/spider/lib/jetty-5.1.4.jar:/spider_kakle_mapred/spider
/lib/junit-3.8.1.jar:/spider_kakle_mapred/spider/lib/log4j-1.2.13.jar:/spide
r_kakle_mapred/spider/lib/lucene.jar:/spider_kakle_mapred/spider/lib/nutch.j
ar:/spider_kakle_mapred/spider/lib/servlet-api.jar:/spider_kakle_mapred/spid
er/lib/taglibs-i18n.jar:/spider_kakle_mapred/spider/lib/xerces-2_6_2-apis.ja
r:/spider_kakle_mapred/spider/lib/xerces-2_6_2.jar:/spider_kakle_mapred/spid
er/lib/jetty-ext/ant.jar:/spider_kakle_mapred/spider/lib/jetty-ext/commons-e
l.jar:/spider_kakle_mapred/spider/lib/jetty-ext/jasper-compiler.jar:/spider_
kakle_mapred/spider/lib/jetty-ext/jasper-runtime.jar:/spider_kakle_mapred/sp
ider/lib/jetty-ext/jsp-api.jar
   at
org.mortbay.jetty.servlet.WebApplicationContext.resolveWebApp(WebApplication
Context.java:266)
   at
org.mortbay.jetty.servlet.WebApplicationContext.doStart(WebApplicationContex
t.java:449)
   at org.mortbay.util.Container.start(Container.java:72)
   at org.mortbay.http.HttpServer.doStart(HttpServer.java:753)
   at org.mortbay.util.Container.start(Container.java:72)
   at
org.apache.hadoop.mapred.StatusHttpServer.start(StatusHttpServer.java:154)
   at 

some questions

2006-08-18 Thread anton
I plan to use Nutch 0.8 on several computers with DFS, but I'm worried about Nutch's
free disk space requirements.

For example, suppose I have

1) a server with the job tracker and namenode
2) 5 servers with task trackers and 20 GB HDDs
3) 5 servers with datanodes, also with 20 GB HDDs (DFS, with replication set to 1)

There are some questions:

1) Is this HDD space enough to run task trackers?

2) How do I calculate the approximate free HDD space needed for the servers with
task trackers and for the server with the job tracker and namenode?

3) Will I be able to increase the data storage space by increasing the number of
datanode servers, or will adding datanodes alone not be enough?




RE: nutch

2006-08-02 Thread anton
My settings:

<property>
  <name>mapred.local.dir</name>
  <value>/hadoop/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk i/o.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>


My device, which is mounted on /, has 115G of free space.

[EMAIL PROTECTED] /]# df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/sda2 133G   13G  113G  11% /

Anybody have other ideas?








-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 02, 2006 6:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: nutch
Importance: High

most probably you have run out of space in the tmp (local) filesystem

use properties like

<property>
  <name>mapred.system.dir</name>
  <value><!-- path to fs that contains a lots of space --></value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value><!-- path to fs that contains a lots of space --></value>
</property>

in hadoop-site.xml to get over this problem.


[EMAIL PROTECTED] wrote:

I forgot ;-) One more question:
is this a problem with Nutch or with Hadoop?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 02, 2006 11:38 AM
To: nutch-dev@lucene.apache.org
Subject: nutch
Importance: High

I use Nutch 0.8 (mapred). Nutch is started on 3 servers.
When Nutch tries to index a segment, I get an error on the tasktracker:
...skipped...





  






Problem opening checksum file

2006-06-22 Thread anton
I create a file on DFS (for example with the filename done). Then I try to copy this
file from DFS to the local filesystem. As a result I get the file in the local
filesystem, together with this error:

Problem opening checksum file: /user/root/crawl/done.  Ignoring with
exception org.apache.hadoop.ipc.RemoteException: jav
a.io.IOException: Cannot open filename /user/root/crawl/done.crc
at org.apache.hadoop.dfs.NameNode.open(NameNode.java:130)
at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm
pl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:240)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:218) 


To create the file I use this code:
FileSystem fs = ...
fs.createNewFile(new Path(segments[i], "already_indexed"));

To copy the file to the local filesystem I use this code:
fs.copyToLocalFile(...,...);

How do I create the .crc file?
Why is the .crc file not created automatically when creating a file on DFS?
How do I correctly create a file on DFS?
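
For reference, the variant I am going to try next looks roughly like this (a sketch;
the path is made up, and whether create()/close() is what produces the companion .crc
is my assumption):

  // A sketch (classes from org.apache.hadoop.fs, org.apache.hadoop.conf.Configuration
  // and org.apache.nutch.util.NutchConfiguration).
  // Create an empty marker file through create()/close() so the data goes through
  // the checksumming code path, then copy it back to the local filesystem.
  Configuration conf = NutchConfiguration.create();
  FileSystem fs = FileSystem.get(conf);
  Path marker = new Path("crawl/segments/20060622/already_indexed");   // made-up path
  FSDataOutputStream out = fs.create(marker);
  out.close();
  fs.copyToLocalFile(marker, new Path("/tmp/already_indexed"));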






search speed

2006-06-15 Thread anton
I am using DFS. My index contains 3,706,249 documents. Presently a search takes from
2 to 4 seconds (I tested with a query of 3 search terms). Tomcat is started on a box
with dual Opteron 2.4 GHz CPUs and 16 GB RAM. I think search is very slow now.
Can we make search faster?
What factors influence search speed?





free disk space

2006-06-14 Thread anton
I'm using Nutch v0.8 and have 3 computers. Two of them have a datanode and
tasktracker running; the other one has the namenode and jobtracker running. Do I need
more disk space on the machines running tasktrackers and the jobtracker as the number
of pages processed grows along with the size of the database? Would I be able to add
a 3rd datanode when I run out of free disk space on the computers that have a datanode
installed?

How much free disk space do I need in order for task- and jobtrackers to
work properly?




No space left on device

2006-06-14 Thread anton

I'm using Nutch v0.8 and have 3 computers.
One of my tasktrackers always goes down.
This occurs during indexing (index crawl/indexes). On the server with the crashed
tasktracker, 53G of free disk space is now available and only 11G is used.
How can I solve this problem? Why does the tasktracker require so much free space
on the HDD?

Piece of the log with the error:

060613 151840 task_0083_r_01_0 0.5% reduce  sort
060613 151841 task_0083_r_01_0 0.5% reduce  sort
060613 151842 task_0083_r_01_0 0.5% reduce  sort
060613 151843 task_0083_r_01_0 0.5% reduce  sort
060613 151844 task_0083_r_01_0 0.5% reduce  sort
060613 151845 task_0083_r_01_0 0.5% reduce  sort
060613 151846 task_0083_r_01_0 0.5% reduce  sort
060613 151847 task_0083_r_01_0 0.5% reduce  sort
060613 151847 SEVERE FSError, exiting: java.io.IOException: No space left on
device
060613 151847 task_0083_r_01_0  SEVERE FSError from child
060613 151847 task_0083_r_01_0 org.apache.hadoop.fs.FSError:
java.io.IOException: No space left on device
060613 151847 task_0083_r_01_0  at
org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFile
Syst
060613 151847 task_0083_r_01_0  at
org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java
:69)
060613 151847 task_0083_r_01_0  at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStre
am.j
060613 151847 task_0083_r_01_0  at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
060613 151847 task_0083_r_01_0  at
java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
060613 151847 task_0083_r_01_0  at
java.io.DataOutputStream.flush(DataOutputStream.java:106)
060613 151847 task_0083_r_01_0  at
java.io.FilterOutputStream.close(FilterOutputStream.java:140)
060613 151847 task_0083_r_01_0  at
org.apache.hadoop.io.SequenceFile$Sorter$SortPass.close(SequenceFile.java:59
8)
060613 151847 task_0083_r_01_0  at
org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:533)
060613 151847 task_0083_r_01_0  at
org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:519)
060613 151847 task_0083_r_01_0  at
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
060613 151847 task_0083_r_01_0  060613 151847 task_0083_r_01_0
at org.apache.hadoop.mapred.TaskTracker$Chi
060613 151847 task_0083_r_01_0 Caused by: java.io.IOException: No space
left on device
060613 151847 task_0083_r_01_0  at
java.io.FileOutputStream.writeBytes(Native Method)
060613 151847 task_0083_r_01_0  at
java.io.FileOutputStream.write(FileOutputStream.java:260)
060613 151848 task_0083_r_01_0  at
org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFile
Syst
060613 151848 task_0083_r_01_0  ... 11 more
060613 151849 Server connection on port 50050 from 10.0.0.3: exiting
060613 151854 task_0083_m_01_0 done; removing files.
060613 151855 task_0083_m_03_0 done; removing files.





RE: No space left on device

2006-06-14 Thread anton
Yes, I use DFS.
How do I configure Nutch to deal with the disk space problem? How do I control the
number of smaller files?

-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 14, 2006 5:46 PM
To: nutch-dev@lucene.apache.org
Subject: Re: No space left on device
Importance: High

The tasktrackers require intermediate space while performing the map 
and reduce functions.  Many smaller files are produced during the map 
and reduce processes and are deleted when the processes finish.  If you 
are using the DFS then more disk space is required than is actually used 
since disk space is grabbed in blocks.

Dennis

[EMAIL PROTECTED] wrote:
 I'm using Nutch v0.8 and have 3 computers.
 One of my tasktrackers always goes down.
 This occurs during indexing (index crawl/indexes). On the server with the crashed
 tasktracker, 53G of free disk space is now available and only 11G is used.
 How can I solve this problem? Why does the tasktracker require so much free space
 on the HDD?

 Piece of Log with error:

 060613 151840 task_0083_r_01_0 0.5% reduce  sort
 060613 151841 task_0083_r_01_0 0.5% reduce  sort
 060613 151842 task_0083_r_01_0 0.5% reduce  sort
 060613 151843 task_0083_r_01_0 0.5% reduce  sort
 060613 151844 task_0083_r_01_0 0.5% reduce  sort
 060613 151845 task_0083_r_01_0 0.5% reduce  sort
 060613 151846 task_0083_r_01_0 0.5% reduce  sort
 060613 151847 task_0083_r_01_0 0.5% reduce  sort
 060613 151847 SEVERE FSError, exiting: java.io.IOException: No space left
on
 device
 060613 151847 task_0083_r_01_0  SEVERE FSError from child
 060613 151847 task_0083_r_01_0 org.apache.hadoop.fs.FSError:
 java.io.IOException: No space left on device
 060613 151847 task_0083_r_01_0  at

org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFile
 Syst
 060613 151847 task_0083_r_01_0  at

org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java
 :69)
 060613 151847 task_0083_r_01_0  at

org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStre
 am.j
 060613 151847 task_0083_r_01_0  at
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
 060613 151847 task_0083_r_01_0  at
 java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
 060613 151847 task_0083_r_01_0  at
 java.io.DataOutputStream.flush(DataOutputStream.java:106)
 060613 151847 task_0083_r_01_0  at
 java.io.FilterOutputStream.close(FilterOutputStream.java:140)
 060613 151847 task_0083_r_01_0  at

org.apache.hadoop.io.SequenceFile$Sorter$SortPass.close(SequenceFile.java:59
 8)
 060613 151847 task_0083_r_01_0  at
 org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:533)
 060613 151847 task_0083_r_01_0  at
 org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:519)
 060613 151847 task_0083_r_01_0  at
 org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:316)
 060613 151847 task_0083_r_01_0  060613 151847 task_0083_r_01_0
 at org.apache.hadoop.mapred.TaskTracker$Chi
 060613 151847 task_0083_r_01_0 Caused by: java.io.IOException: No
space
 left on device
 060613 151847 task_0083_r_01_0  at
 java.io.FileOutputStream.writeBytes(Native Method)
 060613 151847 task_0083_r_01_0  at
 java.io.FileOutputStream.write(FileOutputStream.java:260)
 060613 151848 task_0083_r_01_0  at

org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFile
 Syst
 060613 151848 task_0083_r_01_0  ... 11 more
 060613 151849 Server connection on port 50050 from 10.0.0.3: exiting
 060613 151854 task_0083_m_01_0 done; removing files.
 060613 151855 task_0083_m_03_0 done; removing files.



   




RE: resolving IP in...

2006-06-07 Thread anton

Does anyone know where I can download Nutch version 0.8? I can't find it :(


http://svn.apache.org/repos/asf/lucene/nutch/trunk/




summary

2006-06-05 Thread anton


My Nutch processed the pages
http://www.abc-internet.net/lavinia-lingerie/Lingerie.htm and
http://www.abc-internet.net/pamperedpassions-pampered_passions/Lingerie.htm.

When I search for the term lingerie, Nutch brings up results with a bad summary
(... Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie,
Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie,
Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie ...).

Please help me solve this problem...




RE: summary

2006-06-05 Thread anton

It's not a problem with Nutch!
Have you looked into spamdexing?

Yes, I understand this... But how do I fight this spam?


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 05, 2006 11:43 AM
To: nutch-dev@lucene.apache.org
Subject: summary



My Nutch processed the pages
http://www.abc-internet.net/lavinia-lingerie/Lingerie.htm and
http://www.abc-internet.net/pamperedpassions-pampered_passions/Lingerie.htm.

When I search for the term lingerie, Nutch brings up results with a bad summary
(... Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie,
Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie,
Lingerie, Lingerie, Lingerie, Lingerie, Lingerie, Lingerie ...).

Please help me solve this problem...






error

2006-05-22 Thread anton
I updated some plugins... and now I get errors in the Tomcat log:

May 22, 2006 3:28:50 AM org.apache.nutch.plugin.PluginRepository init
SEVERE: org.apache.nutch.plugin.PluginRuntimeException: Plugin
(summary-basic), extension point: org.apache.nutch.searcher.Summarizer does
not exist.

How do I fix this problem?





to count the number of pages from each domain

2006-05-05 Thread anton
We tried to develop a solution to count the number of pages from each domain.

We thought to do it like this:

- map: input k = UTF8 (url of page), v = CrawlDatum; output k = UTF8 (domain of
  page), v = UrlAndPage implementing Writable (a structure containing the url of a
  page and its CrawlDatum)

- reduce: input k = UTF8 (domain of page), v = iterator over a list of UrlAndPage;
  output k = UTF8 (url of page), v = CrawlDatum

- in the map function we parsed the domain from the url, created the UrlAndPage
  structure and put them into the OutputCollector

- in reduce we counted how many elements were in the list behind the iterator, put
  that count into each CrawlDatum, then formed new (url, CrawlDatum) pairs and put
  them into the OutputCollector
 

The following problem arose: as far as we can see, the input and output types of map
and reduce must be the same, but in our case they were different, which caused an
error like this:

060505 183200 task_0104_m_00_3 java.lang.RuntimeException:
java.lang.InstantiationException:
org.apache.nutch.crawl.PostUpdateFilter$UrlAn

dPage

060505 183200 task_0104_m_00_3  at
org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:366)

060505 183200 task_0104_m_00_3  at
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)

060505 183200 task_0104_m_00_3  at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)

060505 183200 task_0104_m_00_3  at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:755)

060505 183200 task_0104_m_00_3 Caused by:
java.lang.InstantiationException:
org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage

060505 183200 task_0104_m_00_3  at
java.lang.Class.newInstance0(Class.java:335)

060505 183200 task_0104_m_00_3  at
java.lang.Class.newInstance(Class.java:303)

060505 183200 task_0104_m_00_3  at
org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:364)

 

We concluded that in Hadoop it is impossible to have different input/output types for
map and reduce, so we decided to use another scheme. This scheme runs two jobs: the
first job has the map function, the second job has the reduce task, and the two jobs
have different classes for their input and output parameters. The new map and reduce
will do the same as described above.
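
One thing we noticed while writing this up: the InstantiationException in the trace
looks like Hadoop failing to create our value class reflectively, so UrlAndPage
probably has to be a public static nested (or top-level) class with a public
no-argument constructor. A sketch of what we mean (the field layout is only an
illustration):

  // Writable holding a URL plus its CrawlDatum. Hadoop instantiates value classes
  // via reflection, hence the public no-argument constructor.
  public static class UrlAndPage implements Writable {
    private String url = "";
    private CrawlDatum datum = new CrawlDatum();

    public UrlAndPage() {}                          // required by Hadoop

    public UrlAndPage(String url, CrawlDatum datum) {
      this.url = url;
      this.datum = datum;
    }

    public void write(DataOutput out) throws IOException {
      out.writeUTF(url);
      datum.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      url = in.readUTF();
      datum.readFields(in);
    }
  }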

 

 

We'd like to ask your advice on which approach is best for tasks like these. Is the
second way good? Are there any other ways to do this better?






JobTrackerInfoServer and nutch*.jar

2006-05-01 Thread anton
Why do JSP scripts launched under the JobTrackerInfoServer not see the classes from
nutch*.jar? How do I point the JobTrackerInfoServer at nutch*.jar?




new parameters

2006-04-28 Thread anton
We see new parameters in hadoop-default.xml: dfs.replication.max and
dfs.replication.min.
What do these parameters mean?




RE: exception

2006-04-27 Thread anton
We updated Hadoop from the trunk branch, but now we get new errors:

On the tasktracker side:
...skipped...
java.io.IOException: timed out waiting for response
at org.apache.hadoop.ipc.Client.call(Client.java:305)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:149)
at org.apache.hadoop.mapred.$Proxy0.pollForTaskWithClosedJob(Unknown
Source)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:310)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:374)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:813)
060427 062708 Client connection to 10.0.0.10:9001 caught:
java.lang.RuntimeException:
 java.lang.ClassNotFoundException:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
at
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:152)
at
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:139)
at
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:186)
at
org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:60)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170)
060427 062708 Client connection to 10.0.0.10:9001: closing


On the jobtracker side:
...skipped...
060427 061713 Server handler 3 on 9001 caught:
java.lang.IllegalArgumentException: Ar
gument is not an array
java.lang.IllegalArgumentException: Argument is not an array
at java.lang.reflect.Array.getLength(Native Method)
at
org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:92)
at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:250)
...skipped...

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 27, 2006 12:48 AM
To: nutch-dev@lucene.apache.org
Subject: Re: exception
Importance: High

This is a Hadoop DFS error.  It could mean that you don't have any 
datanodes running, or that all your datanodes are full.  Or, it could be 
a bug in dfs.  You might try a recent nightly build of Hadoop to see if 
it works any better.

Doug

Anton Potehin wrote:
 What does an error of the following type mean:
 
  
 
 java.rmi.RemoteException: java.io.IOException: Cannot obtain additional
 block for file /user/root/crawl/indexes/index/_0.prx
 
  
 
  
 
 




update crawldb

2006-04-24 Thread Anton Potehin
How do we update the info about links already added to the db? In particular, we need
to update the status of some of the links. What classes should we use to read the
info about each link stored in the db and then update its status? We use the trunk
branch of Nutch.
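
The direction we are currently looking at, sketched below; the on-disk layout (a
sequence of UTF8 url keys to CrawlDatum values under crawldb/current/part-NNNNN/data)
and the path are our assumptions about the current trunk, not something we have
confirmed:

  // Walk one crawldb part file and look at each URL's status.
  // (org.apache.hadoop.io.SequenceFile/UTF8, org.apache.nutch.crawl.CrawlDatum)
  Configuration conf = NutchConfiguration.create();
  FileSystem fs = FileSystem.get(conf);
  Path part = new Path("crawl/crawldb/current/part-00000/data");   // assumed layout
  SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
  UTF8 url = new UTF8();
  CrawlDatum datum = new CrawlDatum();
  while (reader.next(url, datum)) {
    if (datum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED) {
      // collect the entries whose status we want to change; writing them back
      // would go through a new db (e.g. a MapReduce job like the crawldb update)
    }
  }
  reader.close();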

 



mapred.map.tasks

2006-04-20 Thread Anton Potehin
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

 

We have a question about this property. Is it really preferable to set this parameter
to several times the number of available hosts? We do not understand why that should
be so.

Our spider is distributed among 3 machines. What value is most appropriate for this
parameter in our case? Which other factors may affect the best value for this
parameter?

 



RE: question about crawldb

2006-04-19 Thread anton


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 18, 2006 8:02 PM
To: nutch-dev@lucene.apache.org
Subject: Re: question about crawldb
Importance: High

Anton Potehin wrote:
 1.We have found these flags in CrawlDatum class: 
 
   public static final byte STATUS_SIGNATURE = 0;
   public static final byte STATUS_DB_UNFETCHED = 1;
   public static final byte STATUS_DB_FETCHED = 2;
   public static final byte STATUS_DB_GONE = 3;
   public static final byte STATUS_LINKED = 4;
   public static final byte STATUS_FETCH_SUCCESS = 5;
   public static final byte STATUS_FETCH_RETRY = 6;
   public static final byte STATUS_FETCH_GONE = 7;
 
 Though the names of these flags describe their aims, it is not clear
 completely what they mean and what is the difference between
 STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS for example.

The STATUS_DB_* codes are used in entries in the crawldb. 
STATUS_FETCH_* codes are used in fetcher output.  STATUS_LINKED is used 
in parser output for urls that are linked to.  A crawldb update combines 
all of these (the old version of the db, plus fetcher and parser output) 
to generate a new version of the db, containing only STATUS_DB_* 
entries.  This logic is in CrawlDbReducer.

Does that help?

Yes ;-) tnx...




question about crawldb

2006-04-18 Thread Anton Potehin
1.  We have found these flags in CrawlDatum class: 

  public static final byte STATUS_SIGNATURE = 0;
  public static final byte STATUS_DB_UNFETCHED = 1;
  public static final byte STATUS_DB_FETCHED = 2;
  public static final byte STATUS_DB_GONE = 3;
  public static final byte STATUS_LINKED = 4;
  public static final byte STATUS_FETCH_SUCCESS = 5;
  public static final byte STATUS_FETCH_RETRY = 6;
  public static final byte STATUS_FETCH_GONE = 7;

Though the names of these flags describe their purpose, it is not completely clear
what they mean and what the difference is between, for example, STATUS_DB_FETCHED
and STATUS_FETCH_SUCCESS.

 

 

2.  Where are new links added into the CrawlDb?

 



mapred branch

2006-04-10 Thread Anton Potehin
Where is the mapred branch of Nutch located now?



image search

2006-04-10 Thread Anton Potehin
Has anybody tried to create an image search based on Nutch?



Killing lines

2005-12-06 Thread anton
Here is a snippet from the TaskTracker log file:

051206 090643 Task task_r_qegmsh timed out.  Killing.
051206 090646 Task task_r_qegmsh timed out.  Killing.
051206 090649 Task task_r_qegmsh timed out.  Killing.
051206 090652 Task task_r_qegmsh timed out.  Killing.
051206 090655 Task task_r_qegmsh timed out.  Killing.
051206 090658 Task task_r_qegmsh timed out.  Killing.
051206 090701 Task task_r_qegmsh timed out.  Killing.
051206 090704 Task task_r_qegmsh timed out.  Killing.
051206 090707 Task task_r_qegmsh timed out.  Killing.
051206 090710 Task task_r_qegmsh timed out.  Killing.
051206 090712 task_r_qegmsh 0.04168% reduce  copy 
051206 091022 task_r_qegmsh 0.14583334% reduce  copy 
051206 091024 task_r_qegmsh 0.1667% reduce  copy 


The Killing lines repeat every 3 seconds; there are hundreds of them.
What does this mean?




mapred crawl

2005-11-23 Thread Anton Potehin
We used Nutch for whole-web crawling.

In an infinite loop we run these tasks:

1) bin/nutch generate <db> <segmentsPath> -topN 1

2) bin/nutch fetch <segment name>

3) bin/nutch updatedb <db> <segment name>

4) bin/nutch analyze <db> <segment name>

5) bin/nutch index <segment name>

6) bin/nutch dedup <segments> dedup.tmp

 

After each iteration we produce a new segment and can use it for search.

 

Now we are trying mapred. How can we use crawl in a similar way? We need results
during the process, not only at the end of crawling (since this is a very long
process - weeks).
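
For comparison, the per-iteration loop we are experimenting with on the mapred tools
looks roughly like this (written from memory of the 0.8-style tools, so the exact
arguments may differ, and the directory names are ours):

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
  bin/nutch invertlinks crawl/linkdb $s
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s
  bin/nutch dedup crawl/indexes

After each iteration the new segment and index should then be usable for search,
which is what we are after.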

 



About tomcat

2005-11-21 Thread Anton Potehin
We came to the conclusion that we need to restart the webapp for new results to
appear in search. How do we do this correctly without restarting Tomcat?

 

After Tomcat has been running for a long time, we get a too many open files error.
Maybe this is the result of restarting the webapp by touching web.xml? For now,
before starting Tomcat, we set the maximum number of open files to 4096 (1024 by
default), but we think this is not the right solution.

 

 

 



jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton
How do I use jobtracker.jsp and jobdetails.jsp?
Do they need Tomcat?

When I try to open jobdetails.jsp with Tomcat, it returns this error:
java.lang.NullPointerException
at
org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp:
53)
at
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:3
22)
at
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
FilterChain.java:252)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
ain.java:173)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
va:213)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
va:178)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126
)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105
)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
:107)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConne
ction(Http11Protocol.java:744)
at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.jav
a:527)
at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWo
rkerThread.java:80)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.jav
a:684)
at java.lang.Thread.run(Thread.java:595) 




RE: jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton
They don't need Tomcat? But then what should we type as the browser address?

http://host_jobtracker:port_jobtracker/jobtracker/jobtracker.jsp ?


-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 21, 2005 12:46 PM
To: nutch-dev@lucene.apache.org
Subject: Re: jobdetails.jsp and jobtracker.jsp

[EMAIL PROTECTED] wrote:

How to use jobtracker.jsp and jobdetails.jsp?
They need tomcat? 
  


No, but jobdetails.jsp requires a parameter (job_id) - start with 
jobtracker.jsp, and then follow the links.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






RE: jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton

Why do we need the mapred.map.tasks parameter to be greater than the number of
available hosts? If we set it equal to the number of hosts, we get the negative
progress percentages problem.




RE: mapred.map.tasks

2005-11-21 Thread anton
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.

In nutch-site.xml I specified parameters:

1) On both machines:
<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.250:9010</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
 



On 192.168.0.250 I started:
2)   bin/nutch-daemon.sh start datanode
3)   bin/nutch-daemon.sh start namenode
4)   bin/nutch-daemon.sh start jobtracker
5)   bin/nutch-daemon.sh start tasktracker

I created a directory seeds with a file urls in it; urls contained 2 links.
Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds seeds).
The directory was added successfully.

 

Then I launched the command:
bin/nutch crawl seeds -depth 2

As a result I received this log written by the jobtracker:

051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.
 

Log written by tasktracker on 192.168.0.111:
..
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.
 

Log written by tasktracker on 192.168.0.250:

051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on... e.g. in this log were records with reducing percents.

 

I concluded that there was an attempt to split the inject across the 2 machines,
i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'. 'task_m_z66npx'
finished successfully, while 'task_m_xaynqo' caused problems (negative progress).

But if I change the mapred.reduce.tasks parameter to 4, all tasks finish
successfully and everything works right.



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 22, 2005 2:10 AM
To: nutch-dev@lucene.apache.org
Subject: Re: mapred.map.tasks

[EMAIL PROTECTED] wrote:
 Why we need parameter mapred.map.tasks greater than number of available
 host? If we set it equal to number of host, we got negative progress
 percentages problem.

Can you please post a simple example that demonstrates the negative 
progress problem?  E.g., the minimal changes to your conf/ directory 
required to illustrate this, how you start your daemons, etc.

Thanks,

Doug




rank system

2005-11-08 Thread Anton Potehin
What about scoring in mapred? I have looked at crawl/Crawl.java but did not find
anything related to calculating page scores. Does mapred use a ranking system
somehow?

Is it possible to use mapred for distributed whole-web crawling, or does it work
with intranet crawling only?

 



RE: rank system

2005-11-08 Thread anton
Alright, I see that in crawl/Indexer.java, in the reduce method, there is a dbDatum
object which contains a score. But where is this score calculated?
What formula is used to calculate the score?

-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 08, 2005 1:54 PM
To: nutch-dev@lucene.apache.org
Subject: Re: rank system

Pre-score calculation is done in the indexer.
Yes, it works with complete web crawls as well, and it works very well
for that. :-)

Stefan

Am 08.11.2005 um 11:22 schrieb Anton Potehin:

 What about scoring in mapred? I have looked crawl/crawl.java but I did
 not found anything concerned with page scores calculating. Does the
 mapred use ranking system somehow?

 Is it possible to use mapred for clustering whole-web crawling or it
 works with Intranet Crawling only?








questions

2005-11-08 Thread Anton Potehin
After I looked through Crawl.java, I split all the tasks into several phases:

1)   Inject - here we add web-links into crawlDb

2)   Generate segment - here we create data segment

3)   Fetching

4)   Parse segment

5)   Update crawlDb - here the information is added from segment
into crawlDb

6)   Phases 2-5 are repeated several times

7)   Link db

 

I can't understand how the work is distributed across the cluster. Which phases can
be performed in parallel on several machines, and how can the jobs be split across
several machines?

What is performed in the 7th phase?

 



RE: questions

2005-11-08 Thread anton
Does that mean that every job in every phase can be split across several machines
(for example, can generate, or any of the other phases, be performed in parallel on
several machines)?

Could you give us the URL of the presentation on the wiki, please?

-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 08, 2005 4:47 PM
To: nutch-dev@lucene.apache.org
Subject: Re: questions 

Clustering is done at search time, and only the first 200 hits are clustered.
Parsing is normally done during fetching.
MapReduce splits each job into several tasks and reduces the results back together.
You will find some presentation slides in the wiki.
HTH
Stefan
Am 08.11.2005 um 14:31 schrieb Anton Potehin:

 After I looked thru Crawl.java I exploded all tasks for several  
 phases:

 1)   Inject - here we add web-links into crawlDb

 2)   Generate segment - here we create data segment

 3)   Fetching

 4)   Parse segment

 5)   Update crawlDb - here the information is added from segment
 into crawlDb

 6)   Phase 2 - 5 is repeated several times

 7)   Link db



 I can't understand how the clusterization is performed. What phases  
 may
 be performed parallel on several machines and how jobs may be  
 separated
 for several machines.

 What is performed at 7th phase?