max outlinks count in Nutch

2005-12-27 Thread K.A.Hussain Ali

Hi all,

Does the db.max.outlinks.per.page value in nutch-default.xml have a limitation?

When I crawl using the default value of 100, it fails to get many links.

Does this value control the number of links to be fetched from a page?

Any suggestion would greatly help.
Thanks in advance

regards
-Hussain
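
For context, db.max.outlinks.per.page is an ordinary configuration
property and can be overridden in nutch-site.xml. A hedged sketch (1000 is
an arbitrary example value, not a recommendation):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value>
  <description>Outlinks beyond this per-page cap are ignored, so the
  default of 100 drops links on very link-heavy pages.</description>
</property>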



Re: Out of memory error while updating

2005-12-27 Thread Stefan Groschupf

Should I change the value of 'io.sort.mb' and/or 'io.sort.factor'?
And if so, what should I change them to in order to eliminate the error?

Yes, since it looks like it crashes during sorting.
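
For example, both can be overridden in nutch-site.xml; a hedged sketch
(the values are illustrative, not tuned recommendations):

<property>
  <name>io.sort.mb</name>
  <value>50</value>
  <description>Buffer memory used while sorting, in MB; lower it to
  reduce memory pressure.</description>
</property>

<property>
  <name>io.sort.factor</name>
  <value>50</value>
  <description>Number of streams merged at once while sorting; lower it
  to reduce memory pressure at the cost of more merge passes.</description>
</property>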


Also, is there any minimum RAM requirement for Nutch to do
indexing and searching?


Well, not really, but you should have 1 GB of RAM if you want to do
serious things.

You can set up the heap size; from the bin/nutch script:
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#   Default is 1000.

...
JAVA_HEAP_MAX=-Xmx1000m
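
So, for example (a hedged sketch: 2000 MB is an arbitrary value, and the
db/segment paths are placeholders for your own):

# read by bin/nutch; overrides the 1000 MB default heap
export NUTCH_HEAPSIZE=2000
bin/nutch updatedb crawl/db crawl/segments/20051226...   # your own paths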

HTH
Stefan

Any help is greatly appreciated
Thanks in advance

regards
-Hussain.



- Original Message - From: Stefan Groschupf [EMAIL PROTECTED]

To: nutch-user@lucene.apache.org
Sent: Monday, December 26, 2005 7:18 PM
Subject: Re: Out of memory error while updating



Do you have a stack trace?
Is it maybe related to a 'too many open files' exception?
Also, you can try to lower 'io.sort.mb' and/or 'io.sort.factor'.

Stefan

On 26.12.2005, at 09:27, K.A.Hussain Ali wrote:


Hi all,

I am using Nutch to crawl a few sites. When I crawl to a certain depth
and then update the webdb, I get an Out of Memory error.

I increased the JVM size using JAVA_OPTS and even reduced the per-page
token limit in nutch-default.xml, but I still get such an error.

I am using Tomcat and I have only one application running on it.

What are the system requirements for Nutch to get rid of this error?

I have even tried the things mentioned on the mailing list, but nothing
has turned out to be fruitful.

Any help is greatly appreciated.
Thanks in advance

regards
-Hussain.


---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net









Re: How to run Nutch?

2005-12-27 Thread carmmello
This is the error message that I get:

[EMAIL PROTECTED] nutch-nightly]# bin/start-all.sh
cat: /root/.slaves: Arquivo ou diretório não encontrado (file or directory not found)
starting namenode, logging to /usr/nutch-nightly/nutch-root-namenode-localhost.localdomain.log
051227 085214 parsing file:/usr/nutch-nightly/conf/nutch-default.xml
051227 085214 parsing file:/usr/nutch-nightly/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Not a host:port pair: local
        at org.apache.nutch.ndfs.DataNode.createSocketAddr(DataNode.java:54)
        at org.apache.nutch.ndfs.NameNode.<init>(NameNode.java:52)
        at org.apache.nutch.ndfs.NameNode.main(NameNode.java:349)
starting jobtracker, logging to /usr/nutch-nightly/nutch-root-jobtracker-localhost.localdomain.log
051227 085215 parsing file:/usr/nutch-nightly/conf/nutch-default.xml
051227 085215 parsing file:/usr/nutch-nightly/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad mapred.job.tracker: local
        at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:254)
        at org.apache.nutch.mapred.JobTracker.<init>(JobTracker.java:228)
        at org.apache.nutch.mapred.JobTracker.startTracker(JobTracker.java:45)
        at org.apache.nutch.mapred.JobTracker.main(JobTracker.java:1070)
cat: /root/.slaves: Arquivo ou diretório não encontrado (file or directory not found)
[EMAIL PROTECTED] nutch-nightly]#

My nutch-site.xml is the saved nutch-default.xml, without modifications:


..
<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>ndfs.datanode.port</name>
  <value>50010</value>
  <description>The port number that the ndfs datanode server uses as a
  starting point to look for a free port to listen on.
  </description>
</property>

<property>
  <name>ndfs.name.dir</name>
  <value>/tmp/nutch/ndfs/name</value>
  <description>Determines where on the local filesystem the NDFS name node
  should store the name table.</description>
</property>

<property>
  <name>ndfs.data.dir</name>
  <value>/tmp/nutch/ndfs/data</value>
  <description>Determines where on the local filesystem an NDFS data node
  should store its blocks.  If this is a comma- or space-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.</description>
</property>

<!-- map/reduce properties -->

<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>The port that the MapReduce job tracker info webserver
  runs at.
  </description>
</property>

<property>
  <name>mapred.task.tracker.output.port</name>
  <value>50040</value>
  <description>The port number that the MapReduce task tracker output
  server uses as a starting point to look for a free port to listen on.
  </description>
</property>

<property>
  <name>mapred.task.tracker.report.port</name>
  <value>50050</value>
  <description>The port number that the MapReduce task tracker report
  server uses as a starting point to look for a free port to listen on.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/tmp/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a space- or comma-separated list of
  directories on different devices in order to spread disk i/o.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/tmp/nutch/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/tmp/nutch/mapred/temp</value>
  <description>A shared directory for temporary files.
  </description>
</property>

...

As a final reminder, if that matters, this computer is on a small
network (with a router) with another computer that runs another OS
performing other tasks.

Thank you for your attention







Re: How to run Nutch?

2005-12-27 Thread Stefan Groschupf

Do you have a one- or multi-machine installation planned?

On 27.12.2005, at 13:05, carmmello wrote:


Things got better.
Using webapps directly under nutch-nightly and using local: 5 in
the nutch-site.xml, I got:

[EMAIL PROTECTED] nutch-nightly]# bin/start-all.sh
cat: /root/.slaves: Arquivo ou diretório não encontrado (file or folder not found)
starting namenode, logging to /usr/nutch-nightly/nutch-root-namenode-localhost.localdomain.log
starting jobtracker, logging to /usr/nutch-nightly/nutch-root-jobtracker-localhost.localdomain.log
cat: /root/.slaves: Arquivo ou diretório não encontrado (file or folder not found)
[EMAIL PROTECTED] nutch-nightly]#







document markup to control indexing

2005-12-27 Thread Jeff Breidenbach

Hi all,

Another open source search engine, HtDig, allows web page authors to
mark up a page such that some sections are not indexed.  The syntax
looks like the following:

<!--htdig_noindex-->
... material inside is not indexed ...
<!--/htdig_noindex-->

Does a similar feature exist in Nutch? If the answer is "write a
plugin", does anyone have tips on where to start? Also, how hard is
something like this for a Nutch newbie who doesn't know anything about
HTML parsing? I have a bunch of documents already marked up with the
htdig syntax, and in the interests of interoperability I'm tempted to
follow the syntax exactly.

-Jeff


Re: document markup to control indexing

2005-12-27 Thread Jack Tang
Hi Jeff

Please refer to the getText() method of the
org.apache.nutch.parse.html.DOMContentUtils class (in the parse-html
plugin, of course). You can add your filter there easily. ;)

/Jack
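
For anyone following along, a rough sketch of the idea (illustrative only:
the class below is a simplified stand-in, not the actual DOMContentUtils
code, and it assumes the HTML parser keeps comment nodes in the DOM): walk
the tree in document order and suppress text between the two marker
comments.

import org.w3c.dom.Node;

/**
 * Minimal sketch: collect page text while skipping anything between
 * <!--htdig_noindex--> and <!--/htdig_noindex--> comment nodes.
 */
public class NoIndexTextExtractor {

  private boolean skipping = false;

  public void getText(StringBuffer sb, Node node) {
    if (node.getNodeType() == Node.COMMENT_NODE) {
      String comment = node.getNodeValue().trim();
      if ("htdig_noindex".equals(comment)) {
        skipping = true;            // entering a no-index region
      } else if ("/htdig_noindex".equals(comment)) {
        skipping = false;           // leaving a no-index region
      }
      return;
    }
    if (node.getNodeType() == Node.TEXT_NODE && !skipping) {
      sb.append(node.getNodeValue());
    }
    // Pre-order traversal visits nodes in document order, so the flag
    // toggles exactly at the marker comments.
    for (Node child = node.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      getText(sb, child);
    }
  }
}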

On 12/27/05, Jeff Breidenbach jeff@jab.org wrote:

 Hi all,

 Another open source search engine, HtDig, allows web page authors to
 mark up a page such that some sections are not indexed.  The syntax
 looks like the following:

 <!--htdig_noindex-->
 ... material inside is not indexed ...
 <!--/htdig_noindex-->

 Does a similar feature exist in Nutch? If the answer is "write a
 plugin", does anyone have tips on where to start? Also, how hard is
 something like this for a Nutch newbie who doesn't know anything about
 HTML parsing? I have a bunch of documents already marked up with the
 htdig syntax, and in the interests of interoperability I'm tempted to
 follow the syntax exactly.

 -Jeff



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: document markup to control indexing

2005-12-27 Thread Jeff Breidenbach

Please refer to the getText() method of the
org.apache.nutch.parse.html.DOMContentUtils class (in the parse-html
plugin, of course). You can add your filter there easily. ;)


Wow! That was really easy. Thanks.
--Jeff


Re: How to run Nutch?

2005-12-27 Thread carmmello
| Do you have a one- or multi-machine installation planned?

Just one machine.



Re: Distributed search corrupted output problem

2005-12-27 Thread Stefan Groschupf

Ed,
it is definitely not an encoding problem with RPC calls.
The following test passes on my box. It would be interesting to find the
problem, but setting up a distributed system to verify it would be too
time-consuming.

Can you try using the latest sources and check whether this still occurs?
I will read some more code and see if I can find anything that looks
like a problem.
It would be great if someone from the community could verify whether this
is really a bug and whether it is reproducible.
That search results using distributed search are different is a known
problem (see Jira).
Can you provide a second Tomcat running on another port, or maybe just
another Tomcat context running a Nutch UI pointing to a local index?



Stefan  


/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.nutch.ipc;

import java.lang.reflect.Method;
import java.net.InetSocketAddress;

import junit.framework.TestCase;

import org.apache.nutch.io.UTF8;

public class TestEncoding extends TestCase {

  private int PORT = 50232;

  private String TEXT = "座頭市"; // no idea what this means :)

  public void testEncoding() throws Exception {
    Server server = RPC.getServer(new HelloWorld(), PORT);
    server.start();

    Method method = HelloWorld.class.getMethod("helloWorld",
        new Class[] { UTF8.class });

    Object[][] parameter = new Object[1][1];
    parameter[0][0] = new UTF8(TEXT);

    UTF8[] values = (UTF8[]) RPC.call(method, parameter,
        new InetSocketAddress[] { new InetSocketAddress("127.0.0.1", PORT) });

    assertEquals(TEXT, values[0].toString());
  }

  class HelloWorld {
    public UTF8 helloWorld(UTF8 utf8) {
      return utf8;
    }
  }
}


On 27.12.2005, at 05:38, Ed Whittaker wrote:


Hi,

I'm running nutch-0.7.1 on a couple of RedHat 9 Linux machines. When I
execute catalina.sh start in the crawl directory (i.e. not using
distributed search) and query with a two-Kanji Japanese string, everything
works fine, i.e. the pages seem relevant and the output is in the correct
encoding.

However, when I run a distributed search using one search server specified
in search-servers.txt and the same index as used above, the *returned
pages are not the same* and the *output is corrupted*. To see an example
of this, go to:

http://asked.ru/search.jsp?query=%E6%9D%B1%E4%BA%AC

This queries Nutch with the string for Tokyo in Japanese. Unfortunately, I
can't provide access to an example of the working (non-distributed) setup,
but trust me, it looks good.

Note, this is not a problem concerning the Tomcat integration with Apache,
since accessing the distributed search setup via http://localhost:8080
gives identical (corrupted) output to what you'll get if you click on the
above link.

I would guess this is some socket encoding problem, since that is
ostensibly the only difference between the two configurations, isn't it?

Does anyone have a distributed search setup which doesn't have these
encoding problems? I.e., is there something wrong with my setup somewhere,
or is this a known bug?

-Ed




Re: How to run Nutch?

2005-12-27 Thread carmmello
Stefan wrote:

OK, you are somehow trying to get a map-reduce multi-machine installation
running on one machine; that of course will fail.
Just download or build a 0.8 release.
Decompress the archive into a folder called nutch-0.8.
Then try:
cd nutch-0.8
bin/nutch
The result should look like:
Usage: nutch COMMAND
where COMMAND is one of:
  crawl       one-step crawler for intranets
  readdb      read / dump crawl db
  readlinkdb  read / dump link db
  admin       database administration, including creation

Then you can start with the crawling command; you do not need any
configuration change for now!!!




I have tried this at the beginning, but it does not work. Just see my
original post that initiated this topic.
Thank you for all your attention.






multibyte character support status

2005-12-27 Thread Teruhiko Kurosaka
What is the current state and plan for multibyte
character support by Nutch?

As far as I can tell...

The PDF plugin uses PDFBox (www.pdfbox.org) which does not
work with Japanese and probably other multibyte characters
and code sets.

The Word plugin uses POI (http://jakarta.apache.org/poi/),
which doesn't seem to support Japanese. Some patches to
make it possible to support Japanese (and hopefully other
code sets) have been submitted to the POI project but
they have not been integrated because the project currently
has no committer.

RTF document plugin and PowerPoint plugin use home-grown
parsers.  What is the status of multibyte code set
(and single byte code set other than ISO-8859-1) support by
these plugins?

-Kuro


Re: file to http mapping

2005-12-27 Thread Jeff Breidenbach

Thank you, this approach worked nicely.

On Tue, 27 Dec 2005 3:03 am, Stefan Groschupf wrote:

Jeff,
no, such a solution does not exist.
Take a look at the indexing filters.
I suggest the following solution:
+ Write an indexing filter that adds a field 'webserverUrl' to the Lucene
document, containing your rewritten URL (it should be of type keyword).
+ Change the JSP page so that 'webserverUrl' is used as the link URL
instead of the real URL.
Changing the URL behind the scenes makes no sense, since Nutch uses the
URL as the key for all records. (A rough sketch of such a filter follows
below.)

HTH
Stefan
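
For concreteness, a hedged sketch of such a filter (the IndexingFilter
signature shown is the 0.7-era one and may differ in other versions; the
class name, the 'url' field lookup, and the file-to-http rewrite rule are
illustrative assumptions, not actual Nutch code):

package org.example.nutch;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.fetcher.FetchListEntry;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

/** Adds a 'webserverUrl' keyword field holding the http:// form of a file:// URL. */
public class WebserverUrlFilter implements IndexingFilter {

  public Document filter(Document doc, Parse parse, FetchListEntry fle)
      throws IndexingException {
    // Assumes the 'url' field was already added earlier in the filter
    // chain (e.g. by the index-basic plugin).
    String url = doc.get("url");
    if (url != null && url.startsWith("file:")) {
      // Hypothetical mapping -- adapt both prefixes to your own layout.
      String rewritten = url.replaceFirst("^file:/var/www",
                                          "http://www.example.com");
      // Field.Keyword = stored, indexed, not tokenized: the 'keyword'
      // type Stefan mentions.
      doc.add(Field.Keyword("webserverUrl", rewritten));
    }
    return doc;
  }
}

The JSP side then reads that stored field from each hit and uses it for
the result link instead of the real URL.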


How can I set a search server over NDFS

2005-12-27 Thread Gal Nitzan
Hi,

I have tried all available samples but was unsuccessful.


I am using the following command to start the server:

bin/nutch-daemon.sh start server 9003 crawl

I have set up a directory /hosts with the file search-servers.txt, which
contains: localhost 9003

but the Tomcat client does not connect to my search server at all.

Any idea what I am doing wrong?

Gal




Re: Trouble setting NDFS on multiple machines

2005-12-27 Thread Stefan Groschupf
The exception means that one client is unable to connect to one
*datanode*.
Check that the box that had this exception can open a connection to all
other datanodes with the correct port.

try
telnet machineNameAsUsedInNameNode DATANODE_PORT

Is it able to connect?

Stefan

On 27.12.2005, at 22:20, Gal Nitzan wrote:


Hi,

For some reason I am having trouble setting up NDFS on multiple machines;
I keep getting an exception.

My settings follow the guidelines (i.e. Doug's cheat sheet) on all three
machines:

  <name>fs.default.name</name>
  <value>nutchmst1.XX.com:9000</value>

All machines seem to be connecting to the namenode:

051227 223242 10 Opened server at 50010
051227 223242 11 Starting DataNode in: /nutch/ndfs/data/data
051227 223242 11 using BLOCKREPORT_INTERVAL of 3500482msec
051227 223242 12 Client connection to x.x.22.185:9000: starting

051227 230013 Server connection on port 9000 from x.x.22.186: starting
051227 230013 Got brand-new heartbeat from nutchnd1:50010
051227 230013 Block report from nutchnd1:50010: 0 blocks.
051227 230013 Server connection on port 9000 from x.x.22.183: starting
051227 230013 Got brand-new heartbeat from nutchws1:50010
051227 230013 Block report from nutchws1:50010: 0 blocks.

The problem:::

[EMAIL PROTECTED] trunk]$ bin/nutch ndfs -copyFromLocal urls.txt
051227 230324 parsing file:/home/nutchuser/nutch/trunk/conf/nutch-default.xml
051227 230324 parsing file:/home/nutchuser/nutch/trunk/conf/nutch-site.xml
051227 230324 No FS indicated, using default:nutchmst1.XXX.com:9000
051227 230324 Client connection to x.x.22.185:9000: starting
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
        at org.apache.nutch.fs.NDFSShell.main(NDFSShell.java:234)
[EMAIL PROTECTED] trunk]$ bin/nutch ndfs -copyFromLocal urls.txt urls/urls.txt
051227 230422 parsing file:/home/nutchuser/nutch/trunk/conf/nutch-default.xml
051227 230423 parsing file:/home/nutchuser/nutch/trunk/conf/nutch-site.xml
051227 230423 No FS indicated, using default:nutchmst1.xxx.com:9000
051227 230423 Client connection to x.x.22.185:9000: starting
Exception in thread "main" java.lang.NullPointerException
        at java.net.Socket.<init>(Socket.java:357)
        at java.net.Socket.<init>(Socket.java:207)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.nextBlockOutputStream(NDFSClient.java:573)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.<init>(NDFSClient.java:521)
        at org.apache.nutch.ndfs.NDFSClient.create(NDFSClient.java:83)
        at org.apache.nutch.fs.NDFSFileSystem.createRaw(NDFSFileSystem.java:71)
        at org.apache.nutch.fs.NFSDataOutputStream$Summer.<init>(NFSDataOutputStream.java:41)
        at org.apache.nutch.fs.NFSDataOutputStream.<init>(NFSDataOutputStream.java:129)
        at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:187)
        at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:174)
        at org.apache.nutch.fs.NDFSFileSystem.doFromLocalFile(NDFSFileSystem.java:178)
        at org.apache.nutch.fs.NDFSFileSystem.copyFromLocalFile(NDFSFileSystem.java:153)
        at org.apache.nutch.fs.NDFSShell.copyFromLocal(NDFSShell.java:46)
        at org.apache.nutch.fs.NDFSShell.main(NDFSShell.java:234)



I know I am missing something, but I can't figure out what.





---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: How can I set a search server over NDFS

2005-12-27 Thread Stefan Groschupf

Try the real DNS name of the box, or 127.0.0.1, instead of localhost.
Do you get any exception?

Stefan
On 28.12.2005, at 00:42, Gal Nitzan wrote:


Hi,

I have tried all available samples but was unsuccessful.


I am using the following command to start the server:

bin/nutch-daemon.sh start server 9003 crawl

I have set up a directory /hosts with the file search-servers.txt, which
contains: localhost 9003

but the Tomcat client does not connect to my search server at all.

Any idea what I am doing wrong?

Gal





---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




How can I set a search server over NDFS - Revised

2005-12-27 Thread Gal Nitzan
Do I need to run a search server if I want the search to use NDFS?

Anyway, in the nutch-site.xml that resides under Tomcat, searcher.dir =
crawl, and the name of the NDFS root is the same.

However, I still get 0 results, though I know for sure there are
documents in the index.
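
For reference, a hedged guess at the relevant Tomcat-side setting: for
distributed search, searcher.dir should point at a local directory
containing search-servers.txt (e.g. the /hosts directory above), not at
the crawl directory itself. A sketch for the webapp's nutch-site.xml (the
path is illustrative):

<property>
  <name>searcher.dir</name>
  <value>/hosts</value>
  <description>Directory holding search-servers.txt (lines of the form
  "host port"); the web app then queries those search servers instead of
  opening a local index.</description>
</property>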



On Wed, 2005-12-28 at 01:42 +0200, Gal Nitzan wrote:
 Hi,
 
 I have tried all available samples but was unsuccessful.
 
 
 I am using the following command to start the server:
 
 bin/nutch-daemon.sh start server 9003 crawl
 
 I have set up a directory /hosts with the file search-servers.txt, which
 contains: localhost 9003
 
 but the Tomcat client does not connect to my search server at all.
 
 Any idea what I am doing wrong?
 
 Gal
 
 
 




Re: Trouble setting NDFS on multiple machines

2005-12-27 Thread Stefan Groschupf

Interesting!
That is not a feature, that is a bug; maybe you can open a minor bug
report.

Thanks.
Stefan
On 28.12.2005, at 01:35, Gal Nitzan wrote:


Thanks for the prompt reply. However, it seems that the problem was
caused by running with JDK 1.5.

When I changed to 1.4.2, all seems to be working.

Thanks.

Gal.
On Wed, 2005-12-28 at 01:24 +0100, Stefan Groschupf wrote:

The exception means that one client is unable to connect to one
*datanode*.
Check that the box that had this exception can open a connection to
all other datanodes with the correct port.
try
telnet machineNameAsUsedInNameNode DATANODE_PORT

Is it able to connect?

Stefan

On 27.12.2005, at 22:20, Gal Nitzan wrote:


Hi,

For some reason I am having trouble setting up NDFS on multiple
machines; I keep getting an exception.

[... rest of the quoted message snipped; see the original earlier in this thread ...]







---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: Trouble setting NDFS on multiple machines

2005-12-27 Thread Gal Nitzan
Thanks for the prompt reply. However, it seems that the problem was
caused by running with JDK 1.5.

When I changed to 1.4.2, all seems to be working.

Thanks.

Gal.
On Wed, 2005-12-28 at 01:24 +0100, Stefan Groschupf wrote:
 The exception means that one client is unable to connect to one  
 *datanode*.
 Check that the box that had this exception can open a connection to  
 all other datanodes with the correct port.
 try
 telnet machineNameAsUsedInNameNode DATANODE_PORT
 
 Is it able to connect?
 
 Stefan
 
 On 27.12.2005, at 22:20, Gal Nitzan wrote:
 
  Hi,
 
  For some reason I am having trouble setting up NDFS on multiple
  machines; I keep getting an exception.
 
  [... rest of the quoted message snipped; see the original earlier in this thread ...]
 
 




Can we search based on two fields?

2005-12-27 Thread Kumar Limbu
Hi everyone,

I am currently indexing a single website, say www.somesite.com, but I do
not want to crawl URLs with a certain pattern, let's say 'nocrawl', i.e.
www.somesite.com/nocrawl.html or www.somesite.com/apage.php?nocrawl. I
want to discard any URLs that contain the pattern 'nocrawl'. How do I do
it? I am using Nutch version 0.7.1. Also, I want to use the 'crawl'
command for crawling these pages.

Thank you for your support.

--
Keep on smiling
:) Kumar


Crawler problem in 0.7 and 0.7.1

2005-12-27 Thread Chih How Bong
Hi all,
  I encountered problems when I run nutch 0.7 and 0.7.1 crawler.
  Although I have added a number of root url in a plain text file *urls *as
it the crawler seems unwillingly to fetch any of the urls. However, when In
fall back to the nutch 0.6, everything just works fine under it.
  Therefore, I wondering if this problem happen to all of you? Currently, I
am running nutch 0.7.1 with JDK1.5 update 6 on Ubuntu 5.10. Anywhere I came
across the same problem under my apple Mac too.
  Below are the content of the log of the crawler, it shows that the crawler
returrns 0 entry.
  Thanks in advance.


051227 212142 parsing file:/opt/nutch-0.7.1/conf/nutch-default.xml
051227 212143 parsing file:/opt/nutch-0.7.1/conf/crawl-tool.xml
051227 212143 parsing file:/opt/nutch-0.7.1/conf/nutch-site.xml
051227 212143 No FS indicated, using default:local
051227 212143 crawl started in: crawl.test
051227 212143 rootUrlFile = urls
051227 212143 threads = 10
051227 212143 depth = 3
...
051227 212143 Added 0 pages
051227 212143 FetchListTool started
051227 212144 Overall processing: Sorted 0 entries in 0.0 seconds.
051227 212144 Overall processing: Sorted NaN entries/second
051227 212144 FetchListTool completed
051227 212144 logging at INFO
051227 212145 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212145 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212145 Finishing update
051227 212145 Update finished
051227 212145 FetchListTool started
051227 212145 Overall processing: Sorted 0 entries in 0.0 seconds.
051227 212145 Overall processing: Sorted NaN entries/second
051227 212145 FetchListTool completed
051227 212145 logging at INFO
051227 212146 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212146 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212146 Finishing update
051227 212146 Update finished
051227 212146 FetchListTool started
051227 212146 Overall processing: Sorted 0 entries in 0.0 seconds.
051227 212146 Overall processing: Sorted NaN entries/second
051227 212146 FetchListTool completed
051227 212146 logging at INFO
051227 212147 Updating /opt/nutch-0.7.1/crawl.test/db
051227 212147 Updating for /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212147 Finishing update
051227 212147 Update finished
051227 212147 Updating /opt/nutch-0.7.1/crawl.test/segments from /opt/nutch-0.7.1/crawl.test/db
051227 212147  reading /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212148  reading /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212148  reading /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212148 Sorting pages by url...
051227 212148 Getting updated scores and anchors from db...
051227 212148 Sorting updates by segment...
051227 212148 Updating segments...
051227 212148 Done updating /opt/nutch-0.7.1/crawl.test/segments from /opt/nutch-0.7.1/crawl.test/db
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212143
051227 212148 * Opening segment 20051227212143
051227 212148 * Indexing segment 20051227212143
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
051227 212148 DONE indexing segment 20051227212143: total 0 records in 0.026s (NaN rec/s).
051227 212148 done indexing
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212145
051227 212148 * Opening segment 20051227212145
051227 212148 * Indexing segment 20051227212145
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
051227 212148 DONE indexing segment 20051227212145: total 0 records in 0.075s (NaN rec/s).
051227 212148 done indexing
051227 212148 indexing segment: /opt/nutch-0.7.1/crawl.test/segments/20051227212146
051227 212148 * Opening segment 20051227212146
051227 212148 * Indexing segment 20051227212146
051227 212148 * Optimizing index...
051227 212148 * Moving index to NFS if needed...
051227 212148 DONE indexing segment 20051227212146: total 0 records in 0.011s (NaN rec/s).
051227 212148 done indexing
051227 212148 Reading url hashes...
051227 212148 Sorting url hashes...
051227 212148 Deleting url duplicates...
051227 212148 Deleted 0 url duplicates.
051227 212148 Reading content hashes...
051227 212148 Sorting content hashes...
051227 212148 Deleting content duplicates...
051227 212148 Deleted 0 content duplicates.
051227 212148 Duplicate deletion complete locally.  Now returning to NFS...
051227 212148 DeleteDuplicates complete
051227 212148 Merging segment indexes...
051227 212148 crawl finished: crawl.test

Rgds
Chih-How Bong


Re: Crawler problem in 0.7 and 0.7.1

2005-12-27 Thread Pushpesh Kr. Rajwanshi
Hi there,

Can u check ur crawl filter.txt file? I guess there is slight handling
problem in code.

+^http://([a-z0-9]*\.)*google.com

works
but

+^http://([a-z0-9]*\.)*google.com/

doesnt work

U see the leading slash messes and wont allow to inject urls. So try
removing / at the end in crawlurl filter.txt file and then it should work

HTH
Pushpesh
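
For reference, a minimal crawl-urlfilter.txt sketch along those lines
(the host is illustrative; rules are checked top-down and the first match
decides):

# skip URLs containing characters best avoided while crawling
-[?*!@=]
# accept the target site -- note: no trailing slash
+^http://([a-z0-9]*\.)*google.com
# skip everything else
-.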


On 12/28/05, Chih How Bong [EMAIL PROTECTED] wrote:

 Hi all,
 I encountered problems when I ran the Nutch 0.7 and 0.7.1 crawler.
 Although I have added a number of root URLs in a plain text file *urls*,
 the crawler seems unwilling to fetch any of them. However, when I fall
 back to Nutch 0.6, everything just works fine under it.
 Therefore, I wonder if this problem happens to all of you. Currently, I
 am running Nutch 0.7.1 with JDK 1.5 update 6 on Ubuntu 5.10. I came
 across the same problem on my Apple Mac too.
 Below is the content of the crawler log; it shows that the crawler
 returns 0 entries.
 Thanks in advance.


 [... quoted crawl log snipped; see the original message above ...]

Re: Is any one able to successfully run Distributed Crawl?

2005-12-27 Thread Nutch Newbie
Have you tried the following:

http://wiki.apache.org/nutch/HardwareRequirements

and

http://wiki.apache.org/nutch/

There is no quick answer if one is planning to crawl a million
pages. Read... Try... Read...


On 12/28/05, Pushpesh Kr. Rajwanshi [EMAIL PROTECTED] wrote:
 Hi,

 I want to know if anyone has been able to successfully run a distributed
 crawl on multiple machines, crawling millions of pages. How hard is it to
 do that? Do I just have to do some configuration and setup, or some
 implementation work as well?

 Also, can anyone tell me: if I want to crawl around 20,000 websites (say
 to depth 5) in a day, is it possible, and if so, how many machines would
 I roughly require? And what configuration will I need? I would appreciate
 even very approximate numbers, as I can understand it might not be
 trivial to find out. Or maybe it is :-)

 TIA
 Pushpesh