Boost urls to crawl by anchor text

2010-01-18 Thread Eran Zinman
Hi all,

I've created a custom scoring filter plugin which implements the
ScoringFilter interface.

My main goal is that once a certain page is fetched and parsed, I want to
analyze its outlinks and decide which links to follow next. One of the
criteria that helps me decide is the link's anchor text.

For example, if a certain link from the current page has anchor text that
contains the word "Games", I wish to boost it so it will be fetched in the
next round.

From what I've seen, the *updateDbScore(Text url, CrawlDatum old, CrawlDatum
datum, List<CrawlDatum> inlinked)* function receives only the URL text, and I
have no access to the URL's anchor text.

Any idea how I can get the anchor text of a certain URL in the
updateDbScore function?

Thanks,
Eran
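The anchor text is not part of what updateDbScore receives, but it is
available wherever the parsed page's outlinks are handled, since each
org.apache.nutch.parse.Outlink carries its anchor text. Below is a minimal
sketch of the kind of boost described above, assuming a hypothetical helper
that is handed the outlinks plus the CrawlDatum created for each target URL;
the Map-based plumbing is illustrative only, not Nutch's scoring API.

// Hypothetical helper, not part of the Nutch API: boosts the candidate
// CrawlDatum of every outlink whose anchor text contains a keyword.
import java.util.Map;

import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Outlink;

public class AnchorBoostHelper {

  /** Multiply the score of every target whose anchor contains the keyword. */
  public static void boostByAnchor(Outlink[] outlinks,
                                   Map<String, CrawlDatum> targets,
                                   String keyword, float factor) {
    if (outlinks == null) return;
    for (Outlink link : outlinks) {
      String anchor = link.getAnchor();
      if (anchor != null
          && anchor.toLowerCase().contains(keyword.toLowerCase())) {
        CrawlDatum datum = targets.get(link.getToUrl());
        if (datum != null) {
          datum.setScore(datum.getScore() * factor);   // simple multiplicative boost
        }
      }
    }
  }
}

With such a helper, a scoring plugin could call something like
boostByAnchor(parseData.getOutlinks(), targets, "Games", 2.0f) at the point
where scores are assigned to outlinks, rather than in updateDbScore.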


Nutch Hadoop 0.20 - AlreadyBeingCreatedException

2009-12-17 Thread Eran Zinman
Hi,

I'm getting a Nutch/Hadoop AlreadyBeingCreatedException on some of the
Nutch parser reduce tasks.

I know this is a known issue with Nutch (
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717058#action_12717058
)

As far as I can see, that patch wasn't committed yet because we wanted to
examine it on the new Hadoop 0.20 version. I am using the latest Nutch with
Hadoop 0.20 and I can confirm this exception still occurs (rarely, but it
does) - maybe we should commit the change?

Thanks,
Eran


Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi,

Sorry to bother you guys again, but it seems that no matter what I do I
can't run the new version of Nutch with Hadoop 0.20.

I am getting the following exceptions in my logs when I execute
bin/start-all.sh

I don't know what to do! I've tried all kinds of stuff but with no luck... :(

*hadoop-eran-jobtracker-master.log*
2009-12-09 12:04:53,965 FATAL mapred.JobTracker -
java.lang.SecurityException: sealing violation: can't seal package
org.mortbay.util: already loaded
at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1610)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:180)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:172)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3699)

*hadoop-eran-namenode-master.log*
2009-12-09 12:04:27,583 ERROR namenode.NameNode -
java.lang.SecurityException: sealing violation: can't seal package
org.mortbay.util: already loaded
at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

Thanks for trying to help,
Eran

On Sun, Dec 6, 2009 at 3:51 PM, Eran Zinman zze...@gmail.com wrote:

 Hi,

 Just upgraded to the latest version of Nutch with Hadoop 0.20.

 I'm getting the following exception in the namenode log and DFS doesn't
 start:

 2009-12-06 15:48:32,523 ERROR namenode.NameNode -
 java.lang.SecurityException: sealing violation: can't seal package
 org.mortbay.util: already loaded
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
 at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

 Any help will be appreciated ... quite stuck with this.

 Thanks,
 Eran



Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi Andrzej,

Thanks for your help (as always).

Still getting the same exception when running on a standalone Hadoop cluster.
Getting the same exceptions as before - also in the datanode log I'm getting:

2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call
to 10.0.0.2:9000 failed on local exception: java.io.IOException: Connection
reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
at sun.nio.ch.IOUtil.read(IOUtil.java:206)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
at
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

Thanks,
Eran

On Wed, Dec 9, 2009 at 12:12 PM, Andrzej Bialecki a...@getopt.org wrote:

 Eran Zinman wrote:

 Hi,

 Sorry to bother you guys again, but it seems that no matter what I do I
 can't run the new version of Nutch with Hadoop 0.20.

 I am getting the following exceptions in my logs when I execute
 bin/start-all.sh


 Do you use the scripts in place, i.e. without deploying the nutch*.job to a
 separate Hadoop cluster? Could you please try it with a standalone Hadoop
 cluster (even if it's a pseudo-distributed, i.e. single node)?


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi,

Status of running the new Nutch version:

1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode).
2. Nutch doesn't work when I set it up to work with Hadoop, in either a
single-node or cluster setup.

*I'm getting an exception: *
ERROR namenode.NameNode - java.lang.SecurityException: sealing violation:
can't seal package org.mortbay.util: already loaded

I thought it might be a good idea to attach my Hadoop conf files, so
here they are:

*core-site.xml*
<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://10.0.0.2:9000/</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>
</configuration>

*mapred-site.xml*
<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>10.0.0.2:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/my_crawler/filesystem/mapreduce/system</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/my_crawler/filesystem/mapreduce/local</value>
</property>
</configuration>

*hdfs-site.xml*
<configuration>
<property>
  <name>dfs.name.dir</name>
  <value>/my_crawler/filesystem/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/my_crawler/filesystem/data</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
</configuration>

Thanks,
Eran

On Wed, Dec 9, 2009 at 12:22 PM, Eran Zinman zze...@gmail.com wrote:

 Hi Andrzej,

 Thanks for your help (as always).

 Still getting same exception when running on standalone Hadoop cluster.
 Getting same exceptions as before -  also in the datanode log I'm getting:

 2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call
 to 10.0.0.2:9000 failed on local exception: java.io.IOException:
 Connection reset by peer
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
 at org.apache.hadoop.ipc.Client.call(Client.java:742)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
 at $Proxy4.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
 at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
 at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 Caused by: java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcher.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
 at sun.nio.ch.IOUtil.read(IOUtil.java:206)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
 at
 org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
 at
 org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
 at
 org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
 at java.io.FilterInputStream.read(FilterInputStream.java:116)
 at
 org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
 at java.io.DataInputStream.readInt(DataInputStream.java:370)
 at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

 Thanks,
 Eran


 On Wed, Dec 9, 2009 at 12:12 PM, Andrzej Bialecki a...@getopt.org wrote:

 Eran Zinman wrote:

 Hi,

 Sorry to bother you guys again, but it seems that no matter what I do I
 can't run the new version of Nutch with Hadoop 0.20.

 I am getting the following exceptions in my logs when I execute
 bin/start-all.sh


 Do you use the scripts in place, i.e. without deploying the nutch*.job to
 a separate Hadoop cluster? Could you please try it with a standalone Hadoop
 cluster (even if it's a pseudo-distributed, i.e. single node)?


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi Dennis,

1) I initially tried to run on my existing DFS and it didn't work. I then
made a backup of my DFS, performed a format, and it still didn't work...

2) I'm using:

java version 1.6.0_0
OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12)
OpenJDK Client VM (build 14.0-b08, mixed mode, sharing)

3) My environment variables:

ORBIT_SOCKETDIR=/tmp/orbit-eran
SSH_AGENT_PID=3533
GPG_AGENT_INFO=/tmp/seahorse-Gq6lRI/S.gpg-agent:3557:1
TERM=xterm
SHELL=/bin/bash
XDG_SESSION_COOKIE=1a02c2275727547fa7209ad54a91276c-1260199857.905267-2000911890
GTK_RC_FILES=/etc/gtk/gtkrc:/home/eran/.gtkrc-1.2-gnome2
WINDOWID=54653392
GTK_MODULES=canberra-gtk-module
USER=eran
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:
GNOME_KEYRING_SOCKET=/tmp/keyring-0Vt0yu/socket
SSH_AUTH_SOCK=/tmp/keyring-0Vt0yu/socket.ssh
SESSION_MANAGER=local/eran:/tmp/.ICE-unix/3387
USERNAME=eran
DESKTOP_SESSION=default
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
GDM_XSERVER_LOCATION=local
PWD=/home/eran
JAVA_HOME=/usr/lib/jvm/default-java/
LANG=en_US.UTF-8
GDM_LANG=en_US.UTF-8
GDMSESSION=default
HISTCONTROL=ignoreboth
SHLVL=1
HOME=/home/eran
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
LOGNAME=eran
XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/usr/share/gdm/
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E4IJ0hMrD8,guid=c3caaf3e590c65a58904ca7f4b1d1fb3
LESSOPEN=| /usr/bin/lesspipe %s
WINDOWPATH=7
DISPLAY=:0.0
LESSCLOSE=/usr/bin/lesspipe %s %s
XAUTHORITY=/home/eran/.Xauthority
COLORTERM=gnome-terminal
_=/usr/bin/printenv

Thanks,
Eran


On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote:

 1) Is this a new or existing Hadoop cluster?
 2) What Java version are you using and what is your environment?

 Dennis


 Eran Zinman wrote:

 Hi,

 Running new Nutch version status:

 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal
 mode).
 2. Nutch doesn't work when I setup it to work with Hadoop either in a
 single
 or cluster setup.

 *I'm getting an exception: *
 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation:
 can't seal package org.mortbay.util: already loaded

 I thought it might be a good idea that I'll attach my Hadoop conf files,
 so
 here they are:

 *core-site.xml*
 <configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://10.0.0.2:9000/</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
 </property>
 </configuration>

 *mapred-site.xml*
 <configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>10.0.0.2:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
 </property>

 <property>
  <name>mapred.system.dir</name>
  <value>/my_crawler/filesystem/mapreduce/system</value>
 </property>

 <property>
  <name>mapred.local.dir</name>
  <value>/my_crawler/filesystem/mapreduce/local</value>
 </property>
 </configuration>

 *hdfs-site.xml*
 <configuration>
 <property>
  <name>dfs.name.dir</name>
  <value>/my_crawler/filesystem/name</value>
 </property>

 <property>
  <name>dfs.data.dir</name>
  <value>/my_crawler/filesystem/data</value>
 </property>

 <property>
  <name>dfs.replication</name>
  <value>2</value>
 </property>
 </configuration>

 Thanks,
 Eran

 On Wed, Dec 9, 2009 at 12:22 PM, Eran Zinman zze...@gmail.com wrote:

  Hi Andrzej,

 Thanks for your help (as always).

 Still getting same exception when running on standalone Hadoop cluster.
 Getting same exceptions as before -  also in the datanode log I'm
 getting:

 2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException:
 Call
 to 10.0.0.2:9000 failed on local exception: java.io.IOException:
 Connection reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi Dennis,

Thanks for trying to help.

I don't know what "fresh install" means exactly.

Here is what I've done:
1) Downloaded the latest version of Nutch from the SVN to a new folder.
2) Copied all the custom plugins I've written to the new folder.
3) Edited all configuration files.
4) Executed "ant package".
5) Ran the new Nutch... and got this error.

What did I miss?

Thanks,
Eran

On Wed, Dec 9, 2009 at 3:36 PM, Dennis Kubes ku...@apache.org wrote:

 Did you do a fresh install of Nutch with Hadoop 0.20, or did you just copy
 over the new jars?  The sealing violation comes from multiple copies of the
 same jars being loaded, and the Jetty versions changed between Hadoop 0.19 and 0.20.

 Dennis


 Eran Zinman wrote:

 Hi Dennis,

 1) I've initially tried to run on my existing DFS and it didn't work. I
 then
 made a backup of my DFS and performed a format and it still didn't work...

 2) I'm using:

 java version 1.6.0_0
 OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12)
 OpenJDK Client VM (build 14.0-b08, mixed mode, sharing)

 3) My environment variables:

 ORBIT_SOCKETDIR=/tmp/orbit-eran
 SSH_AGENT_PID=3533
 GPG_AGENT_INFO=/tmp/seahorse-Gq6lRI/S.gpg-agent:3557:1
 TERM=xterm
 SHELL=/bin/bash

 XDG_SESSION_COOKIE=1a02c2275727547fa7209ad54a91276c-1260199857.905267-2000911890
 GTK_RC_FILES=/etc/gtk/gtkrc:/home/eran/.gtkrc-1.2-gnome2
 WINDOWID=54653392
 GTK_MODULES=canberra-gtk-module
 USER=eran

 LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=0


 0;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:

 GNOME_KEYRING_SOCKET=/tmp/keyring-0Vt0yu/socket
 SSH_AUTH_SOCK=/tmp/keyring-0Vt0yu/socket.ssh
 SESSION_MANAGER=local/eran:/tmp/.ICE-unix/3387
 USERNAME=eran
 DESKTOP_SESSION=default

 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
 GDM_XSERVER_LOCATION=local
 PWD=/home/eran
 JAVA_HOME=/usr/lib/jvm/default-java/
 LANG=en_US.UTF-8
 GDM_LANG=en_US.UTF-8
 GDMSESSION=default
 HISTCONTROL=ignoreboth
 SHLVL=1
 HOME=/home/eran
 GNOME_DESKTOP_SESSION_ID=this-is-deprecated
 LOGNAME=eran
 XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/usr/share/gdm/

 DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E4IJ0hMrD8,guid=c3caaf3e590c65a58904ca7f4b1d1fb3
 LESSOPEN=| /usr/bin/lesspipe %s
 WINDOWPATH=7
 DISPLAY=:0.0
 LESSCLOSE=/usr/bin/lesspipe %s %s
 XAUTHORITY=/home/eran/.Xauthority
 COLORTERM=gnome-terminal
 _=/usr/bin/printenv

 Thanks,
 Eran


 On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote:

  1) Is this a new or existing Hadoop cluster?
 2) What Java version are you using and what is your environment?

 Dennis


 Eran Zinman wrote:

  Hi,

 Running new Nutch version status:

 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal
 mode).
 2. Nutch doesn't work when I setup it to work with Hadoop either in a
 single
 or cluster setup.

 *I'm getting an exception: *
 ERROR namenode.NameNode - java.lang.SecurityException: sealing
 violation:
 can't seal package org.mortbay.util: already loaded

 I thought it might be a good idea that I'll attach my Hadoop conf files,
 so
 here they are:

 *core-site.xml*
 <configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://10.0.0.2:9000/</value>
  <description>
   The name of the default file system. Either the literal string
   "local" or a host:port for NDFS.
  </description>
 </property>
 </configuration>

 *mapred-site.xml*
 <configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>10.0.0.2:9001</value>
  <description>
   The host and port that the MapReduce job tracker runs at. If
   "local", then jobs are run in-process as a single map and
   reduce task.
  </description>
 </property>

 <property>
  <name>mapred.system.dir</name>
  <value>/my_crawler/filesystem/mapreduce/system</value>
 </property>

 <property>
  <name>mapred.local.dir</name>
  <value>/my_crawler/filesystem/mapreduce/local</value>
 </property>
 </configuration>

 *hdfs-site.xml*
 <configuration>
 <property>
  <name>dfs.name.dir</name>
  <value>/my_crawler/filesystem/name</value>
 </property>

 <property>
  <name>dfs.data.dir</name>
  <value>/my_crawler/filesystem

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi all,

Thanks Dennis - you helped me solve the problem.

The problem was that I had two versions of Jetty in my lib folder.

I deleted the old version and voilà - it works.

The problem is that both versions exist in the SVN! Although I took a fresh
copy from the SVN, I had both versions in my lib folder. I think we need to
remove the old version from the SVN so people like me won't get confused ...

Thanks !
Eran.

On Wed, Dec 9, 2009 at 4:10 PM, Eran Zinman zze...@gmail.com wrote:

 Hi Dennis,

 Thanks for trying to help.

 I don't know what fresh install means exactly.

 Here is what I've done:
 1) Downloaded latest version of Nutch from the SVN to a new folder.
 2) Copied all the custom plugins I've written to the new folder
 3) Edited all configuration files.
 4) Executed ant package.
 5) Run the new Nutch... and got this error.

 What did I miss?

 Thanks,
 Eran


 On Wed, Dec 9, 2009 at 3:36 PM, Dennis Kubes ku...@apache.org wrote:

 Did you do a fresh install of Nutch with Hadoop 0.20, or did you just copy
 over the new jars?  The sealing violation comes from multiple copies of the
 same jars being loaded, and the Jetty versions changed between Hadoop 0.19 and 0.20.

 Dennis


 Eran Zinman wrote:

 Hi Dennis,

 1) I've initially tried to run on my existing DFS and it didn't work. I
 then
 made a backup of my DFS and performed a format and it still didn't
 work...

 2) I'm using:

 java version 1.6.0_0
 OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12)
 OpenJDK Client VM (build 14.0-b08, mixed mode, sharing)

 3) My environment variables:

 ORBIT_SOCKETDIR=/tmp/orbit-eran
 SSH_AGENT_PID=3533
 GPG_AGENT_INFO=/tmp/seahorse-Gq6lRI/S.gpg-agent:3557:1
 TERM=xterm
 SHELL=/bin/bash

 XDG_SESSION_COOKIE=1a02c2275727547fa7209ad54a91276c-1260199857.905267-2000911890
 GTK_RC_FILES=/etc/gtk/gtkrc:/home/eran/.gtkrc-1.2-gnome2
 WINDOWID=54653392
 GTK_MODULES=canberra-gtk-module
 USER=eran

 LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=0


 0;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:

 GNOME_KEYRING_SOCKET=/tmp/keyring-0Vt0yu/socket
 SSH_AUTH_SOCK=/tmp/keyring-0Vt0yu/socket.ssh
 SESSION_MANAGER=local/eran:/tmp/.ICE-unix/3387
 USERNAME=eran
 DESKTOP_SESSION=default

 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
 GDM_XSERVER_LOCATION=local
 PWD=/home/eran
 JAVA_HOME=/usr/lib/jvm/default-java/
 LANG=en_US.UTF-8
 GDM_LANG=en_US.UTF-8
 GDMSESSION=default
 HISTCONTROL=ignoreboth
 SHLVL=1
 HOME=/home/eran
 GNOME_DESKTOP_SESSION_ID=this-is-deprecated
 LOGNAME=eran
 XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/usr/share/gdm/

 DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E4IJ0hMrD8,guid=c3caaf3e590c65a58904ca7f4b1d1fb3
 LESSOPEN=| /usr/bin/lesspipe %s
 WINDOWPATH=7
 DISPLAY=:0.0
 LESSCLOSE=/usr/bin/lesspipe %s %s
 XAUTHORITY=/home/eran/.Xauthority
 COLORTERM=gnome-terminal
 _=/usr/bin/printenv

 Thanks,
 Eran


 On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote:

  1) Is this a new or existing Hadoop cluster?
 2) What Java version are you using and what is your environment?

 Dennis


 Eran Zinman wrote:

  Hi,

 Running new Nutch version status:

 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal
 mode).
 2. Nutch doesn't work when I setup it to work with Hadoop either in a
 single
 or cluster setup.

 *I'm getting an exception: *
 ERROR namenode.NameNode - java.lang.SecurityException: sealing
 violation:
 can't seal package org.mortbay.util: already loaded

 I thought it might be a good idea that I'll attach my Hadoop conf
 files,
 so
 here they are:

 *core-site.xml*
 <configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://10.0.0.2:9000/</value>
  <description>
   The name of the default file system. Either the literal string
   "local" or a host:port for NDFS.
  </description>
 </property>
 </configuration>

 *mapred-site.xml*
 <configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>10.0.0.2:9001</value>
  <description>
   The host and port that the MapReduce job tracker runs

Nutch Hadoop 0.20 - Exception

2009-12-06 Thread Eran Zinman
Hi,

Just upgraded to the latest version of Nutch with Hadoop 0.20.

I'm getting the following exception in the namenode log and DFS doesn't
start:

2009-12-06 15:48:32,523 ERROR namenode.NameNode -
java.lang.SecurityException: sealing violation: can't seal package
org.mortbay.util: already loaded
at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

Any help will be appreciated ... quite stuck with this.

Thanks,
Eran


Nutch - create my own repository

2009-12-05 Thread Eran Zinman
Hi,

I'm developing my own set of tools, plugins and some minor code changes to
Nutch.

I still want to get updates from the main Nutch repository, but I would
like to keep my own SVN for tracking my local code changes.

I'm using plain command-line SVN (I have no experience with Git) to track my
changes.

My question is - can I create a branch from the main repository into my own
repository, which will only track my changes and keep getting updates from
the Nutch main repository with easy merges?

Thanks,
Eran


Re: Efficient focused crawling

2009-11-28 Thread Eran Zinman
Thanks for your help MilleBii!

I will definitely try the square-root option - but does it only apply to
outlinks, or does it also affect pages linking to the page?

Did you try implementing automatic regex generation? I'm doing focused
crawling but I'm also thinking about scaling it in the future.

Also I will be happy to know if anyone else has any other suggestion (or
an already implemented strategy) - I think this issue affects most of the Nutch
community - at least people that use Nutch for focused crawling.

Thanks,
Eran

On Fri, Nov 27, 2009 at 8:29 PM, MilleBii mille...@gmail.com wrote:

 Well, what I have created for my own application is a topical-scoring plugin:

 1. First I needed to score the pages after parsing, based on my regular
 expression.

 2. Then I searched several options on how to boost the score of those pages... I
 have only found a way to boost the score of the outlinks of the pages that
 have content which I wanted. Not perfect, but so be it; there is a high
 likelihood in my case that adjacent pages also have content which I want.

 3. Then, how to boost the score... this took me a while to figure out, and I'll
 spare you all the options I tried. The good compromise I found is the
 following:
   if the page has content I want and score < 1.0f, then score =
 squareroot(score)... in this way you are adding weight to the pages which
 have the content you are looking for (since score is usually below 1,
 squareroot(x) is bigger than x).

 Of course there are some downsides to that approach: it is more difficult
 to get the crawler to go outside sites that have the content you are looking
 for; it is a bit like digging a hole, and until you have finished the hole it
 will keep the crawler exploring it... experimentally I have found that it works
 nicely for me though; if you limit the number of URLs per site it won't spend
 its life on them.

 We could try to generalize this plug-in by putting the regular expression
 as a config item, because that is really the only thing which is specific to
 my application, I believe.
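A compact sketch of the score adjustment described in point 3 above, with the
relevance test reduced to a placeholder regex; the class and method names are
illustrative only and not part of any existing Nutch plugin:

// Sketch of the square-root boost described above; the topic regex is a
// stand-in for whatever relevance test the plugin actually applies.
import java.util.regex.Pattern;

public class TopicalBoost {

  // Compiled once, statically, to avoid rebuilding the regex for every page.
  private static final Pattern TOPIC =
      Pattern.compile("games?", Pattern.CASE_INSENSITIVE);

  /** Return the boosted score: sqrt pulls sub-1.0 scores upward. */
  public static float boost(String pageText, float score) {
    if (pageText != null && TOPIC.matcher(pageText).find() && score < 1.0f) {
      return (float) Math.sqrt(score);
    }
    return score;
  }
}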



 2009/11/27 Eran Zinman zze...@gmail.com

  Hi all,
 
  I'm try to figure out ways to improve Nutch focused crawling efficiency.
 
  I'm looking for certain pages inside each domain which contains content
 I'm
  looking for.
 
  I'm unable to know that a certain URL contains what I'm looking for
 unless
  I
  parse it and do some analysis on it.
 
  Basically I was thinking about two methods to improve crawling
 efficiency:
 
  1) Whenever a page is found which contains the data I'm looking for,
  improve
  overall score for all pages linking to it (and pages linking to them and
 so
  on...), assuming they have other links that point to content I'm looking
  for.
  2) Once I already found several pages that contain relevant data - create
 a
  Regex automatically to match new urls which might contain usable content.
 
  I've started to read about the OPIC-score plugin but was unable to
  understand if it can help me or not with issue no. 1.
 
  Any idea guys? I will be very grateful for any help or things that can
  point
  me in the right direction.
 
  Thanks,
  Eran
 



 --
 -MilleBii-



Re: Efficient focused crawling

2009-11-28 Thread Eran Zinman
Hi MilleBii,

I think you misinterpreted what I meant.

1. Regarding regexes - I know I can build a regex beforehand to identify URLs,
but I would have to create one manually for each domain I'm crawling - not
scalable. I'm looking for a way to build regexes automatically, using machine
learning. I can only tell whether a certain page contains the content I'm
looking for after I parse it. I want my crawler to create regex patterns
automatically based on its crawling experience.

2. I want to boost inlinks not necessarily to crawl them again, but to crawl,
with higher priority, the other links they point to, under the assumption that
those links might contain the content I'm looking for.

Thanks for your help!

Eran



On Sat, Nov 28, 2009 at 10:56 AM, MilleBii mille...@gmail.com wrote:

 oops : why it shouldn't work for others.

 2009/11/28 MilleBii mille...@gmail.com

  I just use the Java built-in regex features... and therefore just supplied
  the string, which I design for my case using RegexBuddy, a really great tool
  by the way.
 
  Pay attention though to static creation, in order to avoid regex compilation
  on each plug-in load and the run-time hit.
 
  Didn't find a way to modify inlinks... on the other hand, you have already
  gone through the inlinks by the time you are evaluating a given page, so I did
  not bother, and it works fine for me; I don't see why it should work for
  others.
 
 
  2009/11/28 Eran Zinman zze...@gmail.com
 
  Thanks for your help MillBii!
 
  I will definitely try the squareroot option - but is that only valid for
  outlinks or also affects pages linking to the page?
 
  Did you try implementing automatic Regex generation? I'm doing focused
  crawling but I'm also thinking about scaling it in the future.
 
  Also I will be happy to know if anyone else have any other suggestion
 (or
  already implemented strategy) - I think this issue affects most of the
  Nutch
  community - at least people that use Nutch for focused crawling.
 
  Thanks,
  Eran
 
  On Fri, Nov 27, 2009 at 8:29 PM, MilleBii mille...@gmail.com wrote:
 
   Well  I have created for my own application is topical-scoring plugin
 :
  
   1.  first I needed to score the pages after parsing based on my
 regular
   expression
  
   2. then I searched several options on to how boost score of that
  pages... I
   have only found a way to boost the score of the outlinks of these
 pages
   that
   have content which I wanted. Not perfect but so be it there is a high
   likelyhood in my case that adjacent pages have also content which I
  want.
  
   3. then how to boost the score... this took me a while to figure out,
 I
   leave you all the options I tried. The good comprise I found is the
   following:
 if the page has content I want and score  1.0f than score=
   squareroot(score)... in this way you are adding weight to the pages
  which
   have content you are looking  (since score is usually below 1.
   squareroot(x)
   is bigger than x).
  
   Of course there are some down side to that approach, it is more
  difficult
   to
   get the crawler to go outsides sites that have content your are
 looking
   for,
   it is a bit like digging a hole and until you have finished the hole
 it
   will
   get the crawler to explore it... experimentally I have found that it
  works
   nicely for me though, if you limit the nbre of URLS per site it won't
  spend
   it's life on them.
  
   We could try to generalize this plug-in by putting the regular
  expression
   as
   as config item because that is really the only thing which is specific
  to
   my
   application I believe.
  
  
  
   2009/11/27 Eran Zinman zze...@gmail.com
  
Hi all,
   
I'm try to figure out ways to improve Nutch focused crawling
  efficiency.
   
I'm looking for certain pages inside each domain which contains
  content
   I'm
looking for.
   
I'm unable to know that a certain URL contains what I'm looking for
   unless
I
parse it and do some analysis on it.
   
Basically I was thinking about two methods to improve crawling
   efficiency:
   
1) Whenever a page is found which contains the data I'm looking for,
improve
overall score for all pages linking to it (and pages linking to them
  and
   so
on...), assuming they have other links that point to content I'm
  looking
for.
2) Once I already found several pages that contain relevant data -
  create
   a
Regex automatically to match new urls which might contain usable
  content.
   
I've started to read about the OPIC-score plugin but was unable to
understand if it can help me or not with issue no. 1.
   
Any idea guys? I will be very grateful for any help or things that
 can
point
me in the right direction.
   
Thanks,
Eran
   
  
  
  
   --
   -MilleBii-
  
 
 
 
 
  --
  -MilleBii-
 



 --
 -MilleBii-



Efficient focused crawling

2009-11-27 Thread Eran Zinman
Hi all,

I'm trying to figure out ways to improve Nutch focused-crawling efficiency.

I'm looking for certain pages inside each domain that contain content I'm
looking for.

I'm unable to know whether a certain URL contains what I'm looking for unless I
parse it and do some analysis on it.

Basically I was thinking about two methods to improve crawling efficiency:

1) Whenever a page is found which contains the data I'm looking for, improve
the overall score of all pages linking to it (and pages linking to them and so
on...), assuming they have other links that point to content I'm looking
for.
2) Once I have already found several pages that contain relevant data - create a
regex automatically to match new URLs which might contain usable content.

I've started to read about the OPIC scoring plugin but was unable to
understand whether it can help me with issue no. 1.

Any idea guys? I will be very grateful for any help or things that can point
me in the right direction.

Thanks,
Eran
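As a rough illustration of idea (2), the sketch below derives a URL pattern
from pages already known to be relevant by taking their longest common prefix;
real regex induction via machine learning would be far more involved, and all
names here are hypothetical:

// Deliberately naive URL-pattern learner: take the longest common prefix of
// known-relevant URLs and treat anything under it as a candidate.
import java.util.List;
import java.util.regex.Pattern;

public class NaiveUrlPatternLearner {

  /** Build a pattern matching URLs that share the examples' common prefix. */
  public static Pattern learn(List<String> relevantUrls) {
    String prefix = relevantUrls.get(0);          // assumes at least one example
    for (String url : relevantUrls) {
      int i = 0;
      while (i < prefix.length() && i < url.length()
          && prefix.charAt(i) == url.charAt(i)) {
        i++;
      }
      prefix = prefix.substring(0, i);
    }
    return Pattern.compile(Pattern.quote(prefix) + ".*");
  }

  public static boolean looksRelevant(Pattern learned, String candidateUrl) {
    return learned.matcher(candidateUrl).matches();
  }
}

For example, learning from http://www.example.com/music/a and
http://www.example.com/music/b yields a pattern that flags other URLs under
/music/ as likely to contain usable content.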


Re: Nutch - Focused crawling

2009-11-23 Thread Eran Zinman
Thanks Julien,

I can confirm this patch works perfectly and does a good job of keeping a
good crawl rate.

We have doubled the rate of information retrieval by using a time limit on
the fetch queue.

Thanks,
Eran

On Mon, Nov 23, 2009 at 1:28 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi guys,

 I've separated both functionalities into separate patches on JIRA
 (NUTCH-769
 / NUTCH-770).

 Julien
 --
 DigitalPebble Ltd
 http://www.digitalpebble.com

 2009/11/21 Julien Nioche lists.digitalpeb...@gmail.com

  Hi Eran,
 
  There is currently no time limit implemented in the Fetcher. We
 implemented
  one which worked quite well in combination with another mechanism which
  clears the URLs from a pool if more than x successive exceptions have
 been
  encountered. This limits cases where a site or domain is not responsive.
 
  I might try and submit a patch if I find the time next week, our code has
  been heavily modified with the previous patches which have not been
  committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658) so I'd
 need
  to spend a bit of time extracting this specific functionality from the
 rest.
 
  Best,
 
  Julien
  --
  DigitalPebble Ltd
  http://www.digitalpebble.com
 
 
  2009/11/21 Eran Zinman zze...@gmail.com
 
  Hi,
 
  We've been using Nutch for focused crawling (right now we are crawling
  about
  50 domains).
 
  We've encountered the long-tail problem - We've set TopN to 100,000 and
  generate.max.per.host to about 1500.
 
  90% of all domains finish fetching after 30min, and the other 10% takes
 an
  additional 2.5 hours - making the slowest domain the bottleneck of the
  entire fetch process.
 
  I've read Ken Krugler document and he's describing the same problem:
 
 
 http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
 
  I'm wondering - does anyone have a suggestion on what's the best way to
  tackle this issue?
 
  I think that Ken suggested to limit the fetch time - for example say
  terminate after 1 hour, even if you are not done yet, is that feature
  available in Nutch?
 
  I will be happy to try and contribute code if required!
 
  Thanks,
  Eran
 
 
 



Nutch - Focused crawling

2009-11-21 Thread Eran Zinman
Hi,

We've been using Nutch for focused crawling (right now we are crawling about
50 domains).

We've encountered the long-tail problem - We've set TopN to 100,000 and
generate.max.per.host to about 1500.

90% of all domains finish fetching after 30 min, and the other 10% take an
additional 2.5 hours - making the slowest domain the bottleneck of the
entire fetch process.

I've read Ken Krugler's document and he describes the same problem:
http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/

I'm wondering - does anyone have a suggestion on what's the best way to
tackle this issue?

I think Ken suggested limiting the fetch time - for example, terminate after
1 hour even if you are not done yet. Is that feature available in Nutch?

I will be happy to try and contribute code if required!

Thanks,
Eran
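For illustration, the time-limit idea discussed here boils down to checking a
wall-clock deadline inside the fetch loop and skipping whatever is left; the
sketch below is generic pseudostructure under that assumption, not Nutch's
actual Fetcher code:

// Generic illustration of a fetch time limit: stop handing out fetch items
// once a deadline has passed; URLs not fetched remain eligible for a later
// generate/fetch cycle.
import java.util.Queue;

public class TimeLimitedFetchLoop {

  public static void run(Queue<String> fetchQueue, long limitMinutes) {
    long deadline = System.currentTimeMillis() + limitMinutes * 60L * 1000L;
    while (!fetchQueue.isEmpty()) {
      if (System.currentTimeMillis() > deadline) {
        System.out.println("Time limit hit, skipping "
            + fetchQueue.size() + " remaining URLs");
        break;
      }
      fetch(fetchQueue.poll());
    }
  }

  private static void fetch(String url) {
    // real fetching omitted
  }
}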


Re: Nutch Hadoop question

2009-11-13 Thread Eran Zinman
Hi All,

Don't want to bother you guys too much... I've tried searching for this
topic and doing some testing myself, but so far I've been quite unsuccessful.

Basically - I wish to use some computers only for MapReduce processing and
not for HDFS. Does anyone know how this can be done?

Thanks,
Eran

On Wed, Nov 11, 2009 at 12:19 PM, Eran Zinman zze...@gmail.com wrote:

 Hi All,

 I'm using Nutch with Hadoop with great pleasure - working great and really
 increase crawling performance on multiple machines.

 I have two strong machines and two older machines which I would like to
 use.

 So far I've been using only the two strong machines with Hadoop.

 Now I would like to add the two less powerful machines to do some
 processing as well.

 My question is - Right now the HDFS is shared between the two powerful
 computers. I don't want the two other computer to store any content on them
 as they have a slow and unreliable harddisk. I just want the two other
 machines to do processing (i.e. mapreduce) and not store any content on
 them.

 Is that possible - or do I have to use HDFS on all machines that do
 processing?

 If it's possible to use a machine only for mapreduce - how this is done?

 Thank you for your help,
 Eran



Re: Nutch Hadoop question

2009-11-13 Thread Eran Zinman
Thanks for the help guys.

On Fri, Nov 13, 2009 at 5:20 PM, Andrzej Bialecki a...@getopt.org wrote:

 TuxRacer69 wrote:

 Hi Eran,

 MapReduce has to store its data on the HDFS file system.


 More specifically, it needs read/write access to a shared filesystem. If
 you are brave enough you can use NFS, too, or any other type of filesystem
 that can be mounted locally on each node (e.g. a NetApp).


 But if you want to separate the two groups of servers, you could build two
 separate HDFS filesystems. To separate the two setups, you will need to make
 sure there is no cross communication between the two parts,


 You can run two separate clusters even on the same set of machines, just
  configure them to use different ports AND different local paths.


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




Nutch Hadoop question

2009-11-11 Thread Eran Zinman
Hi All,

I'm using Nutch with Hadoop with great pleasure - it works great and really
increases crawling performance on multiple machines.

I have two strong machines and two older machines which I would like to use.

So far I've been using only the two strong machines with Hadoop.

Now I would like to add the two less powerful machines to do some processing
as well.

My question is - right now the HDFS is shared between the two powerful
computers. I don't want the two other computers to store any content,
as they have slow and unreliable hard disks. I just want the two other
machines to do processing (i.e. MapReduce) and not store any content on
them.

Is that possible - or do I have to use HDFS on all machines that do
processing?

If it's possible to use a machine only for mapreduce - how this is done?

Thank you for your help,
Eran


including code between plugins

2009-11-02 Thread Eran Zinman
Hi,

I've written my own plugin that does some custom parsing.

I needed language identification in that plugin, and the language-identifier
plugin is working great for my needs.

However, I can't use the language-identifier plugin as it is, since I want
to parse only a small portion of the webpage.

I've used the language-identifier functions and they worked great in Eclipse,
but when I try to compile my plugin I'm unable to, since it
depends on the language-identifier source code.

My question is - how can I include the language-identifier code in my plugin
code without actually using the language-identifier plugin?

Thank you for your help!

Thanks,
Eran


Re: including code between plugins

2009-11-02 Thread Eran Zinman
Hi Andrzej,

Thank you so much! That worked like a charm!

I've spent so much time trying to figure this out and you helped me solve it
in 5 min!

Thanks!
Eran



On Mon, Nov 2, 2009 at 11:13 AM, Andrzej Bialecki a...@getopt.org wrote:

 Eran Zinman wrote:

 Hi,

 I've written my own plugin that's doing some custom parsing.

 I've needed language parsing in that plugin and the language-identifier
 plugin is wokring great for my needs.

 However, I can't use the language identifier plugin as it is, since I want
 to parse only a small portion of the webpage.

 I've used the language identifier functions and it worked great in
 eclipse,
 but when I try to compile my plugin I'm unable to compile it since it
 depends on the language-identifier source code.

 My question is - how can I include the language identifier code in my
 plugin
 code without actually using the language-identifier plugin?


 You need to add the language-identifier plugin to the requires section in
 your plugin.xml, like this:

    <requires>
   <import plugin="nutch-extensionpoints"/>
   <import plugin="language-identifier"/>
    </requires>


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




Extract full urls from DOM

2009-10-29 Thread Eran Zinman
Hello everyone,

I've created a plugin for Nutch 1.0 that extends the parser.

This plugin extracts several kinds of information from the document DOM.

In some cases I need to extract the href of a certain link. The link in the
DOM is still relative, as it was originally written in the HTML document, so
for example it might be a link with an href of "/music".

My question is - how can I make this link an absolute URL - for example,
turn "/music" into "http://www.example.com/music"?

Thanks a lot,
Eran


Re: Extract full urls from DOM

2009-10-29 Thread Eran Zinman
Ken,

Thanks a lot! You solved my problem.

Thanks,
Eran

On Thu, Oct 29, 2009 at 2:35 PM, Ken Krugler kkrugler_li...@transpac.comwrote:


 On Oct 29, 2009, at 4:00am, Eran Zinman wrote:

  Hello everyone,

 I've created a plugin for Nutch 1.0 that extends the parser.

 This plugin extract several kinds of information from the document DOM.

 In some cases I need to extract an href of a certain link. The link in
 the
 DOM is still relative as it was originally written in the html document,
 so
 for example it might be a link with an href of /music.

 My question is - how can I make this link have an absolute url - for
 example
 make /music to http://www.example.com/music;?


 new URL(baseUrl, relativeString)

 will return the full URL, leaving aside a few minor edge cases.

 The baseUrl will be the URL of the containing document, or the value of the
 (potentially relative) location: response header field if it exists, or the
 value of the base tag in the head element, if that exists.

 -- Ken


 --
 Ken Krugler
 TransPac Software, Inc.
 http://www.transpac.com
 +1 530-210-6378




Re: Plug-ins during Nutch Crawl

2009-10-21 Thread Eran Zinman
Hi Shreekanth Prabhu,

I was facing the same problem.

You don't need to recrawl all the URLs from scratch.

Remember - you already have the segments on your hard disk. It really depends
on where you've done the date parsing. If you did it as part of the parser, you
can run the parser again.

If you did it just before indexing, your job is even easier - you can just
run the indexer again. Remember - each Nutch component can run as a
standalone application.

BTW - I'm really looking for a Java-based date parser myself that can parse
dates written by humans (or website owners) - is it possible for you to
share the date parser, or if you are using an open-source one, give me a
recommendation on one I can use?

Thanks,
Eran

On Wed, Oct 21, 2009 at 9:47 AM, sprabhu_PN 
shreekanth.pra...@pinakilabs.com wrote:



 We have added a few plug-ins such as date parsing plug-in that get
 exercised
 during a Nutch crawl and update a field in each index record. Now we find
 that we need to improve the plug-in and re-run it. Is the only option to
 crawl the whole index once again ? Is there any way we can do a recrawl
 which will just exercise newer versions of plug-ins and take less time to
 do
 it ?

 Thanks in advance.

 Regards
 Shreekanth Prabhu
 --
 View this message in context:
 http://www.nabble.com/Plug-ins-during-Nutch-Crawl-tp25987956p25987956.html
 Sent from the Nutch - User mailing list archive at Nabble.com.




Re: Combining parsed data from two sources before indexing

2009-09-08 Thread Eran Zinman
Hi,

I'm also quite interested in this feature.

I want to combine information from two different pages and I don't know
which one will be downloaded first.

Only when both are downloaded do I want to process them.

Thanks,
Eran

On Wed, Sep 9, 2009 at 12:51 AM, Max S maximillian...@googlemail.comwrote:

 Hi all,

 How can I combine parsed data from two sources before indexing them? At the
 moment, the way I see it (correct me if I'm wrong), each page (fetched) is
 treated as a separate document. These documents are related only by their
 inlinks / outlinks.

 What if there is content that has been divided across a few web pages? How
 do I combine them before indexing?

 Regards
 Max S
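One generic way to express the "process only when both pages have arrived"
idea from this thread is to buffer parsed parts keyed by a shared identifier
and emit a combined record once both halves are present. The sketch below is a
plain-Java illustration under that assumption; nothing in it is a Nutch API:

// Hypothetical combiner: buffers one half of a two-part document until the
// other half shows up, keyed by some identifier both pages share.
import java.util.HashMap;
import java.util.Map;

public class TwoPartCombiner {

  public static final class Combined {
    public final String partA;
    public final String partB;
    Combined(String a, String b) { this.partA = a; this.partB = b; }
  }

  private final Map<String, String> pendingA = new HashMap<String, String>();
  private final Map<String, String> pendingB = new HashMap<String, String>();

  /** Offer one parsed part; returns a combined record once both are known. */
  public Combined offer(String key, String content, boolean isPartA) {
    Map<String, String> mine = isPartA ? pendingA : pendingB;
    Map<String, String> other = isPartA ? pendingB : pendingA;
    String counterpart = other.remove(key);
    if (counterpart == null) {
      mine.put(key, content);       // wait for the other half
      return null;
    }
    return isPartA ? new Combined(content, counterpart)
                   : new Combined(counterpart, content);
  }
}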




DocumentFragment and XPath

2009-09-03 Thread Eran Zinman
Hi,

I've created a plugin on Nutch 1.0 that extends the HtmlParseFilter.

I wanted to extract some more information from the HTML document.

I've got all the parameters into the filter function, and then I wanted to
run some XPath queries on the DocumentFragment object.

I tried to do something simple like extracting all h1 tags, but no matter
what I do I always get 0 results.

What is the relation between DocumentFragment and XPath?

Is it even possible to use XPath on a DocumentFragment object?

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc)
  {
      Parse parse = parseResult.get(content.getUrl());
      Metadata metadata = parse.getData().getParseMeta();

      XPathFactory factory = XPathFactory.newInstance();
      XPath xpath = factory.newXPath();

      try
      {
          XPathExpression expr = xpath.compile("//h1");
          // Evaluate against the document fragment passed into the filter
          Object result = expr.evaluate(doc, XPathConstants.NODESET);

          NodeList nodes = (NodeList) result;

          System.out.println("Found " + nodes.getLength() + " matches!");

          for (int i = 0; i < nodes.getLength(); i++)
          {
              System.out.println(nodes.item(i).getNodeValue());
          }
      }
      catch (XPathExpressionException e)
      {
          System.out.println("Error: " + e);
      }

      return parseResult;
  }

Thanks,
Eran
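As a quick way to confirm that the XPath expression itself is fine, the
standalone snippet below builds a small well-formed document with the JDK
parser and evaluates //h1 against it, outside Nutch. If this succeeds while the
plugin still sees zero matches, the difference most likely lies in the DOM that
Nutch's HTML parser produces (for example element-name casing or namespaces,
both of which XPath matches exactly). The HTML string is illustrative only:

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSanityCheck {
  public static void main(String[] args) throws Exception {
    String html = "<html><body><h1>Hello</h1><h1>World</h1></body></html>";
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(html)));

    NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
        .evaluate("//h1", doc, XPathConstants.NODESET);

    System.out.println("Found " + nodes.getLength() + " matches!"); // 2 matches
    for (int i = 0; i < nodes.getLength(); i++) {
      System.out.println(nodes.item(i).getTextContent());
    }
  }
}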