Boost URLs to crawl by anchor text
Hi all, I've created a custom scoring filter plugin which implements the ScoringFilter interface. My main goal is that once a certain page is fetched and parsed, I wish to analyze its outlinks and decide which links to go to next. One of the criteria which helps me decide is the link's anchor text. For example, if a certain link from the current page has an anchor text that contains the word Games, I wish to boost it so it will be fetched on the next round. From what I've seen, the *updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)* function receives only the URL text and I have no access to the URL's anchor text - any idea how I can get the anchor text of a certain URL in the updateDbScore function? Thanks, Eran
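A possible way around this, sketched below under the assumption of the Nutch 1.0 ScoringFilter API: the anchor text of each outlink is available at parse time through ParseData.getOutlinks(), so the boost can be applied from distributeScoreToOutlinks() (which does receive the ParseData) rather than in updateDbScore(). The class name, helper method and boost factor here are illustrative only, not an actual Nutch plugin.

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;

public class AnchorBoostSketch {

  // Intended to be called from a ScoringFilter's distributeScoreToOutlinks(),
  // which receives the ParseData of the source page and the outlink targets.
  public void boostByAnchor(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets) {
    // Map outlink URL -> anchor text, taken from the parsed source page.
    Map<String, String> anchors = new HashMap<String, String>();
    for (Outlink outlink : parseData.getOutlinks()) {
      anchors.put(outlink.getToUrl(), outlink.getAnchor());
    }
    // Boost the CrawlDatum of every target whose anchor mentions "games".
    for (Entry<Text, CrawlDatum> target : targets) {
      String anchor = anchors.get(target.getKey().toString());
      if (anchor != null && anchor.toLowerCase().contains("games")) {
        CrawlDatum datum = target.getValue();
        datum.setScore(datum.getScore() * 2.0f); // arbitrary boost factor
      }
    }
  }
}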
Nutch Hadoop 0.20 - AlreadyBeingCreatedException
Hi, I'm getting a Nutch/Hadoop exception: AlreadyBeingCreatedException on some of the Nutch parser reduce tasks. I know this is a known issue with Nutch ( https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717058#action_12717058 ), and as far as I can see that patch wasn't committed yet because we wanted to examine it on the new Hadoop 0.20 version. I am using the latest Nutch with Hadoop 0.20 and I can confirm this exception still occurs (rarely - but it does) - maybe we should commit the change? Thanks, Eran
Re: Nutch Hadoop 0.20 - Exception
Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh I don't know what to do! I've tried all kind of stuff but with no luck... :( *hadoop-eran-jobtracker-master.log* 2009-12-09 12:04:53,965 FATAL mapred.JobTracker - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded at java.net.URLClassLoader.defineClass(URLClassLoader.java:235) at java.net.URLClassLoader.access$000(URLClassLoader.java:56) at java.net.URLClassLoader$1.run(URLClassLoader.java:195) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:1610) at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:180) at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:172) at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3699) *hadoop-eran-namenode-master.log* 2009-12-09 12:04:27,583 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded at java.net.URLClassLoader.defineClass(URLClassLoader.java:235) at java.net.URLClassLoader.access$000(URLClassLoader.java:56) at java.net.URLClassLoader$1.run(URLClassLoader.java:195) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) Thanks for trying to help, Eran On Sun, Dec 6, 2009 at 3:51 PM, Eran Zinman zze...@gmail.com wrote: Hi, Just upgraded to the latest version of Nutch with Hadoop 0.20. 
I'm getting the following exception in the namenode log and DFS doesn't start: 2009-12-06 15:48:32,523 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded at java.net.URLClassLoader.defineClass(URLClassLoader.java:235) at java.net.URLClassLoader.access$000(URLClassLoader.java:56) at java.net.URLClassLoader$1.run(URLClassLoader.java:195) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) Any help will be appreciated ... quite stuck with this. Thanks, Eran
Re: Nutch Hadoop 0.20 - Exception
Hi Andrzej, Thanks for your help (as always). Still getting same exception when running on standalone Hadoop cluster. Getting same exceptions as before - also in the datanode log I'm getting: 2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call to 10.0.0.2:9000 failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:774) at org.apache.hadoop.ipc.Client.call(Client.java:742) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233) at sun.nio.ch.IOUtil.read(IOUtil.java:206) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) Thanks, Eran On Wed, Dec 9, 2009 at 12:12 PM, Andrzej Bialecki a...@getopt.org wrote: Eran Zinman wrote: Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh Do you use the scripts in place, i.e. without deploying the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch Hadoop 0.20 - Exception
Hi, Status of running the new Nutch version: 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode). 2. Nutch doesn't work when I set it up to work with Hadoop, either in a single-node or cluster setup. *I'm getting an exception:* ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded I thought it might be a good idea to attach my Hadoop conf files, so here they are:

*core-site.xml*
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.0.0.2:9000/</value>
    <description>The name of the default file system. Either the literal string local or a host:port for NDFS.</description>
  </property>
</configuration>

*mapred-site.xml*
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.0.0.2:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/my_crawler/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/my_crawler/filesystem/mapreduce/local</value>
  </property>
</configuration>

*hdfs-site.xml*
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/my_crawler/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/my_crawler/filesystem/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Thanks, Eran On Wed, Dec 9, 2009 at 12:22 PM, Eran Zinman zze...@gmail.com wrote: Hi Andrzej, Thanks for your help (as always). Still getting same exception when running on standalone Hadoop cluster. Getting same exceptions as before - also in the datanode log I'm getting: 2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call to 10.0.0.2:9000 failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:774) at org.apache.hadoop.ipc.Client.call(Client.java:742) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233) at sun.nio.ch.IOUtil.read(IOUtil.java:206) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at
java.io.FilterInputStream.read(FilterInputStream.java:116) at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) Thanks, Eran On Wed, Dec 9, 2009 at 12:12 PM, Andrzej Bialecki a...@getopt.org wrote: Eran Zinman wrote: Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh Do you use the scripts in place, i.e. without deploying the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System
Re: Nutch Hadoop 0.20 - Exception
Hi Dennis, 1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format and it still didn't work... 2) I'm using: java version 1.6.0_0 OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12) OpenJDK Client VM (build 14.0-b08, mixed mode, sharing) 3) My environment variables: ORBIT_SOCKETDIR=/tmp/orbit-eran SSH_AGENT_PID=3533 GPG_AGENT_INFO=/tmp/seahorse-Gq6lRI/S.gpg-agent:3557:1 TERM=xterm SHELL=/bin/bash XDG_SESSION_COOKIE=1a02c2275727547fa7209ad54a91276c-1260199857.905267-2000911890 GTK_RC_FILES=/etc/gtk/gtkrc:/home/eran/.gtkrc-1.2-gnome2 WINDOWID=54653392 GTK_MODULES=canberra-gtk-module USER=eran LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36: GNOME_KEYRING_SOCKET=/tmp/keyring-0Vt0yu/socket SSH_AUTH_SOCK=/tmp/keyring-0Vt0yu/socket.ssh SESSION_MANAGER=local/eran:/tmp/.ICE-unix/3387 USERNAME=eran DESKTOP_SESSION=default PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games GDM_XSERVER_LOCATION=local PWD=/home/eran JAVA_HOME=/usr/lib/jvm/default-java/ LANG=en_US.UTF-8 GDM_LANG=en_US.UTF-8 GDMSESSION=default HISTCONTROL=ignoreboth SHLVL=1 HOME=/home/eran GNOME_DESKTOP_SESSION_ID=this-is-deprecated LOGNAME=eran XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/usr/share/gdm/ DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E4IJ0hMrD8,guid=c3caaf3e590c65a58904ca7f4b1d1fb3 LESSOPEN=| /usr/bin/lesspipe %s WINDOWPATH=7 DISPLAY=:0.0 LESSCLOSE=/usr/bin/lesspipe %s %s XAUTHORITY=/home/eran/.Xauthority COLORTERM=gnome-terminal _=/usr/bin/printenv Thanks, Eran On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote: 1) Is this a new or existing Hadoop cluster? 2) What Java version are you using and what is your environment? Dennis Eran Zinman wrote: Hi, Running new Nutch version status: 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode). 2. Nutch doesn't work when I setup it to work with Hadoop either in a single or cluster setup. *I'm getting an exception: * ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded I thought it might be a good idea that I'll attach my Hadoop conf files, so here they are: *core-site.xml* configuration property namefs.default.name/name valuehdfs://10.0.0.2:9000//value description The name of the default file system. Either the literal string local or a host:port for NDFS. 
/description /property /configuration *mapred-site.xml* configuration property namemapred.job.tracker/name value10.0.0.2:9001/value description The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task. /description /property property namemapred.system.dir/name value/my_crawler/filesystem/mapreduce/system/value /property property namemapred.local.dir/name value/my_crawler/filesystem/mapreduce/local/value /property /configuration *hdfs-site.xml* configuration property namedfs.name.dir/name value/my_crawler/filesystem/name/value /property property namedfs.data.dir/name value/my_crawler/filesystem/data/value /property property namedfs.replication/name value2/value /property /configuration Thanks, Eran On Wed, Dec 9, 2009 at 12:22 PM, Eran Zinman zze...@gmail.com wrote: Hi Andrzej, Thanks for your help (as always). Still getting same exception when running on standalone Hadoop cluster. Getting same exceptions as before - also in the datanode log I'm getting: 2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call to 10.0.0.2:9000 failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:774) at org.apache.hadoop.ipc.Client.call(Client.java:742) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy4
Re: Nutch Hadoop 0.20 - Exception
Hi Dennis, Thanks for trying to help. I don't know what fresh install means exactly. Here is what I've done: 1) Downloaded latest version of Nutch from the SVN to a new folder. 2) Copied all the custom plugins I've written to the new folder 3) Edited all configuration files. 4) Executed ant package. 5) Run the new Nutch... and got this error. What did I miss? Thanks, Eran On Wed, Dec 9, 2009 at 3:36 PM, Dennis Kubes ku...@apache.org wrote: Did you do a fresh install of Nutch with Hadoop 0.20 or did you just copy over the new jars? The sealing violation is multiple of the same jars being loaded and the Jetty versions changed between 0.19 and 0.20 for Hadoop? Dennis Eran Zinman wrote: Hi Dennis, 1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format and it still didn't work... 2) I'm using: java version 1.6.0_0 OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12) OpenJDK Client VM (build 14.0-b08, mixed mode, sharing) 3) My environment variables: ORBIT_SOCKETDIR=/tmp/orbit-eran SSH_AGENT_PID=3533 GPG_AGENT_INFO=/tmp/seahorse-Gq6lRI/S.gpg-agent:3557:1 TERM=xterm SHELL=/bin/bash XDG_SESSION_COOKIE=1a02c2275727547fa7209ad54a91276c-1260199857.905267-2000911890 GTK_RC_FILES=/etc/gtk/gtkrc:/home/eran/.gtkrc-1.2-gnome2 WINDOWID=54653392 GTK_MODULES=canberra-gtk-module USER=eran LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=0 0;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36: GNOME_KEYRING_SOCKET=/tmp/keyring-0Vt0yu/socket SSH_AUTH_SOCK=/tmp/keyring-0Vt0yu/socket.ssh SESSION_MANAGER=local/eran:/tmp/.ICE-unix/3387 USERNAME=eran DESKTOP_SESSION=default PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games GDM_XSERVER_LOCATION=local PWD=/home/eran JAVA_HOME=/usr/lib/jvm/default-java/ LANG=en_US.UTF-8 GDM_LANG=en_US.UTF-8 GDMSESSION=default HISTCONTROL=ignoreboth SHLVL=1 HOME=/home/eran GNOME_DESKTOP_SESSION_ID=this-is-deprecated LOGNAME=eran XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/usr/share/gdm/ DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E4IJ0hMrD8,guid=c3caaf3e590c65a58904ca7f4b1d1fb3 LESSOPEN=| /usr/bin/lesspipe %s WINDOWPATH=7 DISPLAY=:0.0 LESSCLOSE=/usr/bin/lesspipe %s %s XAUTHORITY=/home/eran/.Xauthority COLORTERM=gnome-terminal _=/usr/bin/printenv Thanks, Eran On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote: 1) Is this a new or existing Hadoop cluster? 2) What Java version are you using and what is your environment? Dennis Eran Zinman wrote: Hi, Running new Nutch version status: 1. 
Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode). 2. Nutch doesn't work when I setup it to work with Hadoop either in a single or cluster setup. *I'm getting an exception: * ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded I thought it might be a good idea that I'll attach my Hadoop conf files, so here they are: *core-site.xml* configuration property namefs.default.name/name valuehdfs://10.0.0.2:9000//value description The name of the default file system. Either the literal string local or a host:port for NDFS. /description /property /configuration *mapred-site.xml* configuration property namemapred.job.tracker/name value10.0.0.2:9001/value description The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task. /description /property property namemapred.system.dir/name value/my_crawler/filesystem/mapreduce/system/value /property property namemapred.local.dir/name value/my_crawler/filesystem/mapreduce/local/value /property /configuration *hdfs-site.xml* configuration property namedfs.name.dir/name value/my_crawler/filesystem/name/value /property property namedfs.data.dir/name value/my_crawler/filesystem
Re: Nutch Hadoop 0.20 - Exception
Hi all, thanks Dennis - you helped me solve the problem. The problem was that I had two versions of jetty in my lib folder. I deleted the old version and viola - it works. The problem is that both versions exist in the SVN! Altough I took a fresh copy of the SVN I had both versions in my lib folder. I think we need to remove the old version from the SVN so people like me won't get confused ... Thanks ! Eran. On Wed, Dec 9, 2009 at 4:10 PM, Eran Zinman zze...@gmail.com wrote: Hi Dennis, Thanks for trying to help. I don't know what fresh install means exactly. Here is what I've done: 1) Downloaded latest version of Nutch from the SVN to a new folder. 2) Copied all the custom plugins I've written to the new folder 3) Edited all configuration files. 4) Executed ant package. 5) Run the new Nutch... and got this error. What did I miss? Thanks, Eran On Wed, Dec 9, 2009 at 3:36 PM, Dennis Kubes ku...@apache.org wrote: Did you do a fresh install of Nutch with Hadoop 0.20 or did you just copy over the new jars? The sealing violation is multiple of the same jars being loaded and the Jetty versions changed between 0.19 and 0.20 for Hadoop? Dennis Eran Zinman wrote: Hi Dennis, 1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format and it still didn't work... 2) I'm using: java version 1.6.0_0 OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12) OpenJDK Client VM (build 14.0-b08, mixed mode, sharing) 3) My environment variables: ORBIT_SOCKETDIR=/tmp/orbit-eran SSH_AGENT_PID=3533 GPG_AGENT_INFO=/tmp/seahorse-Gq6lRI/S.gpg-agent:3557:1 TERM=xterm SHELL=/bin/bash XDG_SESSION_COOKIE=1a02c2275727547fa7209ad54a91276c-1260199857.905267-2000911890 GTK_RC_FILES=/etc/gtk/gtkrc:/home/eran/.gtkrc-1.2-gnome2 WINDOWID=54653392 GTK_MODULES=canberra-gtk-module USER=eran LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=0 0;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36: GNOME_KEYRING_SOCKET=/tmp/keyring-0Vt0yu/socket SSH_AUTH_SOCK=/tmp/keyring-0Vt0yu/socket.ssh SESSION_MANAGER=local/eran:/tmp/.ICE-unix/3387 USERNAME=eran DESKTOP_SESSION=default PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games GDM_XSERVER_LOCATION=local PWD=/home/eran JAVA_HOME=/usr/lib/jvm/default-java/ LANG=en_US.UTF-8 GDM_LANG=en_US.UTF-8 GDMSESSION=default HISTCONTROL=ignoreboth SHLVL=1 HOME=/home/eran GNOME_DESKTOP_SESSION_ID=this-is-deprecated LOGNAME=eran XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/usr/share/gdm/ 
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E4IJ0hMrD8,guid=c3caaf3e590c65a58904ca7f4b1d1fb3 LESSOPEN=| /usr/bin/lesspipe %s WINDOWPATH=7 DISPLAY=:0.0 LESSCLOSE=/usr/bin/lesspipe %s %s XAUTHORITY=/home/eran/.Xauthority COLORTERM=gnome-terminal _=/usr/bin/printenv Thanks, Eran On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote: 1) Is this a new or existing Hadoop cluster? 2) What Java version are you using and what is your environment? Dennis Eran Zinman wrote: Hi, Running new Nutch version status: 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode). 2. Nutch doesn't work when I setup it to work with Hadoop either in a single or cluster setup. *I'm getting an exception: * ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded I thought it might be a good idea that I'll attach my Hadoop conf files, so here they are: *core-site.xml* configuration property namefs.default.name/name valuehdfs://10.0.0.2:9000//value description The name of the default file system. Either the literal string local or a host:port for NDFS. /description /property /configuration *mapred-site.xml* configuration property namemapred.job.tracker/name value10.0.0.2:9001/value description The host and port that the MapReduce job tracker runs
Nutch Hadoop 0.20 - Exception
Hi, Just upgraded to the latest version of Nutch with Hadoop 0.20. I'm getting the following exception in the namenode log and DFS doesn't start: 2009-12-06 15:48:32,523 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded at java.net.URLClassLoader.defineClass(URLClassLoader.java:235) at java.net.URLClassLoader.access$000(URLClassLoader.java:56) at java.net.URLClassLoader$1.run(URLClassLoader.java:195) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) Any help will be appreciated ... quite stuck with this. Thanks, Eran
Nutch - create my own repository
Hi, I'm developing my own set of tools, plugins and some minor code changes to Nutch. I still want to get updates from the main Nutch repository, but I would like to keep my own SVN repository for tracking my local code changes. I'm using plain command-line SVN (I have no experience with Git) to track my changes. My question is - can I create a branch from the main repository into my own repository, which will track only my changes and keep getting updates from the main Nutch repository with easy merges? Thanks, Eran
Re: Efficient focused crawling
Thanks for your help, MilleBii! I will definitely try the square-root option - but is that only valid for outlinks, or does it also affect pages linking to the page? Did you try implementing automatic regex generation? I'm doing focused crawling, but I'm also thinking about scaling it in the future. Also, I will be happy to hear if anyone else has any other suggestion (or an already implemented strategy) - I think this issue affects most of the Nutch community, at least people that use Nutch for focused crawling. Thanks, Eran On Fri, Nov 27, 2009 at 8:29 PM, MilleBii mille...@gmail.com wrote: Well, what I have created for my own application is a topical-scoring plugin: 1. first I needed to score the pages after parsing, based on my regular expression 2. then I searched several options on how to boost the score of those pages... I have only found a way to boost the score of the outlinks of the pages that have content which I wanted. Not perfect, but so be it - there is a high likelihood in my case that adjacent pages also have content which I want. 3. then how to boost the score... this took me a while to figure out; I leave you all the options I tried. The good compromise I found is the following: if the page has content I want and score < 1.0f then score = squareroot(score)... in this way you are adding weight to the pages which have the content you are looking for (since the score is usually below 1, squareroot(x) is bigger than x). Of course there are some downsides to that approach: it is more difficult to get the crawler to go outside sites that have the content you are looking for - it is a bit like digging a hole, and until you have finished the hole it will keep the crawler exploring it... experimentally I have found that it works nicely for me though; if you limit the number of URLs per site it won't spend its life on them. We could try to generalize this plug-in by putting the regular expression as a config item, because that is really the only thing which is specific to my application, I believe. 2009/11/27 Eran Zinman zze...@gmail.com Hi all, I'm trying to figure out ways to improve Nutch focused crawling efficiency. I'm looking for certain pages inside each domain that contain content I'm interested in. I can't know whether a certain URL contains what I'm looking for until I parse it and do some analysis on it. Basically I was thinking about two methods to improve crawling efficiency: 1) Whenever a page is found which contains the data I'm looking for, improve the overall score of all pages linking to it (and pages linking to them, and so on...), assuming they have other links that point to content I'm looking for. 2) Once I have already found several pages that contain relevant data - automatically create a regex to match new URLs which might contain usable content. I've started to read about the OPIC scoring plugin but was unable to understand whether it can help me with issue no. 1. Any ideas, guys? I will be very grateful for any help or pointers in the right direction. Thanks, Eran -- -MilleBii-
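For reference, a stripped-down sketch of the square-root boost MilleBii describes (the class, method and regex below are placeholders, not his actual plugin): pages whose content matches the topic pattern get a sqrt-boosted score, which raises values below 1.0 without letting them exceed it.

import java.util.regex.Pattern;

public class TopicalBoostSketch {

  // Compiled once, statically, to avoid rebuilding the regex for every page
  // (the point MilleBii makes about regex creation cost).
  private static final Pattern TOPIC = Pattern.compile("(?i)games");

  // sqrt(x) > x when 0 < x < 1, so matching pages are boosted while their
  // scores stay within the usual 0..1 range.
  public static float boost(String pageText, float score) {
    if (score > 0.0f && score < 1.0f && TOPIC.matcher(pageText).find()) {
      return (float) Math.sqrt(score);
    }
    return score;
  }
}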
Re: Efficient focused crawling
Hi MilleBii, I think you misinterpreted what I've meant. 1. Regarding Regex - I know I can build a Regex beforehand to identify URLs, but I will have to create one manually for each domain I'm crawling - not scalable. I'm looking for a way to build Regex automatically using automatic machine learning. I know to identify if a certain page contains the content I'm looking for only after I parse it. I want my crawler to create automatic Regex patterns based on it's crawling experience. 2. I want to boost inlinks not necessarily to crawl them again, but to crawl in higher priority other links they link to, taking under assumption these links might contain the content I'm looking for. Thanks for your help! Eran On Sat, Nov 28, 2009 at 10:56 AM, MilleBii mille...@gmail.com wrote: oops : why it shouldn't work for others. 2009/11/28 MilleBii mille...@gmail.com I just use the Java build-in regex features... and therefore just supplied the string, which I design for my case using RegexBuddy a really great tool by the way. Pay attention though at static creation in order to avoid regex creation at each plug-in load and run-time hit. Didn't find a way to modify inlinks... on the other hand inlinks you have gone through already when you are evaluating a given page so I did not bother and it works fine for me, I don't see why it should work for others. 2009/11/28 Eran Zinman zze...@gmail.com Thanks for your help MillBii! I will definitely try the squareroot option - but is that only valid for outlinks or also affects pages linking to the page? Did you try implementing automatic Regex generation? I'm doing focused crawling but I'm also thinking about scaling it in the future. Also I will be happy to know if anyone else have any other suggestion (or already implemented strategy) - I think this issue affects most of the Nutch community - at least people that use Nutch for focused crawling. Thanks, Eran On Fri, Nov 27, 2009 at 8:29 PM, MilleBii mille...@gmail.com wrote: Well I have created for my own application is topical-scoring plugin : 1. first I needed to score the pages after parsing based on my regular expression 2. then I searched several options on to how boost score of that pages... I have only found a way to boost the score of the outlinks of these pages that have content which I wanted. Not perfect but so be it there is a high likelyhood in my case that adjacent pages have also content which I want. 3. then how to boost the score... this took me a while to figure out, I leave you all the options I tried. The good comprise I found is the following: if the page has content I want and score 1.0f than score= squareroot(score)... in this way you are adding weight to the pages which have content you are looking (since score is usually below 1. squareroot(x) is bigger than x). Of course there are some down side to that approach, it is more difficult to get the crawler to go outsides sites that have content your are looking for, it is a bit like digging a hole and until you have finished the hole it will get the crawler to explore it... experimentally I have found that it works nicely for me though, if you limit the nbre of URLS per site it won't spend it's life on them. We could try to generalize this plug-in by putting the regular expression as as config item because that is really the only thing which is specific to my application I believe. 2009/11/27 Eran Zinman zze...@gmail.com Hi all, I'm try to figure out ways to improve Nutch focused crawling efficiency. 
I'm looking for certain pages inside each domain which contains content I'm looking for. I'm unable to know that a certain URL contains what I'm looking for unless I parse it and do some analysis on it. Basically I was thinking about two methods to improve crawling efficiency: 1) Whenever a page is found which contains the data I'm looking for, improve overall score for all pages linking to it (and pages linking to them and so on...), assuming they have other links that point to content I'm looking for. 2) Once I already found several pages that contain relevant data - create a Regex automatically to match new urls which might contain usable content. I've started to read about the OPIC-score plugin but was unable to understand if it can help me or not with issue no. 1. Any idea guys? I will be very grateful for any help or things that can point me in the right direction. Thanks, Eran -- -MilleBii- -- -MilleBii- -- -MilleBii-
Efficient focused crawling
Hi all, I'm trying to figure out ways to improve Nutch focused crawling efficiency. I'm looking for certain pages inside each domain that contain content I'm interested in. I can't know whether a certain URL contains what I'm looking for until I parse it and do some analysis on it. Basically I was thinking about two methods to improve crawling efficiency: 1) Whenever a page is found which contains the data I'm looking for, improve the overall score of all pages linking to it (and pages linking to them, and so on...), assuming they have other links that point to content I'm looking for. 2) Once I have already found several pages that contain relevant data - automatically create a regex to match new URLs which might contain usable content. I've started to read about the OPIC scoring plugin but was unable to understand whether it can help me with issue no. 1. Any ideas, guys? I will be very grateful for any help or pointers in the right direction. Thanks, Eran
Re: Nutch - Focused crawling
Thanks Julien, I can confirm this patch works perfectly and does a good job of keeping a good crawl rate. We have doubled the rate of information retrieval by using a time limit on the fetch queue. Thanks, Eran On Mon, Nov 23, 2009 at 1:28 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi guys, I've separated both functionalities into separate patches on JIRA (NUTCH-769 / NUTCH-770). Julien -- DigitalPebble Ltd http://www.digitalpebble.com 2009/11/21 Julien Nioche lists.digitalpeb...@gmail.com Hi Eran, There is currently no time limit implemented in the Fetcher. We implemented one which worked quite well in combination with another mechanism which clears the URLs from a pool if more than x successive exceptions have been encountered. This limits cases where a site or domain is not responsive. I might try and submit a patch if I find the time next week, our code has been heavily modified with the previous patches which have not been committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658) so I'd need to spend a bit of time extracting this specific functionality from the rest. Best, Julien -- DigitalPebble Ltd http://www.digitalpebble.com 2009/11/21 Eran Zinman zze...@gmail.com Hi, We've been using Nutch for focused crawling (right now we are crawling about 50 domains). We've encountered the long-tail problem - We've set TopN to 100,000 and generate.max.per.host to about 1500. 90% of all domains finish fetching after 30min, and the other 10% takes an additional 2.5 hours - making the slowest domain the bottleneck of the entire fetch process. I've read Ken Krugler document and he's describing the same problem: http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/ I'm wondering - does anyone have a suggestion on what's the best way to tackle this issue? I think that Ken suggested to limit the fetch time - for example say terminate after 1 hour, even if you are not done yet, is that feature available in Nutch? I will be happy to try and contribute code if required! Thanks, Eran
Nutch - Focused crawling
Hi, We've been using Nutch for focused crawling (right now we are crawling about 50 domains). We've encountered the long-tail problem - we've set topN to 100,000 and generate.max.per.host to about 1500. 90% of all domains finish fetching after 30 minutes, and the other 10% take an additional 2.5 hours - making the slowest domain the bottleneck of the entire fetch process. I've read Ken Krugler's document and he describes the same problem: http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/ I'm wondering - does anyone have a suggestion on the best way to tackle this issue? I think Ken suggested limiting the fetch time - for example, terminate after 1 hour even if you are not done yet. Is that feature available in Nutch? I will be happy to try and contribute code if required! Thanks, Eran
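A conceptual sketch of that idea (not Nutch's actual fetcher code; all names below are made up): fetching stops once a wall-clock deadline passes, and whatever is left of the long tail simply stays behind for a later round.

public class TimeLimitedFetchSketch {

  private final long timeLimitMillis;

  public TimeLimitedFetchSketch(long timeLimitMinutes) {
    this.timeLimitMillis = timeLimitMinutes * 60L * 1000L;
  }

  public void run(Iterable<String> fetchQueue) {
    long deadline = System.currentTimeMillis() + timeLimitMillis;
    for (String url : fetchQueue) {
      if (System.currentTimeMillis() > deadline) {
        // Abandon the long tail; the remaining URLs are not lost, they can
        // be generated again in the next fetch cycle.
        break;
      }
      fetch(url);
    }
  }

  private void fetch(String url) {
    // Placeholder for the actual fetch of a single URL.
  }
}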
Re: Nutch Hadoop question
Hi all, I don't want to bother you guys too much... I've tried searching for this topic and doing some testing myself, but so far I've been quite unsuccessful. Basically, I wish to use some computers only for MapReduce processing and not for HDFS - does anyone know how this can be done? Thanks, Eran On Wed, Nov 11, 2009 at 12:19 PM, Eran Zinman zze...@gmail.com wrote: Hi all, I'm using Nutch with Hadoop with great pleasure - it works great and really increases crawling performance on multiple machines. I have two strong machines and two older machines which I would like to use. So far I've been using only the two strong machines with Hadoop. Now I would like to add the two less powerful machines to do some processing as well. My question is: right now HDFS is shared between the two powerful computers. I don't want the two other computers to store any content, as they have slow and unreliable hard disks. I just want the two other machines to do processing (i.e. MapReduce) and not store any content. Is that possible - or do I have to use HDFS on all machines that do processing? If it's possible to use a machine only for MapReduce - how is this done? Thank you for your help, Eran
Re: Nutch Hadoop question
Thanks for the help guys. On Fri, Nov 13, 2009 at 5:20 PM, Andrzej Bialecki a...@getopt.org wrote: TuxRacer69 wrote: Hi Eran, mapreduce has to store its data on HDFS file system. More specifically, it needs read/write access to a shared filesystem. If you are brave enough you can use NFS, too, or any other type of filesystem that can be mounted locally on each node (e.g. a NetApp). But if you want to separate the two groups of servers, you could build two separate HDFS filesystems. To separate the two setups, you will need to make sure there is no cross communication between the two parts, You can run two separate clusters even on the same set of machines, just configure them to use different ports AND different local paths. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Nutch Hadoop question
Hi all, I'm using Nutch with Hadoop with great pleasure - it works great and really increases crawling performance on multiple machines. I have two strong machines and two older machines which I would like to use. So far I've been using only the two strong machines with Hadoop. Now I would like to add the two less powerful machines to do some processing as well. My question is: right now HDFS is shared between the two powerful computers. I don't want the two other computers to store any content, as they have slow and unreliable hard disks. I just want the two other machines to do processing (i.e. MapReduce) and not store any content. Is that possible - or do I have to use HDFS on all machines that do processing? If it's possible to use a machine only for MapReduce - how is this done? Thank you for your help, Eran
including code between plugins
Hi, I've written my own plugin that does some custom parsing. I needed language identification in that plugin, and the language-identifier plugin is working great for my needs. However, I can't use the language-identifier plugin as it is, since I want to analyze only a small portion of the webpage. I've used the language identifier functions and they worked great in Eclipse, but when I try to compile my plugin I'm unable to, since it depends on the language-identifier source code. My question is - how can I use the language identifier code in my plugin without actually using the language-identifier plugin as-is? Thank you for your help! Thanks, Eran
Re: including code between plugins
Hi Andrzej, thank you so much! That worked like a charm! I've spent so much time trying to figure this out and you helped me solve it in 5 minutes! Thanks! Eran On Mon, Nov 2, 2009 at 11:13 AM, Andrzej Bialecki a...@getopt.org wrote: Eran Zinman wrote: Hi, I've written my own plugin that does some custom parsing. I needed language identification in that plugin, and the language-identifier plugin is working great for my needs. However, I can't use the language-identifier plugin as it is, since I want to analyze only a small portion of the webpage. I've used the language identifier functions and they worked great in Eclipse, but when I try to compile my plugin I'm unable to, since it depends on the language-identifier source code. My question is - how can I use the language identifier code in my plugin without actually using the language-identifier plugin as-is? You need to add the language-identifier plugin to the requires section in your plugin.xml, like this:

<requires>
  <import plugin="nutch-extensionpoints"/>
  <import plugin="language-identifier"/>
</requires>

-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Extract full urls from DOM
Hello everyone, I've created a plugin for Nutch 1.0 that extends the parser. This plugin extracts several kinds of information from the document DOM. In some cases I need to extract the href of a certain link. The link in the DOM is still relative, as it was originally written in the HTML document; for example, it might be a link with an href of /music. My question is - how can I make this link an absolute URL - for example, turn /music into http://www.example.com/music? Thanks a lot, Eran
Re: Extract full urls from DOM
Ken, Thanks a lot! You solved my problem. Thanks, Eran On Thu, Oct 29, 2009 at 2:35 PM, Ken Krugler kkrugler_li...@transpac.com wrote: On Oct 29, 2009, at 4:00am, Eran Zinman wrote: Hello everyone, I've created a plugin for Nutch 1.0 that extends the parser. This plugin extracts several kinds of information from the document DOM. In some cases I need to extract the href of a certain link. The link in the DOM is still relative, as it was originally written in the HTML document; for example, it might be a link with an href of /music. My question is - how can I make this link an absolute URL - for example, turn /music into http://www.example.com/music? new URL(baseUrl, relativeString) will return the full URL, leaving aside a few minor edge cases. The baseUrl will be the URL of the containing document, or the value of the (potentially relative) Location: response header field if it exists, or the value of the base tag in the head element, if that exists. -- Ken -- Ken Krugler TransPac Software, Inc. http://www.transpac.com +1 530-210-6378
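A tiny self-contained example of the java.net.URL approach Ken describes, using a made-up base URL:

import java.net.MalformedURLException;
import java.net.URL;

public class ResolveHrefExample {
  public static void main(String[] args) throws MalformedURLException {
    // The URL of the containing document (the base).
    URL base = new URL("http://www.example.com/artists/page.html");
    // Resolve the relative href against it.
    URL absolute = new URL(base, "/music");
    System.out.println(absolute); // prints http://www.example.com/music
  }
}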
Re: Plug-ins during Nutch Crawl
Hi Shreekanth Prabhu, I was facing the same problem. You don't need to recrawl all the URLs from scratch. Remember - you already have the segments on your hard disk. It really depends on where you've done the date parsing. If you did it as part of the parser, you can run the parser again. If you did it just before indexing, your job is even easier - you can just run the indexer again. Remember - each Nutch component can run as a standalone application. BTW - I'm looking for a Java-based date parser myself that can parse dates written by humans (or website owners) - is it possible for you to share your date parser, or, if you are using an open-source one, to give me a recommendation on one I can use? Thanks, Eran On Wed, Oct 21, 2009 at 9:47 AM, sprabhu_PN shreekanth.pra...@pinakilabs.com wrote: We have added a few plug-ins, such as a date parsing plug-in, that get exercised during a Nutch crawl and update a field in each index record. Now we find that we need to improve the plug-in and re-run it. Is the only option to crawl the whole index once again? Is there any way we can do a recrawl which will just exercise the newer versions of the plug-ins and take less time to do it? Thanks in advance. Regards Shreekanth Prabhu -- View this message in context: http://www.nabble.com/Plug-ins-during-Nutch-Crawl-tp25987956p25987956.html Sent from the Nutch - User mailing list archive at Nabble.com.
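On the date-parsing side, a very basic JDK-only sketch of trying a few known patterns in turn (the pattern list is just an example; genuinely free-form human-written dates usually need a dedicated library):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class SimpleDateGuesser {

  private static final String[] PATTERNS = {
      "yyyy-MM-dd", "dd/MM/yyyy", "MMM d, yyyy", "d MMMM yyyy"
  };

  // Returns the first successful parse, or null if no pattern matches.
  public static Date parse(String text) {
    for (String pattern : PATTERNS) {
      try {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ENGLISH);
        fmt.setLenient(false);
        return fmt.parse(text.trim());
      } catch (ParseException ignored) {
        // Try the next pattern.
      }
    }
    return null;
  }
}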
Re: Combining parsed data from two sources before indexing
Hi, I'm also quite interested in this feature. I want to combine information from two different pages, and I don't know which one will be downloaded first. Only when both are downloaded do I want to process them. Thanks, Eran On Wed, Sep 9, 2009 at 12:51 AM, Max S maximillian...@googlemail.com wrote: Hi all, How can I combine parsed data from two sources before indexing them? At the moment, the way I see it (correct me if I'm wrong), each fetched page is treated as a separate document. These documents are related only by their inlinks / outlinks. What if there is content that has been divided across a few web pages? How do I combine them before indexing? Regards Max S
DocumentFragment and XPath
Hi, I've created a plugin on Nutch 1.0 that extends the HtmlParseFilter. I wanted to extract some more information from the HTML document. I get all the parameters into the filter function, and then I wanted to run some searches using XPath on the DocumentFragment object. I tried to do something simple like extracting all h1 tags, but no matter what I do I always get 0 results. What is the relation between DocumentFragment and XPath? Is it even possible to use XPath on a DocumentFragment object?

// Requires javax.xml.xpath.*, org.w3c.dom.NodeList and the usual Nutch parse imports.
public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  Parse parse = parseResult.get(content.getUrl());
  Metadata metadata = parse.getData().getParseMeta();

  XPathFactory factory = XPathFactory.newInstance();
  XPath xpath = factory.newXPath();
  try {
    // Note: depending on the HTML parser used, element names in the DOM may be
    // upper-case, so a case-insensitive expression such as
    // //*[translate(local-name(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')='h1']
    // may be needed instead of //h1.
    XPathExpression expr = xpath.compile("//h1");
    // Evaluate against the DocumentFragment that was passed into the filter.
    Object result = expr.evaluate(doc, XPathConstants.NODESET);
    NodeList nodes = (NodeList) result;
    System.out.println("Found " + nodes.getLength() + " matches!");
    for (int i = 0; i < nodes.getLength(); i++) {
      // getTextContent() returns the heading text; getNodeValue() is null for elements.
      System.out.println(nodes.item(i).getTextContent());
    }
  } catch (XPathExpressionException e) {
    System.out.println("Error: " + e);
  }
  return parseResult;
}

Thanks, Eran