Nutch changes 0.9.txt
Hi,

Does anybody know what this means exactly:

8. NUTCH-338 - Remove the text parser as an option for parsing PDF files in parse-plugins.xml (Chris A. Mattmann via siren)

In my crawl log file it says:

Error parsing: http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf

This may be a stupid question, but does the Nutch crawler only retrieve and index links, i.e. URLs, and not PDFs? The .pdf extension isn't excluded in the crawl-urlfilter.txt file either, and I can see it in the parse-plugins.xml file:

Thanks
Paul
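In practice that error usually means no parser plugin is enabled for the PDF content type at all: Nutch does fetch the PDF, but NUTCH-338 removed the plain-text parser as a fallback for application/pdf in parse-plugins.xml, so parsing fails unless parse-pdf is activated. A minimal sketch of the two pieces involved (the plugin.includes value below is only an illustration; start from the value in your own nutch-default.xml and add "pdf" to the parse-(...) group):

  <!-- nutch-site.xml: make sure the PDF parser plugin is enabled -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>

  <!-- parse-plugins.xml: after NUTCH-338 the mapping for application/pdf
       should point only at parse-pdf (no parse-text fallback) -->
  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
  </mimeType>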
Re: Help please trying to crawl local file system
Did you set the agent name in the Nutch configuration? I think even when crawling only the local file system the agent name still needs to be set. If it is not set, I believe nothing is fetched and errors are thrown, but you would only see this if your logging was set up for it.

Dennis Kubes

jim shirreffs wrote:

I googled and googled and googled. I am trying to crawl my local file system and can't seem to get it right. I use this command

bin/nutch crawl urls -dir crawl

My urls dir contains one file (files) that looks like this

file:///c:/joms

c:/joms exists. I've modified the config file crawl-urlfilter.txt

#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):
# skip everything else . web spaces
#-.
+.*

And the config file nutch-site.xml, adding

plugin.includes protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic
file.content.limit -1

And lastly I've modified regex-urlfilter.txt

#file systems
+^file:///c:/top/directory/
-.
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept anything else
+.

I don't get any errors but nothing gets crawled either. If anyone can point out my mistake(s) I would greatly appreciate it.

thanks in advance
jim s

ps it would also be nice to know this email is getting into the nutch-users mailing list
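For reference, the agent name and its related properties go in conf/nutch-site.xml; a minimal sketch (all four values are placeholders to be replaced with your own):

  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>test crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler at example.com</value>
  </property>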
Nutch 0.9 officially released!
Hi Folks, After some hard work from all folks involved, we've managed to push out Apache Nutch, release 0.9. This is the second release of Nutch based entirely on the underlying Hadoop platform. This release includes several critical bug fixes, as well as key speedups described in more detail at Sami Siren's blog: http://blog.foofactory.fi/2007/03/twice-speed-half-size.html See the list of changes made in this version: http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt The release is available here. http://www.apache.org/dyn/closer.cgi/lucene/nutch/ Special thanks to (in no particular order): Andrzej Bialecki, Dennis Kubes, Sami Siren, and the rest of the Nutch development team for providing lots of help along the way, and for allowing me to be the release manager! Enjoy the new release! Cheers, Chris
Re: Unable to load native-hadoop library
Yeah, it is 32-bit, and it is the 1.5.0_04 JDK. Lots of the commands throw this warning. Just for example,

bin/nutcher readdb nutcherdata/test/crawl/crawldb/ -stats

says

2007-04-06 08:58:09,992 WARN util.NativeCodeLoader (NativeCodeLoader.java:(51)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Andrzej Bialecki wrote:

wangxu wrote:
Linux wangxu.com 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux

Andrzej Bialecki wrote:

wangxu wrote:
When I use nutch-nightly0.9, I got this:
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
And when I echo $JAVA_LIBRARY_PATH, I got:
JAVA_LIBRARY_PATH: nutch/lib/native/Linux-i386-32
How can I correct it?

(Please send Nutch-related questions first to the Nutch groups.) What is your operating system (uname -a)? Currently, native libs are available only for 32-bit JVMs - so if you are running a 64-bit JVM it won't work. Also, I assume you are using a Sun JDK 1.5 or newer. If all of the above is correct, then you could try to send us the complete command that the bin/nutch script comes up with - simply echo the last command just before it executes, and copy this.
Re: Run Job Crashing
Figured this one out, just in case some other newbie has the same problem. Windows places hidden files in the urls dir if one customizes the folder view. These files must be removed, or Nutch thinks they are URL files and processes them. Once the hidden files are removed all is well.

jim s

- Original Message -
From: "jim shirreffs" <[EMAIL PROTECTED]>
To: "nutch lucene apache"
Sent: Thursday, April 05, 2007 11:51 AM
Subject: Run Job Crashing

Nutch-0.8.1
Windows 2000/Windows XP
Java 1.6
cygwin1.dll nov/2004 and cygwin1 latest release

Very strange, ran the crawler once

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and everything worked until this error

Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070404094549
Indexer: adding segment: crawl/segments/20070404095026
Indexer: adding segment: crawl/segments/20070404095504
Optimizing index.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

Tried running the crawler again

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and now I consistently get this error

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run java in NUTCH_JAVA_HOME D:\java\jdk1.6
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

I have one file localhost in my url dir and it looks like this

http://localhost

My crawl-urlfilter.xml looks like this

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|swf|sw):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/
# skip everything else

My nutch-site.xml looks like this

http.agent.name RadioCity
http.agent.description nutch web crawler
http.agent.url www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch
http.agent.email jpsb at flash.net

I am getting the same behavior on two separate hosts. If anyone can suggest what I might be doing wrong I would greatly appreciate it.

jim s

PS tried to mail from a different host but did not see the message in the mailing list. Hope only this message gets into the mailing list.
Help please trying to crawl local file system
I googled and googled and googled. I am trying to crawl my local file system and can't seem to get it right. I use this command

bin/nutch crawl urls -dir crawl

My urls dir contains one file (files) that looks like this

file:///c:/joms

c:/joms exists. I've modified the config file crawl-urlfilter.txt

#-^(file|ftp|mailto|sw|swf):
-^(http|ftp|mailto|sw|swf):
# skip everything else . web spaces
#-.
+.*

And the config file nutch-site.xml, adding

plugin.includes protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic
file.content.limit -1

And lastly I've modified regex-urlfilter.txt

#file systems
+^file:///c:/top/directory/
-.
# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept anything else
+.

I don't get any errors but nothing gets crawled either. If anyone can point out my mistake(s) I would greatly appreciate it.

thanks in advance
jim s

ps it would also be nice to know this email is getting into the nutch-users mailing list
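For readability, the two nutch-site.xml additions mentioned in the post correspond to property entries along these lines (same values as quoted above):

  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>

Note that, as pointed out in the reply elsewhere in this digest, http.agent.name must also be set or nothing is fetched.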
Run Job Crashing
Nutch-0.8.1
Windows 2000/Windows XP
Java 1.6
cygwin1.dll nov/2004 and cygwin1 latest release

Very strange, ran the crawler once

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and everything worked until this error

Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070404094549
Indexer: adding segment: crawl/segments/20070404095026
Indexer: adding segment: crawl/segments/20070404095504
Optimizing index.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)

Tried running the crawler again

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and now I consistently get this error

$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
run java in NUTCH_JAVA_HOME D:\java\jdk1.6
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

I have one file localhost in my url dir and it looks like this

http://localhost

My crawl-urlfilter.xml looks like this

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|swf|sw):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*localhost/
# skip everything else

My nutch-site.xml looks like this

http.agent.name RadioCity
http.agent.description nutch web crawler
http.agent.url www.RadioCity.dynip.com/RadioCity/HtmlPages/Nutch
http.agent.email jpsb at flash.net

I am getting the same behavior on two separate hosts. If anyone can suggest what I might be doing wrong I would greatly appreciate it.

jim s

PS tried to mail from a different host but did not see the message in the mailing list. Hope only this message gets into the mailing list.
Re: Using nutch as a web crawler
Nutch has a file called crawl-urlfilter.txt where you can set your site domain or site list, so Nutch will only crawl that list. Download Nutch and see it working; that is better for you :). Take a look: http://lucene.apache.org/nutch/tutorial8.html

Regards,

On 4/5/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
Thanks. Can you please tell me how I can plug in my own handling when Nutch sees a site, instead of building the search database for that site?

On 4/3/07, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> I have total certainty that Nutch is what you are looking for. Take a look
> at Nutch's documentation for more details and you will see :).
>
> On 4/3/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I would like to know if it is a good idea to use the Nutch web
> > crawler?
> > Basically, this is what I need:
> > 1. I have a list of web sites.
> > 2. I want the web crawler to go through each site and parse the anchors; if
> > a link is in the same domain, repeat the same steps for 3 levels.
> > 3. For each link, write to a new file.
> >
> > Is Nutch a good solution? Or is there another, better open source
> > alternative for my purpose?
> >
> > Thank you.
> >
>
> --
> Lourival Junior
> Universidade Federal do Pará
> Curso de Bacharelado em Sistemas de Informação
> http://www.ufpa.br/cbsi
> Msn: [EMAIL PROTECTED]
>

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
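As a rough sketch, a crawl-urlfilter.txt restricted to one domain from your list could look like this (example.com is a placeholder; add one '+' line per site you want to stay inside and keep a final '-.' so everything else is skipped):

# accept hosts in example.com only
+^http://([a-z0-9]*\.)*example.com/
# skip everything else
-.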
RE: help needed on filters
All your regexes look fine; however, I would do the following:

^- http://([a-z0-9]*\.)*example.com/stores/.*/merch
#ignore anything with ? in it
^- http://([a-z0-9]*\.)*example.com.*\?
#allow only home page
^+ http://([a-z0-9]*\.)*example.com/$
#allow only htm file
^+ http://([a-z0-9]*\.)*example.com/.*?\.htm
#allow only do file
^+ http://([a-z0-9]*\.)*example.com/.*?\.do

HTH,
Gal.

> -Original Message-
> From: cha [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 05, 2007 10:34 AM
> To: nutch-user@lucene.apache.org
> Subject: help needed on filters
>
> Hi,
>
> I want to crawl only .htm, .html and .do pages from my web site. Secondly, I
> want to ignore the following urls from crawling:
>
> http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
> http://www.example.com/stores/abcd/merch-cats/abcd.*
> http://www.example.com/stores/abcd/merch/abd.*
>
> I have set all the filters in the regex-urlfilter and crawl-urlfilter files.
> Following is just the code which fulfills my purpose:
>
> # skip URLs containing certain characters as probable queries, etc.
> -^http://www.example.com/stores/.*/merch.*
>
> # accept hosts in MY.DOMAIN.NAME
>
> +^http://([a-z0-9]*\.)*example.com/.*\.htm$
> +http://([a-z0-9]*\.)*example.com/.*\.do
> +http://([a-z0-9]*\.)*example.com/$
>
> It crawls all the required pages correctly; the only problem is that I was
> getting ? or some other characters after htm, so I added the htm$.
>
> But after adding that, it is not crawling the merchant pages and neglects
> lots of urls which I require.
>
> So I don't know what to do.
>
> Please let me know with your valuable suggestions.
>
> Cheers,
> Cha
> --
> View this message in context: http://www.nabble.com/help-needed-on-filters-tf3530069.html#a9851344
> Sent from the Nutch - User mailing list archive at Nabble.com.
Re: [Nutch-general] Removing pages from index immediately
[EMAIL PROTECTED] wrote:

Hi Enis,

Right, I can easily delete the page from the Lucene index, though I'd prefer to follow the Nutch protocol and avoid messing something up by touching the index directly. However, I don't want that page to re-appear in one of the subsequent fetches. Well, it won't re-appear, because it will remain missing, but it would be great to be able to tell Nutch to "forget it" "from everywhere". Is that doable? I could read and re-write the *Db Maps, but that's a lot of IO... just to get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch flags a given page as "forget this page as soon as possible" and it just happens later on.

Somehow you need to flag those pages, and keep track of them, so they have to remain in the CrawlDb. The simplest way to do this is, I think, through the scoring filter API - you can add your own filter which, during the updatedb operation, flags unwanted urls (by means of putting a piece of metadata in the CrawlDatum), and then during the generate step it checks this metadata and returns generateScore = Float.MIN_VALUE - which means this page will never be selected for fetching as long as there are other unfetched pages. You can also modify the Generator to completely skip such flagged pages.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web - Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: [Nutch-general] Removing pages from index immediately
Andrzej Bialecki wrote:

[EMAIL PROTECTED] wrote:

Hi Enis,

Right, I can easily delete the page from the Lucene index, though I'd prefer to follow the Nutch protocol and avoid messing something up by touching the index directly. However, I don't want that page to re-appear in one of the subsequent fetches. Well, it won't re-appear, because it will remain missing, but it would be great to be able to tell Nutch to "forget it" "from everywhere". Is that doable? I could read and re-write the *Db Maps, but that's a lot of IO... just to get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch flags a given page as "forget this page as soon as possible" and it just happens later on.

Somehow you need to flag those pages, and keep track of them, so they have to remain in the CrawlDb. The simplest way to do this is, I think, through the scoring filter API - you can add your own filter which, during the updatedb operation, flags unwanted urls (by means of putting a piece of metadata in the CrawlDatum), and then during the generate step it checks this metadata and returns generateScore = Float.MIN_VALUE - which means this page will never be selected for fetching as long as there are other unfetched pages. You can also modify the Generator to completely skip such flagged pages.

Maybe we should permanently remove the urls that have failed fetching k times from the crawldb during the updatedb operation. Since the web is highly dynamic, there can be as many gone sites as new sites (or slightly fewer). As far as I know, once a url is entered into the crawldb it will stay there with one of the possible states: STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_DB_GONE, STATUS_LINKED. Am I right? This way Otis's case would also be resolved.
Re: Nutch Step by Step Maybe someone will find this useful ?
Great work - could you just post these into the Nutch wiki as a step-by-step tutorial for newcomers?

zzcgiacomini wrote:

I have spent some time playing with nutch-0 and collecting notes from the mailing lists... maybe someone will find these notes useful and could point out my mistakes. I am not at all a Nutch expert...

-Corrado

0) CREATE NUTCH USER AND GROUP

Create a nutch user and group and perform all the following logged in as the nutch user. Put these lines in your .bash_profile:

export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH

1) GET HADOOP AND NUTCH

Download the nutch and hadoop trunks as explained on http://lucene.apache.org/hadoop/version_control.html

(svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk)
(svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk)

2) BUILD HADOOP

Ex: Build and produce the tar file:

cd hadoop/trunk
ant tar

To build hadoop with 64-bit native libraries, proceed as follows:

A) Download and install the latest lzo library (http://www.oberhumer.com/opensource/lzo/download/)
Note: the packages currently available for fc5 are too old.

tar xvzf lzo-2.02.tar.gz
cd lzo-2.02
./configure --prefix=/opt/lzo-2.02
make install

B) Compile the native 64-bit libs for hadoop if needed:

cd hadoop/trunk/src/native
export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
export JVM_DATA_MODEL=64
CCFLAGS="-I/opt/lzo-2.02/include" CPPFLAGS="-I/opt/lzo-2.02/include" ./configure
cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h

In config.h replace the line

#define HADOOP_LZO_LIBRARY libnotfound.so

with this one

#define HADOOP_LZO_LIBRARY "liblzo2.so"

make

3) BUILD NUTCH

The nutch-dev nightly trunk now comes with hadoop-0.12.jar, but you may want to put in the latest nightly-build hadoop jar:

mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
cd nutch/trunk
ant tar

4) INSTALL

Copy and untar the generated .tar.gz file on the machines that will participate in the engine activities. In my case I only have two identical machines available, called myhost2 and myhost1. On each of them I have installed the nutch binaries under /opt/nutch, while I have decided to have the hadoop distributed filesystem in a directory called hadoopFs located under a large disk mounted on /disk10.

On both machines create the directory:

mkdir /disk10/hadoopFs/

Copy the hadoop 64-bit native libraries if needed:

mkdir /opt/nutch/lib/native/Linux-x86_64
cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64

5) CONFIG

I will use myhost1 as the master machine running the namenode and jobtracker tasks; it will also run a datanode and tasktracker. myhost2 will only run a datanode and tasktracker.

A) On both machines change the conf/hadoop-site.xml configuration file. Here are the values I have used:

fs.default.name : myhost1.mydomain.org:9010
mapred.job.tracker : myhost1.mydomain.org:9011
mapred.map.tasks : 40
mapred.reduce.tasks : 3
dfs.name.dir : /opt/hadoopFs/name
dfs.data.dir : /opt/hadoopFs/data
mapred.system.dir : /opt/hadoopFs/mapreduce/system
mapred.local.dir : /opt/hadoopFs/mapreduce/local
dfs.replication : 2

"The mapred.map.tasks property tells how many tasks you want to run in parallel. This should be a multiple of the number of computers that you have. In our case, since we are starting out with 2 computers, we will have 4 map and 4 reduce tasks."

"The dfs.replication property states how many servers a single file should be replicated to before it becomes available. Because we are using 2 servers I have set this to 2."

Maybe you also want to change nutch-site.xml by adding http.redirect.max : 10, i.e. a different value than the default of 3.

B) Be sure that your conf/slaves file contains the names of the slave machines. In my case:

myhost1.mydomain.org
myhost2.mydomain.org

C) Create directories for pids and log files on both machines:

mkdir /opt/nutch/pids
mkdir /opt/
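As a sketch, the hadoop-site.xml values listed in step 5A above correspond to property entries like the following (only some of them shown; the hostnames and paths are the ones from the post):

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>myhost1.mydomain.org:9010</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>myhost1.mydomain.org:9011</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/opt/hadoopFs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/opt/hadoopFs/data</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>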
Re: [Nutch-general] Removing pages from index immediately
Hi Enis, Right, I can easily delete the page from the Lucene index, though I'd prefer to follow the Nutch protocol and avoid messing something up by touching the index directly. However, I don't want that page to re-appear in one of the subsequent fetches. Well, it won't re-appear, because it will remain missing, but it would be great to be able to tell Nutch to "forget it" "from everywhere". Is that doable? I could read and re-write the *Db Maps, but that's a lot of IO... just to get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch flags a given page as "forget this page as soon as possible" and it just happens later on. Thanks, Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Enis Soztutar <[EMAIL PROTECTED]> To: nutch-user@lucene.apache.org Sent: Thursday, April 5, 2007 3:29:55 AM Subject: Re: [Nutch-general] Removing pages from index immediately Since hadoop's map files are write once, it is not possible to delete some urls from the crawldb and linkdb. The only thing you can do is to create the map files once again without the deleted urls. But running the crawl once more as you suggested seems more appropriate. Deleting documents from the index is just lucene stuff. In your case it seems that every once in a while, you crawl the whole site, and create the indexes and db's and then just throw the old one out. And between two crawls you can delete the urls from the index. [EMAIL PROTECTED] wrote: > Hi, > > I'd like to be able to immediately remove certain pages from Nutch (index, > crawldb, linkdb...). > The scenario is that I'm using Nutch to index a single site or a set of > internal sites. Once in a while editors of the site remove a page from the > site. When that happens, I want to update at least the index and ideally > crawldb, linkdb, so that people searching the index don't get the missing > page in results and end up going there, hitting the 404. > > I don't think there is a "direct" way to do this with Nutch, is there? > If there really is no direct way to do this, I was thinking I'd just put the > URL of the recently removed page into the first next fetchlist and then > somehow get Nutch to immediately remove that page/URL once it hits a 404. > How does that sound? > > Is there a way to configure Nutch to delete the page after it gets a 404 for > it even just once? I thought I saw the setting for that somewhere a few > weeks ago, but now I can't find it. > > Thanks, > Otis > . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . > Simpy -- http://www.simpy.com/ - Tag - Search - Share > > > > - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys-and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV ___ Nutch-general mailing list Nutch-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-general
help needed on filters
Hi,

I want to crawl only .htm, .html and .do pages from my web site. Secondly, I want to ignore the following urls from crawling:

http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
http://www.example.com/stores/abcd/merch-cats/abcd.*
http://www.example.com/stores/abcd/merch/abd.*

I have set all the filters in the regex-urlfilter and crawl-urlfilter files. Following is just the code which fulfills my purpose:

# skip URLs containing certain characters as probable queries, etc.
-^http://www.example.com/stores/.*/merch.*

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*example.com/.*\.htm$
+http://([a-z0-9]*\.)*example.com/.*\.do
+http://([a-z0-9]*\.)*example.com/$

It crawls all the required pages correctly; the only problem is that I was getting ? or some other characters after htm, so I added the htm$.

But after adding that, it is not crawling the merchant pages and neglects lots of urls which I require.

So I don't know what to do.

Please let me know with your valuable suggestions.

Cheers,
Cha
--
View this message in context: http://www.nabble.com/help-needed-on-filters-tf3530069.html#a9851344
Sent from the Nutch - User mailing list archive at Nabble.com.
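For what it's worth, combining these requirements with the rules suggested in the "RE: help needed on filters" reply earlier in this digest gives a sketch along these lines (example.com stands in for the real site; rule order matters because the first matching pattern wins, so the exclusions come before the '+' rules):

# ignore the merchant pages and anything with a query string
-^http://([a-z0-9]*\.)*example.com/stores/.*/merch
-^http://([a-z0-9]*\.)*example.com/.*\?
# accept the home page, .htm/.html pages and .do pages
+^http://([a-z0-9]*\.)*example.com/$
+^http://([a-z0-9]*\.)*example.com/.*\.html?$
+^http://([a-z0-9]*\.)*example.com/.*\.do$
# skip everything else
-.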
Re: Nutch Step by Step Maybe someone will find this useful ?
2007/4/5, Enis Soztutar <[EMAIL PROTECTED]>:

Great work - could you just post these into the Nutch wiki as a step-by-step tutorial for newcomers?

Exactly what I wanted to say, both points. :)

Cheers,
t.n.a.
Re: [Nutch-general] Nutch Step by Step Maybe someone will find this useful ?
Corrado,

Would it be possible for you to add this to the Wiki? Also, there are several other tutorials:

http://lucene.apache.org/nutch/tutorial8.html
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/NutchHadoopTutorial

Maybe you can combine them?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

- Original Message -
From: zzcgiacomini <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Wednesday, April 4, 2007 10:53:54 AM
Subject: [Nutch-general] Nutch Step by Step Maybe someone will find this useful?
Re: Removing pages from index immediately
Since hadoop's map files are write once, it is not possible to delete some urls from the crawldb and linkdb. The only thing you can do is to create the map files once again without the deleted urls. But running the crawl once more as you suggested seems more appropriate. Deleting documents from the index is just lucene stuff. In your case it seems that every once in a while, you crawl the whole site, and create the indexes and db's and then just throw the old one out. And between two crawls you can delete the urls from the index. [EMAIL PROTECTED] wrote: Hi, I'd like to be able to immediately remove certain pages from Nutch (index, crawldb, linkdb...). The scenario is that I'm using Nutch to index a single site or a set of internal sites. Once in a while editors of the site remove a page from the site. When that happens, I want to update at least the index and ideally crawldb, linkdb, so that people searching the index don't get the missing page in results and end up going there, hitting the 404. I don't think there is a "direct" way to do this with Nutch, is there? If there really is no direct way to do this, I was thinking I'd just put the URL of the recently removed page into the first next fetchlist and then somehow get Nutch to immediately remove that page/URL once it hits a 404. How does that sound? Is there a way to configure Nutch to delete the page after it gets a 404 for it even just once? I thought I saw the setting for that somewhere a few weeks ago, but now I can't find it. Thanks, Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share
Re: ERROR org.apache.nutch.protocol.http.Http:?java.net.SocketTimeoutException: Read timed out
HI, What I can suggest you, at this moment is try to read the properties value of default.xml and find out which property deals with Server socket connection, then only you will be able to mention that property value in you nutch-site.xml. I havn't had done much with this.But will update if I get something related with this issue. Regards, Ratnesh, V2Solutions India cha wrote: > > HI Ratnesh, > > I am crawling the internet. I am able to get all the crawl pages but this > error do appear in my error log..I dont know what it mean for. I have used > two filter regex and crawl for my crawling..Is something do with that?? > > How should i eliminate the above menitioned error.Something need to be set > or modified in nutch-site.xml? > > Cheers, > cha > > Ratnesh,V2Solutions India wrote: >> >> This socket exception normally comes , if fetcher is not able to get the >> page to crawl?? >> I mean there is some problem with the server connection. >> if you r crawling for local stored pages, then check whether the server >> is started or not?? >> >> I have tested the same for my local crawl, but for internet specific >> crawl I don't have enough idea?? >> >> >> Ratnesh V2Solutions India >> >> >> cha wrote: >>> >>> HI ppl, >>> >>> when i crawl my website , it is giving me following error , though >>> crawling is doing fine. >>> >>> Can anyone tell me what the error is about?? Do i have to set anything >>> in nutch-site.xml?? >>> >>> Following are the error logs: >>> >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? java.net.SocketTimeoutException: >>> Read timed out >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.net.SocketInputStream.socketRead0(Native Method) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.net.SocketInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.BufferedInputStream.read1(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.BufferedInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.FilterInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.PushbackInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.FilterInputStream.read(Unknown Source) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:214) >>> >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.HttpResponse.(HttpResponse.java:146) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.Http.getResponse(Http.java:63) >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208) >>> >>> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? 
at >>> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? java.net.SocketTimeoutException: >>> Read timed out >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.net.SocketInputStream.socketRead0(Native Method) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.net.SocketInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.BufferedInputStream.read1(Unknown Source) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.BufferedInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.FilterInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.PushbackInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> java.io.FilterInputStream.read(Unknown Source) >>> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >>> org.apache.nutch.protocol.http.Http:? at >>> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:214) >>> >>> [20
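If the timeouts are coming from slow or overloaded remote servers rather than a local misconfiguration, the property in nutch-default.xml that governs the fetcher's socket read timeout is http.timeout (in milliseconds), and it can be overridden in nutch-site.xml. A sketch, assuming the 30-second value is only an illustration:

  <property>
    <name>http.timeout</name>
    <!-- network timeout in milliseconds; raise it if target servers respond slowly -->
    <value>30000</value>
  </property>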
Re: ERROR org.apache.nutch.protocol.http.Http:?java.net.SocketTimeoutException: Read timed out
HI Ratnesh, I am crawling the internet. I am able to get all the crawl pages but this error do appear in my error log..I dont know what it mean for. I have used two filter regex and crawl for my crawling..Is something do with that?? How should i eliminate the above menitioned error.Something need to be set or modified in nutch-site.xml? Cheers, cha Ratnesh,V2Solutions India wrote: > > This socket exception normally comes , if fetcher is not able to get the > page to crawl?? > I mean there is some problem with the server connection. > if you r crawling for local stored pages, then check whether the server is > started or not?? > > I have tested the same for my local crawl, but for internet specific crawl > I don't have enough idea?? > > > Ratnesh V2Solutions India > > > cha wrote: >> >> HI ppl, >> >> when i crawl my website , it is giving me following error , though >> crawling is doing fine. >> >> Can anyone tell me what the error is about?? Do i have to set anything in >> nutch-site.xml?? >> >> Following are the error logs: >> >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? java.net.SocketTimeoutException: >> Read timed out >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.net.SocketInputStream.socketRead0(Native Method) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.net.SocketInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.BufferedInputStream.read1(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.BufferedInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.FilterInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.PushbackInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.FilterInputStream.read(Unknown Source) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:214) >> >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.HttpResponse.(HttpResponse.java:146) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.Http.getResponse(Http.java:63) >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208) >> >> [2007-04-04 16:23:21,218] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:144) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? java.net.SocketTimeoutException: >> Read timed out >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.net.SocketInputStream.socketRead0(Native Method) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? 
at >> java.net.SocketInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.BufferedInputStream.read1(Unknown Source) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.BufferedInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.FilterInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,046] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.PushbackInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> java.io.FilterInputStream.read(Unknown Source) >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:214) >> >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.HttpResponse.(HttpResponse.java:146) >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.Http.getResponse(Http.java:63) >> [2007-04-04 16:23:22,062] [FetcherThread] ERROR >> org.apache.nutch.protocol.http.Http:? at >> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:208) >> >> [2007-0