Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Eric Osgood
During a crawl of about 3.8M tlds to a depth of 2, when I try to index the 
segments, I get the following error:

java.lang.StackOverflowError
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
Any help with this error would be much appreciated; I have encountered this 
before. 

Here are the last 10 lines of the hadoop.log file:

tail -n 10 hadoop.log.2010-01-10
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$Ques.match(Pattern.java:3691)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$Ques.match(Pattern.java:3691)
2010-01-11 00:31:53,221 WARN  io.UTF8 - truncating long string: 62492 chars, 
starting with java.lang.StackOverf



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Eric Osgood
bin/nutch -Xss1024k index crawl1/indexes crawl1/crawldb crawl1/linkdb 
crawl/segments/*
Exception in thread "main" java.lang.NoClassDefFoundError: index
Caused by: java.lang.ClassNotFoundException: index
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: index.  Program will exit.

Do you have to set the -Xss flag somewhere else?

Thanks, 

Eric 

On Jan 11, 2010, at 8:36 AM, Godmar Back wrote:

 Very intriguing, considering that we teach our students to avoid
 recursion where possible for this very reason.
 
 Googling reveals
 http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952 and
 http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507 so you
 could try increasing the Java stack size in bin/nutch (-Xss), or use
 an alternate regexp if you can.
 
 Just out of curiosity, why does a performance critical program such as
 Nutch use Sun's backtracking-based regexp implementation rather than
 an efficient Thompson-based one?  Do you need the additional
 expressiveness provided by PCRE?
 
 - Godmar
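
For anyone curious, a small self-contained demo (not Nutch code) of the effect Godmar describes: an alternation inside a repetition makes Sun's backtracking matcher recurse roughly once per input character, so a long enough input exhausts the default thread stack, and a larger -Xss raises the limit. The pattern and input length here are purely illustrative.

import java.util.regex.Pattern;

public class RegexStackDemo {
    public static void main(String[] args) {
        // Build a long input of repeated 'a' characters.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) sb.append('a');
        // Alternation inside a repetition recurses once per consumed character.
        Pattern p = Pattern.compile("(a|b)+c");
        try {
            System.out.println(p.matcher(sb).matches());
        } catch (StackOverflowError e) {
            System.err.println("StackOverflowError on input of length " + sb.length());
        }
        // Re-running with e.g. "java -Xss8m RegexStackDemo" pushes the
        // failure point to much longer inputs.
    }
}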
 
 On Mon, Jan 11, 2010 at 11:24 AM, Eric Osgood e...@lakemeadonline.com wrote:
 During a crawl of about 3.8M tlds to a depth of 2, when I try to index the 
 segments, I get the following error:
 
 java.lang.StackOverflowError
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
 Any help with this error would be much appreciated; I have encountered this 
 before.
 
 Here are the last 10 lines of the hadoop.log file:
 
 tail -n 10 hadoop.log.2010-01-10
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$Ques.match(Pattern.java:3691)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$Ques.match(Pattern.java:3691)
 2010-01-11 00:31:53,221 WARN  io.UTF8 - truncating long string: 62492 chars, 
 starting with java.lang.StackOverf
 
 
 
 Eric Osgood
 -
 Cal Poly - Computer Engineering, Moon Valley Software
 -
 eosg...@calpoly.edu, e...@lakemeadonline.com
 -
 www.calpoly.edu/~eosgood, www.lakemeadonline.com
 
 

Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Eric Osgood
How do I set the bin/nutch stack size and the hadoop job stack size?

--Eric 

On Jan 11, 2010, at 9:22 AM, Fuad Efendi wrote:

 Also, put it in Hadoop settings for tasks...
 
 http://www.tokenizer.ca/
 
 
 -Original Message-
 From: Godmar Back [mailto:god...@gmail.com]
 Sent: January-11-10 11:53 AM
 To: nutch-user@lucene.apache.org
 Subject: Re: Help Needed with Error: java.lang.StackOverflowError
 
 On Mon, Jan 11, 2010 at 11:50 AM, Eric Osgood e...@lakemeadonline.com
 wrote:
 Do you have to set the -Xss flag somewhere else?
 
 Yes, in bin/nutch - look for where it sets -Xmx
 
 - Godmar
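 
As a concrete illustration of that suggestion, a minimal sketch assuming the Nutch 1.0 launcher script (variable names such as JAVA_HEAP_MAX and NUTCH_OPTS are from memory of that script and may differ in your copy):

# in bin/nutch, near the existing heap setting
JAVA_HEAP_MAX=-Xmx1000m
# pass a larger thread stack to the same java invocation, for example by
# extending the options variable the script already appends to:
NUTCH_OPTS="$NUTCH_OPTS -Xss4m"

The failed attempt above is consistent with the script treating -Xss1024k as the command name, so java ends up looking for a main class called "index"; the flag needs to go inside the script rather than on the bin/nutch command line.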
 
 




Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Eric Osgood
In the hadoop-env.sh, how do you add such options as -Xss, -Xms, -Xmx? 

--Eric 

On Jan 11, 2010, at 9:34 AM, Mischa Tuffield wrote:

 You can set it in hadoop-env.sh, and then run it. Or you could add it to 
 your /etc/bashrc or the bashrc file of the user which runs hadoop. 
 
 Mischa
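 
For reference, a sketch of both places (values are illustrative; mapred.child.java.opts is generally the setting that reaches the per-task JVMs where the indexing map/reduce code actually runs, while hadoop-env.sh affects the Hadoop daemons):

# hadoop-env.sh -- picked up by the Hadoop daemons:
export HADOOP_OPTS="$HADOOP_OPTS -Xss4m"

<!-- hadoop-site.xml -- passed to every spawned map/reduce child JVM: -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Xss4m</value>
</property>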
 On 11 Jan 2010, at 17:26, Eric Osgood wrote:
 
 How do I set the bin/nutch stack size and the hadoop job stack size?
 
 --Eric 
 
 On Jan 11, 2010, at 9:22 AM, Fuad Efendi wrote:
 
 Also, put it in Hadoop settings for tasks...
 
 http://www.tokenizer.ca/
 
 
 -Original Message-
 From: Godmar Back [mailto:god...@gmail.com]
 Sent: January-11-10 11:53 AM
 To: nutch-user@lucene.apache.org
 Subject: Re: Help Needed with Error: java.lang.StackOverflowError
 
 On Mon, Jan 11, 2010 at 11:50 AM, Eric Osgood e...@lakemeadonline.com
 wrote:
 Do you have to set the -Xss flag somewhere else?
 
 Yes, in bin/nutch - look for where it sets -Xmx
 
 - Godmar
 
 
 
 
 
 ___
 Mischa Tuffield
 Email: mischa.tuffi...@garlik.com
 Homepage - http://mmt.me.uk/
 Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
 +44(0)20 8973 2465  http://www.garlik.com/
 Registered in England and Wales 535 7233 VAT # 849 0517 11
 Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
 

Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: ERROR: Too Many Fetch Failures

2009-11-20 Thread Eric Osgood
I have a 3-node cluster. I changed the solr server to one of the nodes  
rather than have the master node do both the master work and serve  
solr. I tried to crawl 100k urls again last night and failed with too many  
fetch failures during the map and shuffle errors during the reduce.  
This just started happening - the only new additions to the cluster  
would be the solr server and adding a Dell 2850 as a node. Here is my  
hadoop-site.xml:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>


<property>
  <name>fs.default.name</name>
  <value>hdfs://opel:9000</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>opel:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>30</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/filesystem/name</value>
</property>

<property>
  <name>fs.checkpoint.dir</name>
  <value>/home/hadoop/filesystem/name2</value>
  <final>true</final>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/filesystem/data</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/hadoop/filesystem/mapreduce/system</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/hadoop/filesystem/mapreduce/local</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

</configuration>

Let me know if you need any other information - I have no idea how to  
fix this problem.


Thanks,

Eric

On Nov 20, 2009, at 1:30 AM, Julien Nioche wrote:

It was probably a one-off, network related problem. Can you tell us a bit
more about your cluster configuration?

2009/11/19 Eric Osgood e...@lakemeadonline.com


Julien,

Thanks for your help, how would I go about fixing this error now that it is diagnosed?


On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote:

could be a communication problem between the node and the master. It is not
a fetching problem in the Nutch sense of the term but a Hadoop-related issue.

2009/11/19 Eric Osgood e...@lakemeadonline.com

This is the first time I have received this error while crawling. During a
crawl of 100K pages, one of the nodes had a task fail and cited "Too Many
Fetch Failures" as the reason. The job completed successfully but took about
3 times longer than normal. Here is the log output:


2009-11-19 11:19:56,377 WARN  mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
        at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
        at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
2009-11-19 11:19:56,380 WARN  mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN  mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_r_04_1

Can Anyone tell me how to resolve this error?

Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood http://www.calpoly.edu/%7Eeosgood 
http://www.calpoly.edu/%7Eeosgood,
www.lakemeadonline.com





--
DigitalPebble Ltd
http://www.digitalpebble.com








--
DigitalPebble Ltd
http://www.digitalpebble.com





ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood
This is the first time I have received this error while crawling.
During a crawl of 100K pages, one of the nodes had a task fail and
cited "Too Many Fetch Failures" as the reason. The job completed
successfully but took about 3 times longer than normal. Here is the
log output:



2009-11-19 11:19:56,377 WARN  mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
        at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
        at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
2009-11-19 11:19:56,380 WARN  mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN  mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_r_04_1


Can Anyone tell me how to resolve this error?

Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood

Julien,

Thanks for your help, how would I go about fixing this error now that  
it is diagnosed?


On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote:

could be a communication problem between the node and the master. It  
is not

a fetching problem in the Nutch sense of the term but a Hadoop-related
issue.

2009/11/19 Eric Osgood e...@lakemeadonline.com

This is the first time I have received this error while crawling. During a
crawl of 100K pages, one of the nodes had a task fail and cited "Too Many
Fetch Failures" as the reason. The job completed successfully but took about
3 times longer than normal. Here is the log output:


2009-11-19 11:19:56,377 WARN  mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
        at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
        at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
2009-11-19 11:19:56,380 WARN  mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN  mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_r_04_1

Can Anyone tell me how to resolve this error?

Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood http://www.calpoly.edu/%7Eeosgood,
www.lakemeadonline.com





--
DigitalPebble Ltd
http://www.digitalpebble.com





Re: ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood

Julien,

Another thought - I just installed tomcat and solr - would that  
interfere with hadoop?

On Nov 19, 2009, at 2:41 PM, Eric Osgood wrote:


Julien,

Thanks for your help, how would I go about fixing this error now  
that it is diagnosed?


On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote:

could be a communication problem between the node and the master. It is not
a fetching problem in the Nutch sense of the term but a Hadoop-related issue.

2009/11/19 Eric Osgood e...@lakemeadonline.com

This is the first time I have received this error while crawling. During a
crawl of 100K pages, one of the nodes had a task fail and cited "Too Many
Fetch Failures" as the reason. The job completed successfully but took about
3 times longer than normal. Here is the log output:


2009-11-19 11:19:56,377 WARN  mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:197)
        at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1575)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.hadoop.util.LineReader.close(LineReader.java:91)
        at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:169)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:346)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
2009-11-19 11:19:56,380 WARN  mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN  mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_r_04_1

Can Anyone tell me how to resolve this error?

Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood http://www.calpoly.edu/%7Eeosgood,
www.lakemeadonline.com





--
DigitalPebble Ltd
http://www.digitalpebble.com





Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



HELP - ERROR: org.apache.hadoop.fs.ChecksumException: Checksum Error

2009-10-29 Thread Eric Osgood

Hi,

I think that the checksum error during fetch is leading to a bunch of  
other errors I am getting when I try to run updatedb and generate after  
a fetch.


errors during updatedb:
---
java.lang.RuntimeException: problem advancing post rec#1018238
Caused by: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus

---
errors during generate:
---
java.lang.ArrayIndexOutOfBoundsException: 1107937
org.apache.hadoop.fs.ChecksumException: Checksum Error
java.io.IOException: Task: attempt_200910271443_0022_r_06_0 - The reduce copier failed

.
.
.
--

Any help would be greatly appreciated; I don't really know where to  
start to fix these problems since this is the first time I have  
encountered them - my guess is that they are rooted in the checksum error I  
sometimes get when fetching.


Thanks for the help,

Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



ERROR: Checksum Error

2009-10-27 Thread Eric Osgood

This is my second time receiving this error:

Map output lost, rescheduling: getMapOutput(attempt_200910271443_0012_m_01_0,0) failed:

org.apache.hadoop.fs.ChecksumException: Checksum Error
---
Does anyone know why I am getting this error and how to fix it? I  
tried deleting all my data nodes and formatting the namenode to no  
avail.

Thanks,

Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Targeting Specific Links

2009-10-22 Thread Eric Osgood

Andrzej,

Based on what you suggested below, I have begun to write my own  
scoring plugin:


in distributeScoreToOutlinks(), if the link contains the string I'm  
looking for, I set its score to kept_score and add a flag to the  
metaData in parseData ("KEEP", "true"). How do I check for this flag  
in generatorSortValue()? I only see a way to check the score, not a  
flag.


Thanks,

Eric


On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:


Eric Osgood wrote:

Andrzej,
How would I check for a flag during fetch?


You would check for a flag during generation - please check  
ScoringFilter.generatorSortValue(), that's where you can check for a  
flag and set the sort value to Float.MIN_VALUE - this way the link  
will never be selected for fetching.


And you would put the flag in CrawlDatum metadata when  
ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().



Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page, but  
still needing a total of X links per page: if I find the links I  
want, I add them to the list up until X; if I don't reach X, I add  
other links until X is reached. This way, I don't waste crawl time  
on non-relevant links.


You can modify the collection of target links passed to  
distributeScoreToOutlinks() - this way you can affect both which  
links are stored and what kind of metadata each of them gets.


As I said, you can also use just plain URLFilters to filter out  
unwanted links, but that API gives you much less control because  
it's a simple yes/no that considers just URL string. The advantage  
is that it's much easier to implement than a ScoringFilter.



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
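
To make the suggested approach concrete, here is a minimal sketch of the two relevant methods of such a scoring plugin. It assumes the Nutch 1.0 ScoringFilter API from memory (the remaining interface methods still have to be implemented as pass-throughs, so this is not compilable as-is), and the "/keep-me/" match rule is purely hypothetical:

import java.util.Collection;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.scoring.ScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

public class KeepLinkScoringFilter implements ScoringFilter {
  private static final Text KEEP = new Text("KEEP");
  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  // Called from ParseOutputFormat: mark interesting outlinks in their
  // CrawlDatum metadata so the flag travels into the crawldb.
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
      throws ScoringFilterException {
    for (Entry<Text, CrawlDatum> target : targets) {
      if (target.getKey().toString().contains("/keep-me/")) { // hypothetical rule
        target.getValue().getMetaData().put(KEEP, new Text("true"));
      }
    }
    return adjust;
  }

  // Called from the Generator: anything without the flag never gets selected.
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    if (!datum.getMetaData().containsKey(KEEP)) {
      return Float.MIN_VALUE;
    }
    return initSort;
  }

  // injectedScore, initialScore, passScoreBeforeParsing, passScoreAfterParsing,
  // updateDbScore and indexerScore are omitted here; implement them as no-ops
  // or pass-throughs of the existing values.
}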



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Scoring Filter Plugin

2009-10-22 Thread Eric Osgood

Hi,

I am trying to implement a scoring filter plugin that filters url  
links. I was told that if I set the score of a link to Float.MIN_VALUE,  
it would never get selected for fetch. In my plugin, if a link doesn't  
have a high enough score when it gets to generatorSortValue, I set  
its score to Float.MIN_VALUE, however it is still getting fetched. Is  
there another way to tell the fetcher not to fetch certain links based on  
their score?


Thanks,

Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Targeting Specific Links

2009-10-22 Thread Eric Osgood

Also,

In the scoring-links plugin, I set the return value of  
ScoringFilter.generatorSortValue() to Float.MIN_VALUE for all urls and  
it still fetched everything - maybe Float.MIN_VALUE isn't the correct  
value to return so a link never gets fetched?


Thanks,

Eric

On Oct 22, 2009, at 1:10 PM, Eric Osgood wrote:


Andrzej,

Based on what you suggested below, I have begun to write my own  
scoring plugin:


in distributeScoreToOutlinks(), if the link contains the string I'm  
looking for, I set its score to kept_score and add a flag to the  
metaData in parseData ("KEEP", "true"). How do I check for this flag  
in generatorSortValue()? I only see a way to check the score, not a  
flag.


Thanks,

Eric


On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:


Eric Osgood wrote:

Andrzej,
How would I check for a flag during fetch?


You would check for a flag during generation - please check  
ScoringFilter.generatorSortValue(), that's where you can check for  
a flag and set the sort value to Float.MIN_VALUE - this way the  
link will never be selected for fetching.


And you would put the flag in CrawlDatum metadata when  
ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().



Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page,  
but still needing a total of X links per page: if I find the links  
I want, I add them to the list up until X; if I don't reach X, I  
add other links until X is reached. This way, I don't waste crawl  
time on non-relevant links.


You can modify the collection of target links passed to  
distributeScoreToOutlinks() - this way you can affect both which  
links are stored and what kind of metadata each of them gets.


As I said, you can also use just plain URLFilters to filter out  
unwanted links, but that API gives you much less control because  
it's a simple yes/no that considers just URL string. The advantage  
is that it's much easier to implement than a ScoringFilter.



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: ERROR: current leaseholder is trying to recreate file.

2009-10-21 Thread Eric Osgood

Andrzej,

I updated nutch to the trunk last night and I split up a crawl of 1.6M  
into 4 chunks of 400K using the updated generator. However, the first  
crawl of 400K crashed last night with some new errors I have never  
seen before:


org.apache.hadoop.fs.ChecksumException: Checksum Error
java.io.IOException: Could not obtain block: blk_-8206810763586975866_5190 file=/user/hadoop/crawl/segments/20091020170107/crawl_generate/part-9
Do you know why I would be getting these errors? I had a lost tracker  
error also - could these problems be related?


Thanks,

Eric


On Oct 20, 2009, at 2:13 PM, Andrzej Bialecki wrote:


Eric Osgood wrote:
This is the error I keep getting whenever I try to fetch more than  
400K files at a time using a 4 node hadoop cluster running nutch 1.0.
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index
for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201
because current leaseholder is trying to recreate file.


Please see this issue:

https://issues.apache.org/jira/browse/NUTCH-692

Apply the patch that is attached there, rebuild Nutch, and tell me  
if this fixes your problem.


(the patch will be applied to trunk anyway, since others confirmed  
that it fixes this issue).


Can anybody shed some light on this issue? I was under the  
impression that 400K was small potatoes for a nutch hadoop combo?


It is. This problem is rare - I think I crawled cumulatively ~500mln  
pages in various configs and it didn't occur to me personally. It  
requires a few things to go wrong (see the issue comments).



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Eric Osgood
This is the error I keep getting whenever I try to fetch more than  
400K files at a time using a 4 node hadoop cluster running nutch 1.0.


org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index
for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201
because current leaseholder is trying to recreate file.


Can anybody shed some light on this issue? I was under the impression  
that 400K was small potatoes for a nutch hadoop combo?


Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Eric Osgood

Andrzej,

I just downloaded the most recent trunk from svn as per your  
recommendations for fixing the generate bug. As soon as I have it all  
rebuilt with my configs I will let you know how a crawl of ~1.6mln  
pages goes. Hopefully no errors!


Thanks,

Eric

On Oct 20, 2009, at 2:13 PM, Andrzej Bialecki wrote:


Eric Osgood wrote:
This is the error I keep getting whenever I try to fetch more than  
400K files at a time using a 4 node hadoop cluster running nutch 1.0.
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index
for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201
because current leaseholder is trying to recreate file.


Please see this issue:

https://issues.apache.org/jira/browse/NUTCH-692

Apply the patch that is attached there, rebuild Nutch, and tell me  
if this fixes your problem.


(the patch will be applied to trunk anyway, since others confirmed  
that it fixes this issue).


Can anybody shed some light on this issue? I was under the  
impression that 400K was small potatoes for a nutch hadoop combo?


It is. This problem is rare - I think I crawled cumulatively ~500mln  
pages in various configs and it didn't occur to me personally. It  
requires a few things to go wrong (see the issue comments).



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Dynamic Html Parsing

2009-10-15 Thread Eric Osgood
Is there a way to enable Dynamic Html parsing in Nutch using a plugin  
or setting?


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Incremental Whole Web Crawling

2009-10-13 Thread Eric Osgood

Andrzej,

Where do I get the nightly builds from? I tried to use the eclipse  
plugin that supports svn to no avail. Is there an ftp or http server  
where I can download the nutch source fresh?


Thanks,

Eric

On Oct 11, 2009, at 12:40 PM, Andrzej Bialecki wrote:


Eric Osgood wrote:
When I set generate.update.db to true and then run generate, it  
only runs twice and generates 100K for the 1st gen, 62.5K for the  
second gen and 0 for the 3rd gen on a seed list of 1.6M. I don't  
understand this, for a topN of 100K it should run 16 times and  
create 16 distinct lists if I am not mistaken.


There was a bug in this code that I fixed recently - please get a  
new nightly build and try it again.



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Incremental Whole Web Crawling

2009-10-13 Thread Eric Osgood
Ok, I think I am on the right track now, but just to be sure: the code  
I want is the branch section of svn under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ - correct?


Thanks,

Eric


On Oct 13, 2009, at 1:38 PM, Andrzej Bialecki wrote:


Eric Osgood wrote:

Andrzej,
Where do I get the nightly builds from? I tried to use the eclipse  
plugin that supports svn to no avail. Is there an ftp or http server  
where I can download the nutch source fresh?


Personally I prefer to use a command-line svn, even though I do  
development in Eclipse - I'm probably old-fashioned but I always  
want to be very clear on what's going on when I do an update.


See the instructions here:

http://lucene.apache.org/nutch/version_control.html


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Incremental Whole Web Crawling

2009-10-13 Thread Eric Osgood

Oh, ok.

You learn something new every day! I didn't know that the trunk was the  
most recent build. Good to know! So this current trunk does have a fix  
for the generator bug?



On Oct 13, 2009, at 2:05 PM, Andrzej Bialecki wrote:


Eric Osgood wrote:

So the trunk contains the most recent nightly update?


It's the other way around - nightly build is created from a snapshot  
of the trunk. The trunk is always the most recent.



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Incremental Whole Web Crawling

2009-10-11 Thread Eric Osgood
When I set generate.update.db to true and then run generate, it only  
runs twice and generates 100K for the 1st gen, 62.5K for the second  
gen and 0 for the 3rd gen on a seed list of 1.6M. I don't understand  
this, for a topN of 100K it should run 16 times and create 16 distinct  
lists if I am not mistaken.


Eric


On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:


Hey,

 Never mind. I got *generate.update.db* in *nutch-default.xml* and  
 set it to true.

Regards,
Gaurang

2009/10/5 Gaurang Patel gaurangtpa...@gmail.com


Hey Andrzej,

Can you tell me where to set this property (generate.update.db)? I am
trying to run similar kind of crawl scenario that Eric is running.

-Gaurang

2009/10/5 Andrzej Bialecki a...@getopt.org

Eric wrote:



Andrzej,

Just to make sure I have this straight, set the generate.update.db
property to true then

bin/nutch generate crawl/crawldb crawl/segments -topN 10: 16  
times?




Yes. When this property is set to true, then each fetchlist will be
different, because the records for those pages that are already on  
another
fetchlist will be temporarily locked. Please note that this lock  
holds only

for 1 week, so you need to fetch all segments within one week from
generating them.

You can fetch and updatedb in arbitrary order, so once you fetched  
some
segments you can run the parsing and updatedb just from these  
segments,

without waiting for all 16 segments to be processed.



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: generate/fetch using multiple machines

2009-10-06 Thread Eric
yes, using a hadoop cluster. I would recommend the tutorial called  
NutchHadoopTutorial on the wiki.

On Oct 6, 2009, at 8:56 AM, Gaurang Patel wrote:


All-

Idea on how to configure nutch to generate/fetch on multiple machines
simultaneously?

-Gaurang




Hadoop Script

2009-10-06 Thread Eric
Has anyone written a script for whole web crawling using Hadoop? The  
script for nutch doesn't work since the data is inside the HDFS (tail -f  
won't work with this).


Thanks,

Eric


Re: Hadoop Script

2009-10-06 Thread Eric Osgood

Sorry Ryan,

I should have clarified that I am using Nutch as my crawler. There is  
a script for Nutch to do Whole web crawling, but it is not compatible  
with Hadoop.



Eric Osgood
-
Cal Poly - Computer Engineering
Moon Valley Software
-
eosg...@calpoly.edu
e...@lakemeadonline.com
-
www.calpoly.edu/eosgood
www.lakemeadonline.com

On Oct 6, 2009, at 12:24 PM, Ryan Smith wrote:


This isnt a script per-se but this may help.

http://code.google.com/p/hbase-writer

It's a plugin for the heritrix2 web crawler to write crawled site data to  
hbase
tables, which run on hadoop.  Each url is written as a rowkey in the  
hbase

table.

HTH,
-Ryan

On Tue, Oct 6, 2009 at 3:02 PM, Eric e...@lakemeadonline.com wrote:

Has anyone written a script for whole web crawling using Hadoop? The script
for nutch doesn't work since the data is inside the HDFS (tail -f won't work
with this).

Thanks,

Eric





Targeting Specific Links

2009-10-06 Thread Eric Osgood
Is there a way to inspect the list of links that nutch finds per page  
and then at that point choose which links I want to include / exclude?  
That would be the ideal remedy to my problem.


Eric Osgood
-
Cal Poly - Computer Engineering
Moon Valley Software
-
eosg...@calpoly.edu
e...@lakemeadonline.com
-
www.calpoly.edu/eosgood
www.lakemeadonline.com



Re: Targeting Specific Links

2009-10-06 Thread Eric Osgood

Andrzej,

How would I check for a flag during fetch?

Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page, but  
still needing a total of X links per page: if I find the links I want,  
I add them to the list up until X; if I don't reach X, I add other  
links until X is reached. This way, I don't waste crawl time on  
non-relevant links.


Thanks,

Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/eosgood, www.lakemeadonline.com


On Oct 6, 2009, at 1:04 PM, Andrzej Bialecki wrote:


Eric Osgood wrote:
Is there a way to inspect the list of links that nutch finds per  
page and then at that point choose which links I want to include /  
exclude? That would be the ideal remedy to my problem.


Yes, look at ParseOutputFormat, you can make this decision there.  
There are two standard etension points where you can hook up -  
URLFilters and ScoringFilters.


Please note that if you use URLFilters to filter out URL-s too early  
then they will be rediscovered again and again. A better method to  
handle this, but also more complicated, is to still include such  
links but give them a special flag (in metadata) that prevents  
fetching. This requires that you implement a custom scoring plugin.



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com







Targeting Specific Links for Crawling

2009-10-05 Thread Eric
Does anyone know if it is possible to target only certain links for  
crawling dynamically during a crawl? My goal would be to write a  
plugin for this functionality but I don't know where to start.


Thanks,

EO


Incremental Whole Web Crawling

2009-10-05 Thread Eric
My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can  
crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's  
then crawl the links generated from the TLD's in increments of 100K?


Thanks,

EO


Re: Targeting Specific Links for Crawling

2009-10-05 Thread Eric

Adam,

Yes, I have a list of strings I would look for in the link. My plan is  
to look for X number of links on the site - first looking for the  
links I want and, if they exist, adding them; if they don't exist, adding X  
links from the site. I am planning to start in the URL Filter plugin.


Eric
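
For the simple keep/drop part of that plan, a minimal sketch of a URLFilter plugin (plugin descriptor and build wiring omitted; the list of strings is hypothetical). Note that the per-page quota ("add other links up to X") cannot be expressed here, since a URLFilter sees one URL at a time - that is what the ScoringFilter discussion elsewhere in this thread addresses:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class ContainsStringURLFilter implements URLFilter {
  // hypothetical list of substrings that mark a link as interesting
  private static final String[] WANTED = { "/careers", "/jobs" };
  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  // Returning the URL keeps it; returning null drops it.
  public String filter(String urlString) {
    for (String s : WANTED) {
      if (urlString.contains(s)) {
        return urlString;
      }
    }
    return null;
  }
}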

On Oct 5, 2009, at 12:58 PM, BELLINI ADAM wrote:





How do you target certain links? Do you know how the links are made,  
i.e. their format?
You can just set a regular expression to accept only those kinds of  
links.





Date: Mon, 5 Oct 2009 21:39:52 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Targeting Specific Links for Crawling

Eric wrote:

Does anyone know if it is possible to target only certain links for
crawling dynamically during a crawl? My goal would be to write a plugin
for this functionality but I don't know where to start.


URLFilter plugins may be what you want.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com







Re: indexing just certain content

2009-10-05 Thread Eric

Adam,

You could turn off all the indexing plugins and write your own plugin  
that only indexes certain meta content from your intranet - giving you  
complete control of the fields indexed.


Eric
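
As a sketch of what such a plugin could look like, assuming the Nutch 1.0 indexing API from memory (IndexingFilter / NutchDocument; verify the signatures against your version). With the stock index-basic/index-more plugins disabled, only the fields added here end up in the index; the "description" meta field is just an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class MetaOnlyIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // index only the title and one chosen meta tag, skipping the page body
    doc.add("title", parse.getData().getTitle());
    String desc = parse.getData().getMeta("description");
    if (desc != null) {
      doc.add("description", desc);
    }
    return doc;
  }

  // Nutch 1.0 (from memory) also declares this hook for registering Lucene
  // field options; a no-op is enough for this sketch.
  public void addIndexBackendOptions(Configuration conf) {
  }
}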

On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:



hi

Does anybody know if it's possible to index just certain content? I  
mean, I need to not index some garbage and repetitive data on my  
intranet.


In other words, is it possible to tell the indexer not to index the  
content between certain div tags,

like:

<div id="bla bla">


plz dont index this  bla  bla bla

</div>

thx to all





Re: Incremental Whole Web Crawling

2009-10-05 Thread Eric

Andrzej,

Just to make sure I have this straight, set the generate.update.db  
property to true then


bin/nutch generate crawl/crawldb crawl/segments -topN 10: 16 times?

Thanks,

Eric

On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote:


Eric wrote:
My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I  
can crawl it in increments of 100K? e.g. crawl 100K 16 times for  
the TLD's then crawl the links generated from the TLD's in  
increments of 100K?


Yes. Make sure that you have the generate.update.db property set  
to true, and then generate 16 segments each having 100k urls. After  
you finish generating them, then you can start fetching.


Similarly, you can do the same for the next level, only you will  
have to generate more segments.


This could be done much simpler with a modified Generator that  
outputs multiple segments from one job, but it's not implemented yet.
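
A sketch of the resulting command sequence under those assumptions (generate.update.db set to true in nutch-site.xml; the topN value and segment paths simply mirror the numbers discussed above):

# generate 16 non-overlapping fetchlists of 100k URLs each; the URLs placed
# on each list are locked for about a week, so fetch them within that window
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
done

# then, for each generated segment (oldest first), something like:
#   bin/nutch fetch crawl/segments/<segment>
#   bin/nutch updatedb crawl/crawldb crawl/segments/<segment>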


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Original tags, attribute defs, multiword tokens, how is this done.

2009-03-17 Thread Eric J. Christeson


On Mar 17, 2009, at 9:04 AM, Lukas, Ray wrote:


Question Four (I will start hunting for this):
Last one, promise. The indexes themselves: is there an explanation
written up for each of the fields in the index?



http://wiki.apache.org/nutch/IndexStructure
is the closest thing I've found apart from reading the code.

Eric

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure(701) 231-8693 (Voice)
North Dakota State University, Fargo, North Dakota, USA





Re: Original tags, attribute defs, multiword tokens, how is this done.

2009-03-17 Thread Eric J. Christeson


On Mar 17, 2009, at 9:04 AM, Lukas, Ray wrote:


Question Four (I will start hunting for this):
Last one, promise. The indexes themselves: is there an explanation
written up for each of the fields in the index?



http://wiki.apache.org/nutch/IndexStructure
is the closest thing I've found apart from reading the code.

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure(701) 231-8693 (Voice)
North Dakota State University





Re: Index Disaster Recovery

2009-03-17 Thread Eric J. Christeson


On Mar 16, 2009, at 7:55 PM, Otis Gospodnetic wrote:



Eric,

There are a couple of ways you can back up a Lucene index built by  
Solr:


1) have a look at the Solr replication scripts, specifically  
snapshooter.  This script creates a snapshot of an index.  It's  
typically triggered by Solr after its commit or optimize calls,  
when the index is stable and not being modified.  If you use  
snapshooter to create index snapshots, you could simply grab a  
snapshot and there is your backup.


2) have a look at Solr's new replication mechanism (info on the  
Solr Wiki), which does something similar to the above, but without  
relying on replication (shell) scripts.  It does everything via HTTP.


In my 10 years of using Lucene and N years of using Solr and Nutch  
I've never had index corruption.  Nowadays Lucene even has  
transactions, so it's much harder (theoretically impossible) to  
corrupt the index.


Thank you for the information.  I happened to read about snapshooter  
about 10 minutes after I sent that message, but didn't know about  
replication.  It inspires confidence that you haven't experienced  
index corruption in your years of using this technology.


Eric

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure(701) 231-8693 (Voice)
North Dakota State University



Index Disaster Recovery

2009-03-13 Thread Eric J. Christeson
What do people do when 'something goes wrong' with a crawl?
First some background; We are a small-ish university using nutch to
crawl 60,000 - 100,000 pages across 50 or so domains.  This probably
puts us in a different category than most nutch users.  Our crawl cycle
consists of a script to crawl everything, one domain at a time, each
Sunday and run search across all the indexes (one per domain).  Our
original reason for this was that merging was taking too long, but this
also keeps one bad index (or a crawl with bad results) from destroying
everything.  Maybe we're worrying about nothing since we haven't had any
problems in almost a year of production use (knock on wood) and I don't
know how often indexes 'blow up'.  We also move the previous week's
indexes out of the way before replacing them so we have a backup if
something happens.
We have been moving things to a CMS and want to move to a system where
pages are indexed as they are edited, while still being able to crawl
things that don't fit in CMS.  This would be a big incentive for most of
our people to use CMS.  The solr back end looks promising, but I'm not
sure how to implement a recovery plan with solr.  Any thoughts or
experience with backing up solr indexes?  Is it as simple as moving the
index like we do with nutch indexes?

Thanks,
Eric
--
-- 
Eric J. Christeson eric.christe...@ndsu.edu
Enterprise Computing and Infrastructure
Phone: (701) 231-8693
North Dakota State University, Fargo, North Dakota, USA





Re: How to use versions from the trunk

2009-03-05 Thread Eric J. Christeson


On Mar 5, 2009, at 4:12 PM, Jim Van Sciver wrote:


Hello all, newbie question alert.

I have been using Nutch 0.9 and want to try out versions on the trunk.
To do this I have installed the latest build, #743 March 5 2009, and
made the small number of necessary changes for a trial run:
  - add my organization name to conf/nutch-default.xml file
  - add single domain name to conf/crawl-urlfilter.txt (this is an
intranet crawl)
  - create a single entry urls.txt file

Then I run the nutch crawl command with depth 2 (first trial run) and
get a stack dump.  (See below).

I feel that I've missed something in the installation.  Could someone
tell me what?  Many thanks.

-- 
-

Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad
version number in .class file
   at java.lang.ClassLoader.defineClass1(Native Method)
   at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
   at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
   at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
   at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)


You need to be using Java 6.  Hadoop 0.19 requires it.

Eric

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure(701) 231-8693 (Voice)
North Dakota State University



Re: what is needed to index for about 10000 domains

2009-03-04 Thread Eric J. Christeson


On Mar 3, 2009, at 10:32 PM, Jasper Kamperman wrote:

There is a way to tell nutch to look at only the beginning of a  
file, it's this section in your config.xml:


<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

this is from the nutch-default.xml in 0.9, don't know whether it  
has changed in 1.0 .


This might also depend upon what type of files you're trying to  
index.  We ended up using -1 for unlimited after running into some  
15MB pdf files.  The pdf parser would barf if it didn't get the whole  
file.  This was with 0.9, don't know if 1.0 includes


Eric

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure(701) 231-8693 (Voice)
North Dakota State University





Re: AW: Does not locate my urls or filter problem.

2009-02-26 Thread Eric J. Christeson
Koch Martina wrote:
 Please check your nutch-site.xml. If the property urlfilter.regex.file 
 there points to another file than your crawl-urlfilter.txt this setting 
 takes precedence.
 You can also disable the urlfilter-regex plugin by removing it from the 
 plugin.includes property of nutch-site.xml and check if your crawl starts 
 fetching URLs.
 
 Kind regards,
 Martina

In nutch-0.9, nutch-default.xml has urlfilter.regex.file set to
regex-urlfilter.txt

Thanks,
Eric





Re: Build #722 won't start on Mac OS X, 10.4.11

2009-02-15 Thread Eric Christeson


On Feb 14, 2009, at 20:16, David M. Cole wrote:


Hiya:

Brand new to Nutch. Was able to get it to work with Tomcat on a Mac  
OS X PPC machine, 450MHz, dual processor, running OS X 10.4.11 with  
the latest version of Java (1.5.0_16-132). Indexed great, am able  
to search via OpenSearch option with zero problems.


Unfortunately, I need HTTP authorization (basic, not digest or  
NTLM) for the site I'm trying to index.


I downloaded nightly build #722 the other day, added the  
credentials info into 'conf/httpclient-auth.xml' and have not been  
 able to get it to launch -- I receive the error "Bad version number  
 in .class file" on the command line when I run a crawl command.


Later versions of nutch-dev use Hadoop 0.19, which requires Java 1.6;  
it uses some features introduced in 1.6. If you ask Google about  
the 'Bad version number' error you'll find that it refers to cases exactly  
like this, where a library needs a (usually) newer JVM.


Eric

--
Eric J. Christeson   eric.christe...@ndsu.edu
Enterprise Computing and Infrastructure  (701) 231-8693
North Dakota State University, Fargo, North Dakota


Re: Crawler not fetching all the links

2009-01-14 Thread Eric J. Christeson


On Jan 14, 2009, at 12:44 PM, ahammad wrote:



Hello,

I'm still unable to find why Nutch is unable to fetch and index all  
the
links that are on the page. To recap, the Nutch urls file contains  
a link to
a jhtml file that contains roughly 2000 links, all hosted on the  
same server

in the same folder.

Previously, I only got 111 links when I crawl. This was due to this:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>


You may also want to change this one:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

Eric
--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure(701) 231-8693 (Voice)
North Dakota State University




Incremental indexing

2008-11-21 Thread Eric C
How is it done?

For now, what I do is merge 2 crawls into a new one:

bin/nutch mergedb crawl/crawldb crawl1/crawldb/ crawl2/crawldb/

Is that the only solution?


Re: Can I update my search engine without restarting tomcat?

2008-06-19 Thread Eric J. Christeson


On Jun 19, 2008, at 1:20 PM, John Thompson wrote:

You can set up an account in Tomcat Manager if you don't have one  
already.
The Manager lets you go in and independently start/stop/reload any  
of the
different webapps you have running. This is exactly how I get new  
Nutch

crawls/indexes to be active on our production server.



Won't restarting the webapp cause Tomcat to serve up error pages to  
users

who are trying to connect to the webapp at that moment?


We set up our own servlet which uses a NutchBean in a Singleton  
pattern.  It runs an update thread which periodically checks a known  
file which contains a directory name.  When loading a new search db,  
we copy the directory where we want it (we use a directory with a  
date in the name) and edit the file to point to the new dir.  A  
client never has problems related to unavailability because they  
either get the old NutchBean, referencing the old dir, or the new  
NutchBean, referencing the new dir.  We keep the old ones around for  
a period of time (default 5 minutes) in case anyone has a search page  
open and wants previous/next page of results.
We haven't run into any problems with this setup.  If anyone wants  
more information, let me know.
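
A minimal sketch of that pattern, under the assumption that NutchBean in Nutch 0.9/1.0 can be constructed with a Configuration and an index directory Path (check your version); the pointer-file location and polling interval are illustrative, and the 5-minute retirement of old beans is left out for brevity:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public class SearchBeanHolder {
  private static final SearchBeanHolder INSTANCE = new SearchBeanHolder();
  private final AtomicReference<NutchBean> current = new AtomicReference<NutchBean>();
  private volatile String currentDir = "";

  public static SearchBeanHolder get() { return INSTANCE; }

  // Servlets call this on every request and always see a consistent bean.
  public NutchBean bean() { return current.get(); }

  private SearchBeanHolder() {
    Thread updater = new Thread(new Runnable() {
      public void run() {
        while (true) {
          try {
            // the pointer file holds the name of the directory with the live index
            String dir = readPointerFile("/data/nutch/current-index.txt");
            if (dir.length() > 0 && !dir.equals(currentDir)) {
              Configuration conf = NutchConfiguration.create();
              current.set(new NutchBean(conf, new Path(dir)));
              currentDir = dir;
              // the old bean simply becomes unreferenced; the setup described
              // above keeps it around ~5 minutes so open result pages still work
            }
            Thread.sleep(60 * 1000L);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }
    });
    updater.setDaemon(true);
    updater.start();
  }

  private String readPointerFile(String path) throws IOException {
    BufferedReader r = new BufferedReader(new FileReader(path));
    try {
      String line = r.readLine();
      return line == null ? "" : line.trim();
    } finally {
      r.close();
    }
  }
}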


--
Eric J. Christeson  
[EMAIL PROTECTED]

Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs  
which

are copies of the communication structures of these organizations.  (For
example, if you have four groups working on a compiler, you'll get a
4-pass compiler) - Conway's Law






Re: two questions about nutch url filter when inject

2008-06-18 Thread Eric J. Christeson


On Jun 18, 2008, at 9:38 AM, beansproud wrote:




 Second, when I changed this file, the output of nutch didn't show any
 change. And when I recompiled, it changed. This took me 3 hours; can anybody
 tell me why?


What changes did you make to the file, and what specifically changed  
when you recompiled?


eric

--
Eric J. Christeson  
[EMAIL PROTECTED]

Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs  
which

are copies of the communication structures of these organizations.  (For
example, if you have four groups working on a compiler, you'll get a
4-pass compiler) - Conway's Law





Re: Field phrases

2008-06-09 Thread Eric J. Christeson


On Jun 8, 2008, at 12:31 PM, Aldarris wrote:



Hello:

How can I search for a field phrase?

A content phrase is described in the help files -- one just wraps the words
in quotes. But what if one wants to search for a url, for example,
abc.go.com/folder?


Apparently, I need to search for abc, go, com and folder next to each other.

What is the proper syntax?

url:abc url:folder produces quite a few wrong results.


It should work with url:"abc go com folder"

eric

--
Eric J. Christeson  
[EMAIL PROTECTED]

Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs  
which

are copies of the communication structures of these organizations.  (For
example, if you have four groups working on a compiler, you'll get a
4-pass compiler) - Conway's Law





Re: Ignoring robots.txt

2008-05-27 Thread Eric J. Christeson


On May 27, 2008, at 11:42 AM, Vijay Krishnan wrote:


Hi all,

 Do you know what file in Nutch parses robots.txt? If there is no
option in Nutch to ignore robots.txt, I would at least like to modify
the source.


Thanks,
Vijay



src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java


parses robots.txt


src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java


parses robot rules from html documents.



eric

Eric J. Christeson  [EMAIL PROTECTED]
Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs which
are copies of the communication structures of these organizations.  (For
example, if you have four groups working on a compiler, you'll get a
4-pass compiler) - Conway's Law





Re: Problems with indexing sub-section of a site

2008-05-24 Thread Eric J. Christeson
On Thu, May 22, 2008 at 07:46:16PM -0700, foobar3001 wrote:
 
 Hello!
 
 In short:
 
 Is it possible to tell Nutch to follow the links through one larger name
 space, but only index (add to its database) the content of links that are in
 a sub-name space of that?
 
 The background:
 
 I have started to experiment with crawling my blog with Nutch. The problem
 is that this blog doesn't have its own domain. Instead, it is hosted on a
 larger site, which also hosts discussion forums and other people's blogs.
 
 My URL there is http://www.geekzone.co.nz/foobar, so naturally I thought
 that adding something in the crawl-urlfilter.txt file would help. Something
 like this:
 
   +^http://([a-z0-9]*\.)*geekzone.co.nz/foobar
 
 But look at the bottom of that page: The navigation links to the other pages
 in my blog - or to 'next' page - actually lead out of my namespace. Thus,
 they are not being picked up anymore and Nutch never sees the additional
 links that I have on those other pages.
 
 Since eventually I would like this to be a bit more generic (I don't want
 anything specific for my blog, that's just a test case), I thought that
 maybe I have to open it up to the root URL, making the filter something like
 this:
 
   +^http://([a-z0-9]*\.)*geekzone.co.nz
 
 But then it picks up a ton of other stuff that I am not interested to have
 in my database.
 
 So, now I'm wondering whether it is possible to tell Nutch to follow links
 through one namespace, but only add those pages into its index database that
 are in a specific sub-namespace of the first one?

Did a quick scan of the page in question, and I noticed the urls are of
this form:
http://www.geekzone.co.nz/blog.asp?blogid=207

Could you filter like 

+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207

You'll have to comment out the default ? killer or put this rule before
it.
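
(For concreteness, the relevant bit of crawl-urlfilter.txt would look roughly
like the sketch below -- rules are applied top-down and the first match wins,
so the blog rule has to sit above the stock "skip probable queries" line,
which is the "? killer" mentioned above. Double-check the exact lines in your
own copy.)

  # accept foobar's blog pages, query string and all
  +^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207

  # stock rule that skips URLs containing certain characters as probable
  # queries -- leave it below the blog rule, or comment it out entirely
  -[?*!@=]

  # skip everything else
  -.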

Maybe there's something I'm missing, though.

Eric

-- 
Eric J. Christeson  [EMAIL PROTECTED]
Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building  
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs which
are copies of the communication structures of these organizations.  (For
example, if you have four groups working on a compiler, you'll get a
4-pass compiler) - Conway's Law


Re: Error: Generator: 0 records selected for fetching, exiting ...

2008-05-21 Thread Eric J. Christeson


On May 21, 2008, at 7:22 AM, Abhijit Bera wrote:


I totally give up.

I tried whole web crawling and still I get the above error! :(



Have you tried it with an empty crawl directory?  If those pages had  
been successfully crawled before and are in your crawldb, the fetch  
interval probably hasn't passed yet.
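
(The interval in question is the db fetch interval set in nutch-site.xml --
roughly 30 days out of the box. Depending on the release it's
db.default.fetch.interval (in days) or db.fetch.interval.default (in
seconds); check your nutch-default.xml for the one your version uses. A
sketch for re-fetching sooner while testing:)

<!-- nutch-site.xml (sketch): re-fetch pages after 1 day instead of the
     stock ~30 days.  Older releases use db.default.fetch.interval (days);
     newer ones use db.fetch.interval.default (seconds). -->
<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>Default re-fetch interval, in days.</description>
</property>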


eric

Eric J. Christeson  [EMAIL PROTECTED]
Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164

Organizations which design systems are constrained to produce designs which
are copies of the communication structures of these organizations.  (For
example, if you have four groups working on a compiler, you'll get a
4-pass compiler) - Conway's Law





Null pointer error when perform search

2006-07-21 Thread Eric Wu

Hi,

I am new to Nutch and I got a null pointer exception when I try to submit a
search through the demo app.
Please see the error message below. I have modified the demo app to run in
its own webapp context rather than in the ROOT context.
The first page shows up, but when I put in a keyword to search I get the error.
Did I do something wrong? Please help. Thanks.

- Eric



type: Exception report

message:

description: The server encountered an internal error () that prevented it
from fulfilling this request.

exception:

org.apache.jasper.JasperException
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:370)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

root cause:

java.lang.NullPointerException
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:96)
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:112)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

note: The full stack trace of the root cause is available in the Apache
Tomcat/5.5.9 logs.


java.util.MissingResourceException on resin

2006-05-25 Thread eric park

hello, I installed Nutch on Resin 3.0. When I open search.jsp, I get the
error below.

search.jsp line 116 is - <i18n:bundle baseName="org.nutch.jsp.search" />

any thoughts?
Thank you


java.util.MissingResourceException: Can't find bundle for base name
org.nutch.jsp.search, locale ko_KR
at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:837)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:727)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:550)
at org.apache.taglibs.i18n.BundleTag.findBundle(BundleTag.java:309)
at org.apache.taglibs.i18n.BundleTag.doStartTag(BundleTag.java:333)
at _jsp._search__jsp._jspService(search.jsp:116)
at com.caucho.jsp.JavaPage.service(JavaPage.java:63)
at com.caucho.jsp.Page.pageservice(Page.java:570)
at com.caucho.server.dispatch.PageFilterChain.doFilter(PageFilterChain.java:159)
at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:163)
at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:208)
at com.caucho.server.http.HttpRequest.handleRequest(HttpRequest.java:259)
at com.caucho.server.port.TcpConnection.run(TcpConnection.java:341)
at com.caucho.util.ThreadPool.runTasks(ThreadPool.java:490)
at com.caucho.util.ThreadPool.run(ThreadPool.java:423)
at java.lang.Thread.run(Thread.java:595)


Nutch0.6 and Nutch 0.7 crawlers

2006-04-12 Thread eric park
hello. I tried to crawl a certain site using both nutch 0.6 and nutch 0.7,
just to compare how they are different.

However I get fewer urls crawled using nutch 0.7 than nutch 0.6.  I'll paste
the 2 different log files below.



As you can see below, both 0.6 and 0.7 fetch the same number of urls at the
first depth, but at the second depth nutch 0.7 fetches only 15 urls while
nutch 0.6 fetches 34 urls.  Of course, the configuration and settings
are the same.



Can you tell me why I get these results?

Thank you



Eric Park





log for Nutch 0.6 crawler



--



060328 182513 logging at INFO
060328 182513 fetching http://www.qmind.co.kr/sub4/sub4_1.htm
060328 182513 fetching http://www.qmind.co.kr/sub1/sub1_3.htm
060328 182513 fetching http://www.qmind.co.kr/sub1/sub1_5.htm
060328 182513 fetching http://www.qmind.co.kr/sub3/sub3_21.htm
060328 182513 fetching http://www.qmind.co.kr/sub1/sub1_1.htm
060328 182513 fetching http://www.qmind.co.kr/sub1/sub1_4.htm
060328 182513 fetching http://www.qmind.co.kr/sub5/sub5_4.htm
060328 182513 fetching http://www.qmind.co.kr/sub2/sub2_21.htm
060328 182513 fetching http://www.qmind.co.kr/sub6/sub6_2.htm
060328 182513 fetching http://qmind.co.kr/
060328 182513 fetching http://www.qmind.co.kr/sub3/sub3_11.htm
060328 182514 fetching http://www.qmind.co.kr/sub2/sub2_11.htm
060328 182515 fetching http://www.qmind.co.kr/sub6/sub6_1.htm
060328 182516 fetching http://www.qmind.co.kr/sub1/sub1_2.htm
060328 182517 fetching http://www.qmind.co.kr/sub5/sub5_1.htm
060328 182528 status: segment 20060328182513, 15 pages, 0 errors, 271658 bytes, 15149 ms
060328 182528 status: 0.99016434 pages/s, 140.09691 kb/s, 18110.533 bytes/page
060328 182529 Updating /usr/local/nutch-0.7/crawl_folder/crawl.qmind2/db
060328 182529 Updating for /usr/local/nutch-0.7/crawl_folder/crawl.qmind2/segments/20060328182513
060328 182529 Processing document 0
060328 182530 Finishing update
060328 182530 Processing pagesByURL: Sorted 437 instructions in 0.013 seconds.
060328 182530 Processing pagesByURL: Sorted 33615.38461538462 instructions/second
060328 182530 Processing pagesByURL: Merged to new DB containing 51 records in 0.014 seconds
060328 182530 Processing pagesByURL: Merged 3642.8571428571427 records/second
060328 182530 Processing pagesByMD5: Sorted 65 instructions in 0.0020 seconds.
060328 182530 Processing pagesByMD5: Sorted 32500.0 instructions/second
060328 182530 Processing pagesByMD5: Merged to new DB containing 51 records in 0.0040 seconds
060328 182530 Processing pagesByMD5: Merged 12750.0 records/second
060328 182530 Processing linksByMD5: Sorted 437 instructions in 0.0060 seconds.
060328 182530 Processing linksByMD5: Sorted 72833.333 instructions/second
060328 182530 Processing linksByMD5: Merged to new DB containing 275 records in 0.018 seconds
060328 182530 Processing linksByMD5: Merged 15277.778 records/second
060328 182530 Processing linksByURL: Sorted 260 instructions in 0.0040 seconds.
060328 182530 Processing linksByURL: Sorted 65000.0 instructions/second
060328 182530 Processing linksByURL: Merged to new DB containing 275 records in 0.016 seconds
060328 182530 Processing linksByURL: Merged 17187.5 records/second
060328 182530 Processing linksByMD5: Sorted 274 instructions in 0.0040 seconds.
060328 182530 Processing linksByMD5: Sorted 68500.0 instructions/second
060328 182530 Processing linksByMD5: Merged to new DB containing 275 records in 0.0090 seconds
060328 182530 Processing linksByMD5: Merged 30555.556 records/second
060328 182530 Update finished
060328 182530 FetchListTool started
060328 182530 Processing pagesByURL: Sorted 35 instructions in 0.0020 seconds.
060328 182530 Processing pagesByURL: Sorted 17500.0 instructions/second
060328 182530 Processing pagesByURL: Merged to new DB containing 51 records in 0.0010 seconds
060328 182530 Processing pagesByURL: Merged 51000.0 records/second
060328 182530 Processing pagesByMD5: Sorted 35 instructions in 0.0020 seconds.
060328 182530 Processing pagesByMD5: Sorted 17500.0 instructions/second
060328 182530 Processing pagesByMD5: Merged to new DB containing 51 records in 0.0020 seconds
060328 182530 Processing pagesByMD5: Merged 25500.0 records/second
060328 182530 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
060328 182530 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
060328 182530 Processing /usr/local/nutch-0.7/crawl_folder/crawl.qmind2/segments/20060328182530/fetchlist.unsorted: Sorted 35 entries in 0.0020 seconds.
060328 182530 Processing /usr/local/nutch-0.7/crawl_folder/crawl.qmind2/segments/20060328182530/fetchlist.unsorted: Sorted 17500.0 entries/second
060328 182530 Overall processing: Sorted 35 entries in 0.0020 seconds.
060328 182530 Overall processing: Sorted 5.714285714285714E-5 entries/second
060328 182530 FetchListTool completed
060328 182530 logging at INFO
060328 182530 fetching http://www.qmind.co.kr/sub2/sub2_14.htm
060328 182530 fetching http://www.qmind.co.kr/sub2/sub2_16.htm
060328

Re: Nutch0.6 and Nutch 0.7 crawlers

2006-04-12 Thread eric park
hello, the problem is they are not unwanted URLs.
I crawled the site 'www.qmind.co.kr'. I found that the Nutch 0.7 crawler
works just fine at the first depth. However, at the second depth it filters
out any links that start with 'www.qmind.co.kr' and only crawls urls starting
with 'qmind.co.kr'. I can't figure out why it filters out urls starting with
'www' at the second depth. Nutch 0.6 works just fine. Are there any known
bugs in the Nutch 0.7 crawler?

thank you,
Eric Park

2006/4/12, Andrzej Bialecki [EMAIL PROTECTED]:

 eric park wrote:
  hello. I tried to crawl a certain site using both nutch 0.6 and nutch
 0.7,
  just to compare how they are different.
 
  However I get less urls crawled using nutch0-7 than nutch0-6.   I'll
 paste 2
  different log files below.
 
 
 
  As you can see below, both 0.6 and 0.7 fetch same number of urls in
 first
  depth, but in second depth, nutch 0.7 fetches only 15 urls while
  nutch 0.6 fetches 34 urls.  Of course, the configuration and settings
  are same.
 

 IIRC (it was long ago...) the version 0.6 had a bug where unwanted URLs
 would slip through the URLFilters. This was tightened in 0.7. Please
 check that the URLs that are rejected in 0.7 are really valid URLs, i.e.
 that they should be accepted.
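
(One possible explanation consistent with that: if the crawl-urlfilter.txt
rule was written against the bare host, 0.6's bug may have let the
'www.qmind.co.kr' links slip through anyway, while 0.7 correctly rejects
them. A sketch of a rule that accepts both host forms, assuming the default
regex-based filter:)

  # accept qmind.co.kr with or without a subdomain prefix such as www.
  +^http://([a-z0-9]*\.)*qmind\.co\.kr/

  # skip everything else
  -.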

 --
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com