Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Eric Osgood
$BranchConn.match(Pattern.java:4078) at java.util.regex.Pattern$Ques.match(Pattern.java:3691) 2010-01-11 00:31:53,221 WARN io.UTF8 - truncating long string: 62492 chars, starting with java.lang.StackOverf Eric Osgood - Cal Poly - Computer Engineering

Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Eric Osgood
(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) Could not find the main class: index. Program will exit. Do you have to set the -Xss flag somewhere else? Thanks, Eric On Jan 11, 2010, at 8:36 AM, Godmar Back wrote: Very intriguing, considering that we

Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Eric Osgood
How do I set the bin/nutch stack size and the hadoop job stack size? --Eric On Jan 11, 2010, at 9:22 AM, Fuad Efendi wrote: Also, put it in Hadoop settings for tasks... http://www.tokenizer.ca/ -Original Message- From: Godmar Back [mailto:god...@gmail.com] Sent: January-11
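The "Hadoop settings for tasks" mentioned above refers to the child-JVM options of map/reduce tasks. A minimal sketch, assuming a Hadoop 0.20-era mapred-site.xml (the flag values are illustrative only, not recommendations):

```xml
<!-- mapred-site.xml: JVM flags passed to every map/reduce task child -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Xss2m</value>
</property>
```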

Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Eric Osgood
In the hadoop-env.sh, how do you add such options as -Xss, -Xms, -Xmx? --Eric On Jan 11, 2010, at 9:34 AM, Mischa Tuffield wrote: You can set it in hadoop-env.sh, and then run it. Or you could add it to your /etc/bashrc or the bashrc file of the user that runs hadoop. Mischa On 11 Jan
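Since hadoop-env.sh is plain shell, JVM flags for the Hadoop daemons go into environment variables rather than XML properties. A sketch with illustrative values (tune them to your hardware):

```shell
# hadoop-env.sh -- JVM settings for the Hadoop daemons.
# HADOOP_HEAPSIZE sets the max heap (-Xmx) in MB; any other flag,
# such as the stack size -Xss, rides along in HADOOP_OPTS.
export HADOOP_HEAPSIZE=1000
export HADOOP_OPTS="-Xss2m $HADOOP_OPTS"
```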

Re: ERROR: Too Many Fetch Failures

2009-11-20 Thread Eric Osgood
information - I have no idea how to fix this problem. Thanks, Eric On Nov 20, 2009, at 1:30 AM, Julien Nioche wrote: It was probably a one-off, network related problem. Can you tell us a bit more about your cluster configuration? 2009/11/19 Eric Osgood e...@lakemeadonline.com Julien, Thanks

ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood
attempt_200911191100_0001_m_29_1 2009-11-19 11:20:21,135 WARN mapred.TaskRunner - Parent died. Exiting attempt_200911191100_0001_r_04_1 Can Anyone tell me how to resolve this error? Thanks, Eric Osgood - Cal Poly - Computer Engineering, Moon Valley Software

Re: ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood
-related issue. 2009/11/19 Eric Osgood e...@lakemeadonline.com This is the first time I have received this error while crawling. During a crawl of 100K pages, one of the nodes had a task failed and cited Too Many Fetch Failures as the reason. The job completed successfully but took about 3 times

Re: ERROR: Too Many Fetch Failures

2009-11-19 Thread Eric Osgood
Julien, Another thought - I just installed tomcat and solr - would that interfere with hadoop? On Nov 19, 2009, at 2:41 PM, Eric Osgood wrote: Julien, Thanks for your help, how would I go about fixing this error now that it is diagnosed? On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote

HELP - ERROR: org.apache.hadoop.fs.ChecksumException: Checksum Error

2009-10-29 Thread Eric Osgood
fetching sometimes. Thanks for the help, Eric Osgood - Cal Poly - Computer Engineering, Moon Valley Software - eosg...@calpoly.edu, e...@lakemeadonline.com

ERROR: Checksum Error

2009-10-27 Thread Eric Osgood
deleting all my data nodes and formatting the namenode to no avail. Thanks, Eric Osgood - Cal Poly - Computer Engineering, Moon Valley Software - eosg...@calpoly.edu, e...@lakemeadonline.com

Re: Targeting Specific Links

2009-10-22 Thread Eric Osgood
in generatorSortValue()? I only see a way to check the score, not a flag. Thanks, Eric On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote: Eric Osgood wrote: Andrzej, How would I check for a flag during fetch? You would check for a flag during generation - please check

Scoring Filter Plugin

2009-10-22 Thread Eric Osgood
score to Float.MinValue, however it is still getting fetched. Is there another way to tell the fetcher not to fetch certain links based on their score? Thanks, Eric Osgood - Cal Poly - Computer Engineering, Moon Valley Software

Re: Targeting Specific Links

2009-10-22 Thread Eric Osgood
Also, In the scoring-links plugin, I set the return value for ScoringFilter.generatorSortValue() to Float.MinValue for all urls and it still fetched everything - maybe Float.MinValue isn't the correct value to set so a link never gets fetched? Thanks, Eric On Oct 22, 2009, at 1:10 PM

Re: ERROR: current leaseholder is trying to recreate file.

2009-10-21 Thread Eric Osgood
java.io.IOException: Could not obtain block: blk_-8206810763586975866_5190 file=/user/hadoop/crawl/segments/ 20091020170107/crawl_generate/part-9 Do you know why I would be getting these errors? I had a lost tracker error also - could these problems be related? Thanks, Eric On Oct 20

ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Eric Osgood
combo? Thanks, Eric Osgood - Cal Poly - Computer Engineering, Moon Valley Software - eosg...@calpoly.edu, e...@lakemeadonline.com - www.calpoly.edu/~eosgood

Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Eric Osgood
Andrzej, I just downloaded the most recent trunk from svn as per your recommendations for fixing the generate bug. As soon as I have it all rebuilt with my configs I will let you know how a crawl of ~1.6mln pages goes. Hopefully no errors! Thanks, Eric On Oct 20, 2009, at 2:13 PM, Andrzej

Dynamic Html Parsing

2009-10-15 Thread Eric Osgood
Is there a way to enable Dynamic Html parsing in Nutch using a plugin or setting? Eric Osgood - Cal Poly - Computer Engineering, Moon Valley Software - eosg...@calpoly.edu, e...@lakemeadonline.com

Re: Incremental Whole Web Crawling

2009-10-13 Thread Eric Osgood
Andrzej, Where do I get the nightly builds from? I tried to use the eclipse plugin that supports svn to no avail. Is there an ftp or http server where I can download the nutch source fresh? Thanks, Eric On Oct 11, 2009, at 12:40 PM, Andrzej Bialecki wrote: Eric Osgood wrote: When I set

Re: Incremental Whole Web Crawling

2009-10-13 Thread Eric Osgood
Ok, I think I am on the right track now, but just to be sure: the code I want is the branch section of svn under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct? Thanks, Eric On Oct 13, 2009, at 1:38 PM, Andrzej Bialecki wrote: Eric Osgood wrote

Re: Incremental Whole Web Crawling

2009-10-13 Thread Eric Osgood
Oh ok, you learn something new every day! I didn't know that the trunk was the most recent build. Good to know! So this current trunk does have a fix for the generator bug? On Oct 13, 2009, at 2:05 PM, Andrzej Bialecki wrote: Eric Osgood wrote: So the trunk contains the most recent

Re: Incremental Whole Web Crawling

2009-10-11 Thread Eric Osgood
not mistaken. Eric On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote: Hey, Never mind. I got *generate.update.db* in *nutch-default.xml* and set it true. Regards, Gaurang 2009/10/5 Gaurang Patel gaurangtpa...@gmail.com Hey Andrzej, Can you tell me where to set this property

Re: generate/fetch using multiple machines

2009-10-06 Thread Eric
Yes, using a Hadoop cluster. I would recommend the tutorial called NutchHadoopTutorial on the wiki. On Oct 6, 2009, at 8:56 AM, Gaurang Patel wrote: All- Any idea on how to configure nutch to generate/fetch on multiple machines simultaneously? -Gaurang

Hadoop Script

2009-10-06 Thread Eric
Has anyone written a script for whole web crawling using Hadoop? The script for nutch doesn't work since the data is inside HDFS (tail -f won't work with this). Thanks, Eric
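When the crawl data and logs live in HDFS, the local file tools the Nutch script relies on can be replaced with their `hadoop fs` equivalents. A sketch (the log path is hypothetical; adjust to wherever your job writes):

```shell
# tail -f cannot follow a file inside HDFS; use the fs subcommands instead.
# -tail prints the last 1KB of the file, -cat streams the whole thing:
bin/hadoop fs -tail /user/hadoop/crawl/logs/fetcher.log
bin/hadoop fs -cat /user/hadoop/crawl/logs/fetcher.log
```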

Re: Hadoop Script

2009-10-06 Thread Eric Osgood
Sorry Ryan, I should have clarified that I am using Nutch as my crawler. There is a script for Nutch to do Whole web crawling, but it is not compatible with Hadoop. Eric Osgood - Cal Poly - Computer Engineering Moon Valley Software

Targeting Specific Links

2009-10-06 Thread Eric Osgood
Is there a way to inspect the list of links that nutch finds per page and then at that point choose which links I want to include / exclude? that is the ideal remedy to my problem. Eric Osgood - Cal Poly - Computer Engineering Moon Valley Software

Re: Targeting Specific Links

2009-10-06 Thread Eric Osgood
add other links until X is reached. This way, I don't waste crawl time on non- relevant links. Thanks, Eric Osgood - Cal Poly - Computer Engineering, Moon Valley Software - eosg...@calpoly.edu, e

Targeting Specific Links for Crawling

2009-10-05 Thread Eric
Does anyone know if it is possible to target only certain links for crawling dynamically during a crawl? My goal would be to write a plugin for this functionality but I don't know where to start. Thanks, EO

Incremental Whole Web Crawling

2009-10-05 Thread Eric
My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLDs, then crawl the links generated from the TLDs in increments of 100K? Thanks, EO

Re: Targeting Specific Links for Crawling

2009-10-05 Thread Eric
Adam, Yes, I have a list of strings I would look for in the link. My plan is to look for X number of links on the site - First looking for the links I want and if they exist, add them, if they don't exist add X links from the site. I am planning to start in the URL Filter plugin. Eric

Re: indexing just certain content

2009-10-05 Thread Eric
Adam, You could turn off all the indexing plugins and write your own plugin that only indexes certain meta content from your intranet - giving you complete control of the fields indexed. Eric On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote: hi does anybody know if it's possible

Re: Incremental Whole Web Crawling

2009-10-05 Thread Eric
Andrzej, just to make sure I have this straight: set the generate.update.db property to true, then bin/nutch generate crawl/crawldb crawl/segments -topN 10: 16 times? Thanks, Eric On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote: Eric wrote: My plan is to crawl ~1.6M TLD's
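The incremental scheme discussed in this thread can be scripted as a simple loop: with generate.update.db set to true, each generate round marks the URLs it selected so the next round picks fresh ones. A minimal dry-run sketch — it only prints the commands; on a real cluster you would invoke bin/nutch directly (the 16 x 100K figures come from the thread):

```shell
# Print one generate command per round (dry run).
run_generate_rounds() {
  rounds=$1
  topn=$2
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "bin/nutch generate crawl/crawldb crawl/segments -topN $topn"
    i=$((i + 1))
  done
}

run_generate_rounds 16 100000
```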

Re: Original tags, attribute defs, multiword tokens, how is this done.

2009-03-17 Thread Eric J. Christeson
apart from reading the code. Eric -- Eric J. Christeson eric.christe...@ndsu.edu Enterprise Computing and Infrastructure (701) 231-8693 (Voice) North Dakota State University, Fargo, North Dakota, USA

Re: Index Disaster Recovery

2009-03-17 Thread Eric J. Christeson
On Mar 16, 2009, at 7:55 PM, Otis Gospodnetic wrote: Eric, There are a couple of ways you can back up a Lucene index built by Solr: 1) have a look at the Solr replication scripts, specifically snapshooter. This script creates a snapshot of an index. It's typically triggered by Solr

Index Disaster Recovery

2009-03-13 Thread Eric J. Christeson
or experience with backing up solr indexes? Is it as simple as moving the index like we do with nutch indexes? Thanks, Eric -- -- Eric J. Christeson eric.christe...@ndsu.edu Enterprise Computing and Infrastructure Phone: (701) 231-8693 North Dakota State University, Fargo, North Dakota

Re: How to use versions from the trunk

2009-03-05 Thread Eric J. Christeson
(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java: 268) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java: 319) You need to be using Java 6. Hadoop 0.19 requires it. Eric -- Eric J

Re: what is needed to index for about 10000 domains

2009-03-04 Thread Eric J. Christeson
. We ended up using -1 for unlimited after running into some 15MB pdf files. The pdf parser would barf if it didn't get the whole file. This was with 0.9, don't know if 1.0 includes Eric -- Eric J. Christeson eric.christe...@ndsu.edu Enterprise Computing

Re: AW: Does not locate my urls or filter problem.

2009-02-26 Thread Eric J. Christeson
-site.xml and check if your crawl starts fetching URLs. Kind regards, Martina In nutch-0.9, nutch-default.xml has urlfilter.regex.file set to regex-urlfilter.txt Thanks, Eric

Re: Build #722 won't start on Mac OS X, 10.4.11

2009-02-15 Thread Eric Christeson
number' you'll find that it refers to cases exactly like this where a library needs a (usually) newer jvm. Eric - -- Eric J. Christeson eric.christe...@ndsu.edu Enterprise Computing and Infrastructure (701) 231-8693 North Dakota State University, Fargo, North Dakota

Re: Crawler not fetching all the links

2009-01-14 Thread Eric J. Christeson
(=0), content longer than it will be truncated; otherwise, no truncation at all. /description /property Eric -- Eric J. Christeson eric.christe...@ndsu.edu Enterprise Computing and Infrastructure(701) 231-8693 (Voice) North Dakota State University

Incremental indexing

2008-11-21 Thread Eric C
How is it done? For now, what I do is merge 2 crawls into a new one: bin/nutch mergedb crawl/crawldb crawl1/crawldb/ crawl2/crawldb/ Is that the only solution?

Re: Can I update my search engine without restarting tomcat?

2008-06-19 Thread Eric J. Christeson
. If anyone wants more information, let me know. -- Eric J. Christeson [EMAIL PROTECTED] Information Technology Services (701) 231-8693 (Voice) Room 242C, IACC Building North Dakota State University, Fargo, ND 58105-5164 Organizations which design systems

Re: two questions about nutch url filter when inject

2008-06-18 Thread Eric J. Christeson
when you recompiled? eric -- Eric J. Christeson [EMAIL PROTECTED] Information Technology Services (701) 231-8693 (Voice) Room 242C, IACC Building North Dakota State University, Fargo, ND 58105-5164 Organizations which design systems are constrained

Re: Field phrases

2008-06-09 Thread Eric J. Christeson
for abc, go, com and folder next to each other. What is the proper syntax? url:abc url:folder produces quite a few wrong results. It should work with url:abc go com folder eric -- Eric J. Christeson [EMAIL PROTECTED] Information Technology Services

Re: Ignoring robots.txt

2008-05-27 Thread Eric J. Christeson
/api/ RobotRulesParser.java parses robots.txt src/plugin/parse-html/src/java/org/apache/nutch/parse/html/ HTMLMetaProcessor.java parses robot rules from html documents. eric Eric J. Christeson [EMAIL PROTECTED] Information Technology Services (701

Re: Problems with indexing sub-section of a site

2008-05-24 Thread Eric J. Christeson
]*\.)*geekzone.co.nz/blog.asp\?blogid=207 You'll have to comment out the default ? killer or put this rule before it. Maybe there's something I'm missing, though. Eric -- Eric J. Christeson [EMAIL PROTECTED] Information Technology Services (701) 231-8693 (Voice) Room 242C, IACC
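The ordering point above matters because regex-urlfilter.txt is applied top-down with first match winning. A sketch of the relevant fragment — the allow rule is reconstructed from the truncated line above, so treat it as illustrative:

```
# regex-urlfilter.txt -- rules apply top-down, first match wins, so the
# blog rule must sit above the default rule that rejects URLs with '?':
+^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207
# default rule that skips URLs containing query-string characters:
-[?*!@=]
```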

Re: Error: Generator: 0 records selected for fetching, exiting ...

2008-05-21 Thread Eric J. Christeson
hasn't passed yet. eric Eric J. Christeson [EMAIL PROTECTED] Information Technology Services (701) 231-8693 (Voice) Room 242C, IACC Building North Dakota State University, Fargo, ND 58105-5164 Organizations which design systems are constrained

Null pointer error when perform search

2006-07-21 Thread Eric Wu
the error. Did I do something wrong? Please help. Thanks. - Eric *type* Exception report *message* *description* *The server encountered an internal error () that prevented it from fulfilling this request.* *exception* org.apache.jasper.JasperException

java.util.MissingResourceException on resin

2006-05-25 Thread eric park
hello, I installed nutch on resin 3.0. When I open search.jsp, I get this error below. search.jsp line 116 is - i18n:bundle baseName=org.nutch.jsp.search / any thoughts? Thank you java.util.MissingResourceException: Can't find bundle for base name org.nutch.jsp.search, locale ko_KR at

Nutch0.6 and Nutch 0.7 crawlers

2006-04-12 Thread eric park
depth, but in the second depth, nutch 0.7 fetches only 15 urls while nutch 0.6 fetches 34 urls. Of course, the configuration and settings are the same. Can you tell me why I get these results? Thank you Eric Park log for Nutch 0.6 crawler -- 060328 182513 logging at INFO 060328 182513 fetching

Re: Nutch0.6 and Nutch 0.7 crawlers

2006-04-12 Thread eric park
can't figure out why it filters out urls starting with 'www' in the second depth. Nutch 0.6 works just fine. Are there any known bugs in the Nutch 0.7 crawler? thank you, Eric Park 2006/4/12, Andrzej Bialecki [EMAIL PROTECTED]: eric park wrote: hello. I tried to crawl a certain site using both nutch 0.6