at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$Ques.match(Pattern.java:3691)
2010-01-11 00:31:53,221 WARN io.UTF8 - truncating long string: 62492 chars, starting with java.lang.StackOverf
Eric Osgood
-
Cal Poly - Computer Engineering
(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: index. Program will exit.
Do you have to set the -Xss flag somewhere else?
Thanks,
Eric
On Jan 11, 2010, at 8:36 AM, Godmar Back wrote:
Very intriguing, considering that we
How do I set the bin/nutch stack size and the hadoop job stack size?
--Eric
On Jan 11, 2010, at 9:22 AM, Fuad Efendi wrote:
Also, put it in Hadoop settings for tasks...
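For reference, a sketch of the task-side setting in mapred-site.xml (property
name per Hadoop 0.19/0.20; the option values below are only examples):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m -Xss1m</value>
  </property>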
http://www.tokenizer.ca/
-Original Message-
From: Godmar Back [mailto:god...@gmail.com]
Sent: January-11
In hadoop-env.sh, how do you add options such as -Xss, -Xms, -Xmx?
--Eric
On Jan 11, 2010, at 9:34 AM, Mischa Tuffield wrote:
You can set it in hadoop-env.sh, and then run it. Or you could add it to
your /etc/bashrc or the bashrc file of the user that runs Hadoop.
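For example (variable names per the stock hadoop-env.sh; the values are
placeholders to adapt):

  # hadoop-env.sh: extra JVM options for the Hadoop daemons
  export HADOOP_OPTS="$HADOOP_OPTS -Xss2m"
  # daemon heap size (-Xmx) in MB
  export HADOOP_HEAPSIZE=1000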
Mischa
On 11 Jan
information - I have no idea how to
fix this problem.
Thanks,
Eric
On Nov 20, 2009, at 1:30 AM, Julien Nioche wrote:
It was probably a one-off, network-related problem. Can you tell us a bit
more about your cluster configuration?
2009/11/19 Eric Osgood e...@lakemeadonline.com
Julien,
Thanks
attempt_200911191100_0001_m_29_1
2009-11-19 11:20:21,135 WARN mapred.TaskRunner - Parent died.
Exiting attempt_200911191100_0001_r_04_1
Can anyone tell me how to resolve this error?
Thanks,
Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-related
issue.
2009/11/19 Eric Osgood e...@lakemeadonline.com
This is the first time I have received this error while crawling. During a
crawl of 100K pages, one of the nodes had a task fail and cited "Too Many
Fetch Failures" as the reason. The job completed successfully but took about
3 times
Julien,
Another thought - I just installed Tomcat and Solr - would that
interfere with Hadoop?
On Nov 19, 2009, at 2:41 PM, Eric Osgood wrote:
Julien,
Thanks for your help; how would I go about fixing this error now
that it is diagnosed?
On Nov 19, 2009, at 1:50 PM, Julien Nioche wrote
fetching sometimes.
Thanks for the help,
Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
deleting all my data nodes and formatting the namenode to no
avail.
Thanks,
Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
in generatorSortValue()? I only see a way to check the score, not a
flag.
Thanks,
Eric
On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:
Eric Osgood wrote:
Andrzej,
How would I check for a flag during fetch?
You would check for a flag during generation - please check
score to Float.MinValue, however it is still getting fetched. Is
there another way to tell the fetcher not to fetch certain links based on
their score?
Thanks,
Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
Also,
In the scoring-links plugin, I set the return value for
ScoringFilter.generatorSortValue() to Float.MinValue for all URLs and
it still fetched everything; maybe Float.MinValue isn't the correct
value to set so a link never gets fetched?
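As an aside (plain Java, not Nutch-specific): the constant is spelled
Float.MIN_VALUE, and it is the smallest *positive* float, not the most
negative one, so it still sorts above zero. A quick check:

  public class FloatMinDemo {
      public static void main(String[] args) {
          System.out.println(Float.MIN_VALUE);   // 1.4E-45, tiny but positive
          System.out.println(-Float.MAX_VALUE);  // -3.4028235E38, the most negative finite float
      }
  }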
Thanks,
Eric
On Oct 22, 2009, at 1:10 PM
java.io.IOException: Could not obtain block:
blk_-8206810763586975866_5190
file=/user/hadoop/crawl/segments/20091020170107/crawl_generate/part-9
Do you know why I would be getting these errors? I had a lost tracker
error also - could these problems be related?
Thanks,
Eric
On Oct 20
combo?
Thanks,
Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood
Andrzej,
I just downloaded the most recent trunk from svn as per your
recommendations for fixing the generate bug. As soon as I have it all
rebuilt with my configs I will let you know how a crawl of ~1.6M
pages goes. Hopefully no errors!
Thanks,
Eric
On Oct 20, 2009, at 2:13 PM, Andrzej
Is there a way to enable dynamic HTML parsing in Nutch using a plugin
or setting?
Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
Andrzej,
Where do I get the nightly builds from? I tried to use the Eclipse
plugin that supports svn to no avail. Is there an FTP or HTTP server
where I can download the Nutch source fresh?
Thanks,
Eric
On Oct 11, 2009, at 12:40 PM, Andrzej Bialecki wrote:
Eric Osgood wrote:
When I set
Ok, I think I am on the right track now, but just to be sure: the code
I want is the branch section of svn under nutchbase at http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/
correct?
Thanks,
Eric
On Oct 13, 2009, at 1:38 PM, Andrzej Bialecki wrote:
Eric Osgood wrote
Oh, OK.
You learn something new every day! I didn't know that the trunk was the
most recent build. Good to know! So this current trunk does have a fix
for the generator bug?
On Oct 13, 2009, at 2:05 PM, Andrzej Bialecki wrote:
Eric Osgood wrote:
So the trunk contains the most recent
not mistaken.
Eric
On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:
Hey,
Never mind. I got *generate.update.db* in *nutch-default.xml* and
set it to true.
Regards,
Gaurang
2009/10/5 Gaurang Patel gaurangtpa...@gmail.com
Hey Andrzej,
Can you tell me where to set this property
Yes, using a Hadoop cluster. I would recommend the tutorial called
NutchHadoopTutorial on the wiki.
On Oct 6, 2009, at 8:56 AM, Gaurang Patel wrote:
All-
Any idea on how to configure Nutch to generate/fetch on multiple machines
simultaneously?
-Gaurang
Has anyone written a script for whole-web crawling using Hadoop? The
script for Nutch doesn't work since the data is inside HDFS (tail -
f won't work with this).
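For what it's worth, a rough sketch of scripting the generate/fetch/update
loop on a cluster (paths and -topN are examples, and the segment-picking
line assumes 'hadoop dfs -ls' prints paths in the last column, so verify it
against your Hadoop version):

  #!/bin/sh
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 100000
    # assumption: the newest segment is the last path listed
    segment=`bin/hadoop dfs -ls crawl/segments | tail -1 | awk '{print $NF}'`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
  done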
Thanks,
Eric
Sorry Ryan,
I should have clarified that I am using Nutch as my crawler. There is
a script for Nutch to do whole-web crawling, but it is not compatible
with Hadoop.
Eric Osgood
-
Cal Poly - Computer Engineering
Moon Valley Software
Is there a way to inspect the list of links that Nutch finds per page
and then at that point choose which links I want to include / exclude?
That is the ideal remedy to my problem.
Eric Osgood
-
Cal Poly - Computer Engineering
Moon Valley Software
add other
links until X is reached. This way, I don't waste crawl time on non-
relevant links.
Thanks,
Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e
Does anyone know if it is possible to target only certain links for
crawling dynamically during a crawl? My goal would be to write a
plugin for this functionality, but I don't know where to start.
Thanks,
EO
My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can
crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLDs,
then crawl the links generated from the TLDs in increments of 100K?
Thanks,
EO
Adam,
Yes, I have a list of strings I would look for in the link. My plan is
to look for X number of links on the site: first looking for the
links I want and, if they exist, adding them; if they don't exist, adding X
links from the site. I am planning to start in the URL filter plugin, as
sketched below.
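A skeleton of that idea, assuming the Nutch 1.x URLFilter interface
(org.apache.nutch.net.URLFilter); the class name and search strings are
made up, and the plugin.xml wiring is omitted:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Sketch: keep a link only if it contains one of the wanted strings.
  public class KeywordURLFilter implements URLFilter {
    private static final String[] WANTED = { "product", "review" }; // example strings
    private Configuration conf;

    public String filter(String urlString) {
      for (String w : WANTED) {
        if (urlString.contains(w)) {
          return urlString; // keep the link
        }
      }
      return null; // returning null drops the link
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }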
Eric
Adam,
You could turn off all the indexing plugins and write your own plugin
that only indexes certain meta content from your intranet - giving you
complete control of the fields indexed.
Eric
On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:
hi
does anybody know if it's possible
Andrzej,
Just to make sure I have this straight: set the generate.update.db
property to true, then run
bin/nutch generate crawl/crawldb crawl/segments -topN 10
16 times?
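For reference, the nutch-site.xml override would look like this (property
name per nutch-default.xml):

  <property>
    <name>generate.update.db</name>
    <value>true</value>
    <!-- marks generated URLs in the crawldb so back-to-back generate
         runs do not select the same URLs again -->
  </property>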
Thanks,
Eric
On Oct 5, 2009, at 1:27 PM, Andrzej Bialecki wrote:
Eric wrote:
My plan is to crawl ~1.6M TLD's
apart from reading the code.
Eric
--
Eric J. Christeson
eric.christe...@ndsu.edu
Enterprise Computing and Infrastructure (701) 231-8693 (Voice)
North Dakota State University, Fargo, North Dakota, USA
On Mar 16, 2009, at 7:55 PM, Otis Gospodnetic wrote:
Eric,
There are a couple of ways you can back up a Lucene index built by
Solr:
1) have a look at the Solr replication scripts, specifically
snapshooter. This script creates a snapshot of an index. It's
typically triggered by Solr
or
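A minimal manual version of the same idea (a sketch; assumes the index is
quiescent, e.g. right after a commit, and the paths are examples): hard-link
the index files, which is essentially what snapshooter does.

  cp -lr /var/solr/data/index /var/solr/backups/snapshot.`date +%Y%m%d%H%M%S`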
experience with backing up Solr indexes? Is it as simple as moving the
index like we do with Nutch indexes?
Thanks,
Eric
--
Eric J. Christeson eric.christe...@ndsu.edu
Enterprise Computing and Infrastructure
Phone: (701) 231-8693
North Dakota State University, Fargo, North Dakota
(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
You need to be using Java 6. Hadoop 0.19 requires it.
Eric
--
Eric J
We ended up using -1 for unlimited after running into some
15MB PDF files. The PDF parser would barf if it didn't get the whole
file. This was with 0.9; don't know if 1.0 includes
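The property in question is presumably http.content.limit; a nutch-site.xml
sketch:

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <!-- -1 disables truncation so large PDFs are fetched whole -->
  </property>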
Eric
--
Eric J. Christeson
eric.christe...@ndsu.edu
Enterprise Computing
-site.xml and check if your crawl starts
fetching URLs.
Kind regards,
Martina
In nutch-0.9, nutch-default.xml has urlfilter.regex.file set to
regex-urlfilter.txt
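That is, a nutch-site.xml override along these lines (a sketch; the named
file must be in the conf directory):

  <property>
    <name>urlfilter.regex.file</name>
    <value>regex-urlfilter.txt</value>
  </property>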
Thanks,
Eric
number' you'll find that it refers to cases exactly
like this where a library needs a (usually) newer JVM.
Eric
--
Eric J. Christeson eric.christe...@ndsu.edu
Enterprise Computing and Infrastructure (701) 231-8693
North Dakota State University, Fargo, North Dakota
(>=0), content longer than it will be
truncated; otherwise, no truncation at all.
</description>
</property>
Eric
--
Eric J. Christeson
eric.christe...@ndsu.edu
Enterprise Computing and Infrastructure (701) 231-8693 (Voice)
North Dakota State University
How is it done?
For now, what I do is merge 2 crawls into a new one:
bin/nutch mergedb crawl/crawldb crawl1/crawldb/ crawl2/crawldb/
Is that the only solution?
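For reference, the merger's usage is roughly (per CrawlDbMerger; worth
verifying with bin/nutch mergedb on your version):

  bin/nutch mergedb <output_crawldb> <crawldb1> [<crawldb2> ...] [-normalize] [-filter]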
. If anyone wants
more information, let me know.
--
Eric J. Christeson
[EMAIL PROTECTED]
Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164
Organizations which design systems
when you recompiled?
eric
--
Eric J. Christeson
[EMAIL PROTECTED]
Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164
Organizations which design systems are constrained
for abc, go, com and folder next to
each other.
What is the proper syntax?
url:abc url:folder produces quite a few wrong results.
It should work with url:abc go com folder
eric
--
Eric J. Christeson
[EMAIL PROTECTED]
Information Technology Services
/api/RobotRulesParser.java
parses robots.txt
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java
parses robot rules from HTML documents.
eric
Eric J. Christeson
[EMAIL PROTECTED]
Information Technology Services (701
]*\.)*geekzone.co.nz/blog.asp\?blogid=207
You'll have to comment out the default '?' killer or put this rule before
it.
Maybe there's something I'm missing, though.
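Concretely, the ordering in regex-urlfilter.txt would be (the accept rule's
head is reconstructed from Nutch's stock pattern, so treat it as an
assumption):

  # accept this blog URL even though it contains a query character
  +^http://([a-z0-9]*\.)*geekzone.co.nz/blog.asp\?blogid=207
  # the stock rule below would otherwise drop any URL with these characters
  -[?*!@=]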
Eric
--
Eric J. Christeson [EMAIL PROTECTED]
Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC
hasn't passed yet.
eric
Eric J. Christeson
[EMAIL PROTECTED]
Information Technology Services (701) 231-8693 (Voice)
Room 242C, IACC Building
North Dakota State University, Fargo, ND 58105-5164
Organizations which design systems are constrained
the error.
Did I do something wrong? Please help. Thanks.
- Eric
type: Exception report
message:
description: The server encountered an internal error () that prevented it
from fulfilling this request.
exception: org.apache.jasper.JasperException
Hello, I installed Nutch on Resin 3.0. When I open search.jsp, I get the
error below.
search.jsp line 116 is: <i18n:bundle baseName="org.nutch.jsp.search" />
any thoughts?
Thank you
java.util.MissingResourceException: Can't find bundle for base name
org.nutch.jsp.search,
locale ko_KR
at
depth, but at the second depth, Nutch 0.7 fetches only 15 URLs while
Nutch 0.6 fetches 34 URLs. Of course, the configuration and settings
are the same.
Can you tell me why I get these results?
Thank you
Eric Park
log for Nutch 0.6 crawler
--
060328 182513 logging at INFO
060328 182513 fetching
can't figure out why it filters out URLs starting with
'www' at the second depth. Nutch 0.6 works just fine. Are there any known
bugs in the Nutch 0.7 crawler?
thank you,
Eric Park
2006/4/12, Andrzej Bialecki [EMAIL PROTECTED]:
eric park wrote:
Hello. I tried to crawl a certain site using both Nutch 0.6