them over is time
trivial...
also the workaround is annoying knowing that we have a protocol-file
plugin.
Thanks for the help
Lewis
On Wednesday, August 7, 2013, Bai Shen baishen.li...@gmail.com wrote:
Is it possible to run a web server and connect to them that way? That was
what I
it was the problem. However, the behavior
occurs with both it and the default scheduler.
Did you then start from scratch again? Otherwise the next fetch time
is still far in the future and the fetch interval stays too large.
Sebastian
On 08/07/2013 03:30 PM, Bai Shen wrote:
Sorry
Is it possible to run a web server and connect to them that way? That was
what I ended up doing.
On Tue, Aug 6, 2013 at 4:58 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi,
Struggling with this one. And yes I acknowledge that it is not really a
Nutch based question but
db.fetch.schedule.adaptive.max_interval
db.fetch.schedule.adaptive.sync_delta
Sebastian
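For reference, those two properties go in nutch-site.xml; a minimal sketch with illustrative values (max_interval is in seconds, sync_delta is a boolean):
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>604800</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.sync_delta</name>
  <value>true</value>
</property>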
On 07/17/2013 06:58 PM, Bai Shen wrote:
I'm using Nutch 2.x HEAD with the default scheduler. I have the max fetch
interval set to one week and the fetch interval set to one day.
Everything seems to work
I'm using Nutch 2.x HEAD with the default scheduler. I have the max fetch
interval set to one week and the fetch interval set to one day.
Everything seems to work correctly for a while. Pages show up as fetched
with a fetch time of the next day. However, after a couple of days
generate
So when fetch has fetched say page1, page2, page3 from url1 and
page4,page5,page6 from url2, after the crawl, how do I tell that page4 is
from url2.com and page1 is from url1.com?
On Thu, Jul 11, 2013 at 10:54 AM, Bai Shen baishen.li...@gmail.com
wrote:
Yes, generate marks the urls in HBase,
most probably it is either 404 or status other than 200.
Hope this helps.
On Fri, May 24, 2013 at 8:13 AM, Bai Shen baishen.li...@gmail.com
wrote:
I'm running Nutch 2.1 using HBase.
When I run readdb -stats I show that there are 15k unfetched urls.
However, when I run generate
The crawl script doesn't accept Batch ID. So in order to use Batch ID you
would run the commands separately which would not involve depth. Depth is
just the number of times to run the generate, fetch, parse, update cycle.
Any unfetched pages will not have a Batch ID. The Batch ID only applies
Have you done another crawl? By default, Nutch puts the redirect into the
database as a new url to be crawled. So you will find the content under
the location of the redirect.
If I remember correctly, there used to be a setting that would have Nutch
follow the redirect instead of storing it as
This isn't what Batch ID is for. If you're crawling on only the one server
and only want that specific section, use the regex-urlfilter to accept only
the specific pages you want.
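As a sketch of that approach, a regex-urlfilter.txt restricted to one section of a hypothetical host might look like this (rules are applied in order, so the catch-all reject goes last):
+^http://myserver\.example\.com/docs/
-.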
On Tue, Jul 9, 2013 at 3:36 PM, h b hb6...@gmail.com wrote:
Hi
Use case:
* Scrape a given url. e.g.
, Jul 11, 2013 at 4:25 AM, Bai Shen baishen.li...@gmail.com wrote:
This isn't what Batch ID is for. If you're crawling on only the one server
and only want that specific section, use the regex-urlfilter to accept only
the specific pages you want.
On Tue, Jul 9, 2013 at 3:36 PM, h b hb6
I'm dealing with a lot of file types that I don't want to index. I was
originally using the regex filter to exclude them but it was getting out of
hand.
I changed my plugin includes from
urlfilter-regex
to
urlfilter-(regex|suffix)
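For reference, that change is made in the plugin.includes property in nutch-site.xml; the rest of the value below is only illustrative, keep whatever plugins you already list and swap just the urlfilter part:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>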
I've tried using both the default urlfilter-suffix.txt file
Sorry. I forgot to mention that I'm running a 2.x release taken from a few
weeks ago.
On Wed, Jun 12, 2013 at 8:31 AM, Bai Shen baishen.li...@gmail.com wrote:
I'm dealing with a lot of file types that I don't want to index. I was
originally using the regex filter to exclude them
I figured as much, which is why I'm not sure why it's not working for me.
I ran bin/nutch org.apache.nutch.net.URLFilterChecker http://myserver/myurl
and it's been thirty minutes with no results.
Is there something I should run before running that?
Thanks.
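If memory serves, URLFilterChecker reads the URLs to test from standard input rather than from a command-line argument, which would explain the apparent hang; something along these lines (the -allCombined flag, which runs all configured filters, is an assumption about the version in use):
echo "http://myserver/myurl" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined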
On Wed, Jun 12, 2013 at 8:34 AM,
Doh! I really should just read the code of things before posting.
I ran the URLFilterChecker and passed it in a url that the SuffixFilter
should flag and it still passed it. However, if I change the url to end in
a format that is in the default config file, it rejects the url.
So it looks like
Turns out it was because I had a copy of the default file sitting in the
directory I was calling nutch from.
Once I removed that it correctly found my copy in the conf directory.
On Wed, Jun 12, 2013 at 9:29 AM, Bai Shen baishen.li...@gmail.com wrote:
Doh! I really should just read the code
4, 2013 at 4:31 PM, Bai Shen baishen.li...@gmail.com wrote:
I dropped the f family from HBase and readded it. Nutch filled in the
columns and now I have sane fetch times.
However, my fetchInterval is not being populated and every time I run a
crawl I get the same urls. Here is my metadata
attempted to be fetched but that failed
and so their retry interval was incremented to a larger value. Can't say
for sure though.
Can you share the crawl datum? The status and meta fields can give some
clue.
On Mon, Jun 3, 2013 at 8:30 AM, Bai Shen baishen.li...@gmail.com wrote:
The time
fetchTime: 1370388764041
prevFetchTime: 0
fetchInterval: 0
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: SUCCESS
parseStatus: success/ok
Any ideas why?
On Tue, Jun 4, 2013 at 8:16 AM, Bai Shen baishen.li...@gmail.com wrote:
I'm looking at my base url(the root of the internal site that I
I'm using the 2.x head and even with adding 30 days I'm not getting any
refetches. I did a readdb on my injected url and it says that the fetch
time is in 2027.
Any idea why this would occur? Will db.fetch.interval.max kick in and
cause it to be fetched earlier? Or will I have to manually
3, 2013 at 8:57 PM, Bai Shen baishen.li...@gmail.com
wrote:
I'm using the 2.x head and even with adding 30 days I'm not getting any
refetches. I did a readdb on my injected url and it says that the fetch
time is in 2027.
Can you share the crawl datum for that url?
Any idea
The issue with CDH4 is that it uses a newer version of HBase. If Gora 0.3
now supports versions of HBase newer than 0.90 it should fix the CDH4
issues.
However, from what I've read, Gora won't support newer HBase versions until
0.4.
On Mon, Jun 3, 2013 at 10:24 AM, Tejas Patil
that unfetched status is not updated in Nutch.
I also faced a similar problem [1]. Please open a jira and report any
findings.
[1] http://find.searchhub.org/document/6e4464919811d20f#c2a5de6e93942ada
On Fri, May 24, 2013 at 10:03 AM, Bai Shen baishen.li...@gmail.com
wrote:
I'm trying
I'm running Nutch 2.1 using HBase.
When I run readdb -stats I show that there are 15k unfetched urls.
However, when I run generate -topN 1000 I get no urls to be fetched. Up
until now it's been pulling a full thousand urls for each cycle.
Any ideas? I'm not sure what to check.
Thanks.
other than 200.
Hope this helps.
On Fri, May 24, 2013 at 8:13 AM, Bai Shen baishen.li...@gmail.com wrote:
I'm running Nutch 2.1 using HBase.
When I run readdb -stats I show that there are 15k unfetched urls.
However, when I run generate -topN 1000 I get no urls to be fetched. Up
I just tested the GeneratorJob portion and it works fine. I have two
comments, though.
1. I added braces around the -batchId arg if statement. I don't like ifs
without them.
2. BatchIds never get cleared. So if you use the same batchId for
multiple crawl cycles your urls per batch will
I'm trying to set up nutch using HEAD instead of 2.1. I went to change
ivy.xml to uncomment the HBase dependency before calling ant and it's not
there.
Has this been removed? Is there a new way to set up HBase integration?
Thanks.
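For 2.x the usual recipe is to uncomment the gora-hbase dependency in ivy/ivy.xml and point the default store at HBase; a sketch of the nutch-site.xml side, assuming gora-hbase is on the classpath:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>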
NM, I grabbed trunk instead of 2.x.
On Mon, May 13, 2013 at 7:25 AM, Bai Shen baishen.li...@gmail.com wrote:
I'm trying to set up nutch using HEAD instead of 2.1. I went to change
ivy.xml to uncomment the HBase dependency before calling ant and it's not
there.
Has this been removed
you to upgrade to 2.x HEAD.
On Wed, May 1, 2013 at 4:32 AM, Bai Shen baishen.li...@gmail.com wrote:
My crawl loop consists of the following.
generate -topN
fetch -all
parse -all
updatedb
solrindex -all
With the fetch and parse, the -all only pulls the batch that was generated
?
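For comparison, one cycle driven by an explicit batch id rather than -all might look like the following; the topN value, Solr URL and <batchId> are placeholders, and generate prints the batch id it assigned:
bin/nutch generate -topN 1000
bin/nutch fetch <batchId>
bin/nutch parse <batchId>
bin/nutch updatedb
bin/nutch solrindex http://localhost:8983/solr <batchId>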
On Wed, May 1, 2013 at 7:32 AM, Bai Shen baishen.li...@gmail.com wrote:
My crawl loop consists of the following.
generate -topN
fetch -all
parse -all
updatedb
solrindex -all
With the fetch and parse, the -all only pulls the batch that was generated,
skipping all of the other
I'm using 2.1.
Are there any other notable changes for using the HEAD instead?
On Wed, May 1, 2013 at 2:13 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
What version are you using?
If you can I would advise you to upgrade to 2.x HEAD.
On Wed, May 1, 2013 at 4:32 AM, Bai Shen
Apologies for the cross posting, but I'm not sure where the configuration
I'm missing lies.
I have Nutch 2.1 connecting to HBase 0.90.6 with everything running locally
on the same machine. Now I'm trying to move them to separate machines.
I added hbase.zookeeper.quorum in the Nutch
It turned out to be an /etc/hosts issue.
I needed to remove the hbase host name from 127.0.0.1 and add a separate
line with the machine's ip and host name. Then I had to duplicate that on
the Nutch machine.
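For anyone hitting the same thing, the quorum property mentioned above would typically go in hbase-site.xml on the Nutch machine's classpath (the hostname here is a placeholder):
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hbase-host.example.com</value>
</property>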
On Thu, May 2, 2013 at 9:29 AM, Bai Shen baishen.li...@gmail.com wrote:
Apologies
: It does not explicitly trigger a server side commit.
</description>
</property>
On Thu, Apr 25, 2013 at 4:35 PM, Bai Shen baishen.li...@gmail.com wrote:
I'm having two problems with the solrindex job in Nutch 2.1
When I run it with -all, it indexes every single parsed document, not
just
relaxing my other settings to see if the errors and hung
threads come back yet.
On Tue, Apr 30, 2013 at 12:50 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
That would be very much appreciated.
Lewis
On Tue, Apr 30, 2013 at 5:00 AM, Bai Shen baishen.li...@gmail.com wrote:
I'll
I'll let you know if I figure out any good defaults.
Thanks.
On Sat, Apr 27, 2013 at 5:30 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Bai,
On Thu, Apr 25, 2013 at 4:33 AM, Bai Shen baishen.li...@gmail.com wrote:
Well, I still ended up having to set a content limit
Is there a way to remove the fetched files from HBase after they've been
parsed? I'm running things locally and don't have the storage
space to store all of the fetched files.
Thanks.
a question for the Gora experts.
I was finally able to get the fetch to complete by setting the Nutch heap
to 4GB and the HBase heap to 4GB.
A heap size 4 times the document size doesn't seem that much ;-)
On 04/24/2013 01:34 PM, Bai Shen wrote:
It doesn't take that long on my local machine
I'm having two problems with the solrindex job in Nutch 2.1
When I run it with -all, it indexes every single parsed document, not just
the newly generated ones, as fetch and parse do.
Secondly, it's adding my documents in small chunks. I was fetching in 100
document cycles and when I run
these values, but a single fetch should never take 5 min.
Sebastian
On 04/23/2013 06:17 PM, Bai Shen wrote:
Anything larger than the default http.content.limit.
I'm crawling an internal server and we have some large files. That's why I
had increased the heap size to 8G. When I run
of the files which were truncated?
thank you
Lewis
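The knob in question is http.content.limit in nutch-site.xml; -1 disables truncation entirely (at the cost of memory), any positive value is a cap in bytes:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>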
On Tuesday, April 23, 2013, Bai Shen baishen.li...@gmail.com wrote:
I just set http.content.limit back to the default and my fetch completed
successfully on the server. However, it truncated several of my files.
Also, my server is running
I'm crawling a local server. I have Nutch 2 working on a local machine
with the default 1G heap size. I got several OOM errors, but the fetch
eventually finishes.
In order to get rid of the OOM errors, I moved everything to a machine with
more memory and increased the heap size to 8G. However,
http://stackoverflow.com/questions/10331440/nutch-fetcher-aborting-with-n-hung-threads
Also keep track of:
https://issues.apache.org/jira/browse/NUTCH-1182
Sebastian
On 04/22/2013 08:18 PM, Bai Shen wrote:
I'm crawling a local server. I have Nutch 2 working on a local machine
with the default 1G
keep track of:
https://issues.apache.org/jira/browse/NUTCH-1182
Sebastian
On 04/22/2013 08:18 PM, Bai Shen wrote:
I'm crawling a local server. I have Nutch 2 working on a local machine
with the default 1G heap size. I got several OOM errors, but the fetch
eventually finishes.
In order
at 12:26 PM, Bai Shen baishen.li...@gmail.com
wrote:
I'm trying to crawl a local file system. I've made the changes to not
ignore file urls and added protocol-file to the plugins list. I've
included file:///data/mydir in my url file.
However, when I run the fetch, Nutch tries to connect
I don't know about the progress on it. There are most certainly
open/resolved tickets for it on Jira, please look there.
Thank you
Lewis
On Wed, Mar 27, 2013 at 12:26 PM, Bai Shen baishen.li...@gmail.com
wrote:
I'm trying to crawl a local file system. I've made the changes to not
ignore
I'm trying to crawl a local file system. I've made the changes to not
ignore file urls and added protocol-file to the plugins list. I've
included file:///data/mydir in my url file.
However, when I run the fetch, Nutch tries to connect to file://data/mydir
and therefore returns a 404 error. I
NM. I didn't realize the distributed launcher ignores NUTCH_HEAPSIZE.
You have to use HADOOP_HEAPSIZE.
On Wed, Oct 17, 2012 at 3:33 PM, Bai Shen baishen.li...@gmail.com wrote:
I'm running Nutch in a distributed deployment using LocalJobRunner. I'm
trying to increase the heap size
Doh! I misread the code when I was looking at it.
Thanks.
On Fri, Oct 12, 2012 at 10:04 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Just look at the existing plugins e.g. anchor: call add() as many times as
you have values
On 12 October 2012 14:43, Bai Shen baishen.li
are
big, or change a lot.
On Thu, Oct 11, 2012 at 1:20 PM, Bai Shen baishen.li...@gmail.com wrote:
I need to reference a file from my plugin. However, when I try to call it
using File(blah.txt), it looks for the file at the location where I run
nutch from, not in the job file.
What
On Thu, Oct 11, 2012 at 2:52 PM, Bai Shen baishen.li...@gmail.com wrote:
I'm trying the first method now. However, some of my code requires a file
object that is a directory. So I can't use the first method with it.
If I put something in the /classes, what path do I reference it with?
Do
, 2012 at 3:41 PM, Bai Shen baishen.li...@gmail.com wrote:
I'll give that a try. I know when I was doing new File(dir/blah.txt) it
was looking in the directory on the client machine that I was running the
job from.
On Thu, Oct 11, 2012 at 9:30 AM, Ferdy Galema ferdy.gal...@kalooga.com
is expanded into the current
working directory of a running task.
Last but not least you are able to use shared filesystem, for example the
HDFS or the mapreduce DistributedCache. This is useful if the files are
big, or change a lot.
On Thu, Oct 11, 2012 at 1:20 PM, Bai Shen baishen.li
I just tried to run it and I'm getting the following bug on CDH4.
https://issues.apache.org/jira/browse/NUTCH-1447
On Mon, Oct 1, 2012 at 8:17 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi All,
Anyone else for this VOTE?
Sorry to be a pest!
Thanks
Lewis
On Fri, Sep
it to work on other
distribution then the better it is but this can't be considered a bug or a
blocker for the release
On 3 October 2012 14:10, Bai Shen baishen.li...@gmail.com wrote:
I just tried to run it and I'm getting the following bug on CDH4.
https://issues.apache.org/jira/browse
I'm getting the following error trying to run Nutch 2 on CDH4. A lot of
people seem to be running into this problem, and there's even a JIRA bug
opened, but I haven't seen any comments on it. Anybody have any
suggestions to work around it?
https://issues.apache.org/jira/browse/NUTCH-1447
. It is the intention
to integrate this into 2.x once it has been tested enough. The glitch
you highlight is exactly the type of stuff we need to find.
Thanks
Lewis
[0] https://issues.apache.org/jira/browse/NUTCH-1087
On Fri, Sep 28, 2012 at 2:52 PM, Bai Shen baishen.li...@gmail.com wrote:
When
Two things to note. db.ignore.external.links doesn't quite work the way
you think it should. If you have a url inside the domain that resolves to
a url outside the domain, nutch will end up indexing that domain as well.
The way to get around this is to use the whitelist instead of
I'm attempting to get Nutch 2 running on a CDH4 cluster. I'm having the
following problem.
[root@node4-0 nutch]# bin/nutch readdb -stats
bin/nutch: line 98: [: /opt/nutch/apache-nutch-2.1-SNAPSHOT.job: binary operator expected
Error: Could not find or load main class
lewis.mcgibb...@gmail.com wrote:
Solr logs?
On Fri, Sep 14, 2012 at 9:33 PM, Bai Shen baishen.li...@gmail.com wrote:
I have a nutch 2 setup that I got working with solr about a month ago. I
had to shelve it for a little while and I've recently come back to it.
Everything seems to be working
The problem appears to be that Nutch is not sending anything to solr. But
I can't seem to find a reason in nutch as to why this is.
On Sat, Sep 15, 2012 at 7:36 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Solr logs?
On Fri, Sep 14, 2012 at 9:33 PM, Bai Shen baishen.li
Sorry, I keep forgetting. I'm using the Nutch 2.x branch as of last week.
However, there hasn't been a change to the filter in a month or so.
It was parsed correctly as far as I can tell. I'm seeing the same content
in solr as what I see in the browser.
On Mon, Aug 13, 2012 at 9:19 AM, Markus
with these parameters:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1600m -XX:-UseGCOverheadLimit -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp</value>
</property>
On Wed, Aug 8, 2012 at 9:32 PM, Bai Shen baishen.li...@gmail.com
wrote:
Is this something other people
Is this something other people are seeing? I was parsing 10k urls when I
got this exception. I'm running Nutch 2 head as of Aug 6 with the default
memory settings(1 GB).
Just wondering if anybody else has experienced this on Nutch 2.
Thanks.
error.
So to fix it: Apply the patch in
NUTCH-1444 (https://issues.apache.org/jira/browse/NUTCH-1444) or
update to Nutch 2.x head.
Ferdy
On Mon, Aug 6, 2012 at 9:21 PM, Bai Shen baishen.li...@gmail.com wrote:
I'm working on writing a Nutch 2 plugin. Whenever something is configured
wrong, I
in
NUTCH-1444 (https://issues.apache.org/jira/browse/NUTCH-1444) or
update to Nutch 2.x head.
Ferdy
On Mon, Aug 6, 2012 at 9:21 PM, Bai Shen baishen.li...@gmail.com wrote:
I'm working on writing a Nutch 2 plugin. Whenever something is configured
wrong, I don't get any valid logging info
I just tried running this with the actual batch Id instead of using -all,
and I'm still getting similar results.
On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen baishen.li...@gmail.com wrote:
I set up Nutch 2.x with a new instance of HBase. I ran the following
commands.
bin/nutch inject urls
bin
I'm trying to crawl using Nutch 2. However, I can't seem to get it to
index to solr without adding -reindex to the command. And at that point it
indexes everything I've crawled. I've tried both -all and the batch id,
but neither one results in anything being indexed to solr.
Any suggestions of
Is there a specific place it's located? I turned on debugging, but I'm not
seeing a batch id.
On Mon, Jul 30, 2012 at 1:14 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Can you stick on debug logging and see what the batch ID's actually are?
On Mon, Jul 30, 2012 at 6:12 PM, Bai
updatedb.
So, each generate command assigned different batchIds to its own set of
urls.
Alex.
-Original Message-
From: Bai Shen baishen.li...@gmail.com
To: user user@nutch.apache.org
Sent: Tue, Jul 31, 2012 10:26 am
Subject: Re: Different batch id
Is there a specific place it's
It's your topN. You're pulling too many urls at one time. At the end of
each fetch map task, Nutch pulls all of the downloaded data into memory in
order to do the merge and sort. If that data is more than your allocated
memory, then you'll get a java heap exception.
There are only two ways to
Currently, I have roughly 10M records in my crawldb. I added some regexes
to remove some urls from my crawldb. Nothing complicated. However, when I
run with filtering turned on, the updatedb job took 118 hours.
Looking in the regex-urlfilter.txt file, I noticed some of the other
regexes are
be necessary any
more.
On Fri, Jun 8, 2012 at 3:53 PM, Bai Shen baishen.li...@gmail.com wrote:
I'm attempting to filter during the generating. I removed the noFilter and
noNorm flags from my generate job. I have around 10M records in my
crawldb.
The generate job has been running
I'm attempting to filter during the generating. I removed the noFilter and
noNorm flags from my generate job. I have around 10M records in my crawldb.
The generate job has been running for several days now. Is there a way to
check it's progress and/or make sure it's not hung?
Also, is there a
Somehow my crawler started fetching youtube. I'm not really sure why as I
have db.ignore.external.links set to true.
I've since added the following line to my regex-urlfilter.txt file.
-^http://www\.youtube\.com/
However, I'm still seeing youtube urls in the fetch logs. I'm using the
On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
-Original message-
From:Bai Shen baishen.li...@gmail.com
Sent: Tue 22-May-2012 19:40
To: user@nutch.apache.org
Subject: URL filtering and normalization
Somehow my crawler started fetching
parse an HTML file with a .txt extension just as a normal
HTML file, at least, here it does. What does your parserchecker say? In any
case you must strip potential left-over HTML in your Solr analyzer, if left
like this it's a bad XSS vulnerability.
Cheers
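A sketch of the Solr-side stripping Markus mentions, assuming a schema.xml field type you control (names are illustrative):
<fieldType name="text_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Note that a char filter only affects the indexed tokens; the stored value keeps its markup, so the front-end still needs to escape or strip what it displays.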
On Tue, 8 May 2012 08:34:58 -0400, Bai
Nutch ended up crawling some HTML files that had a TXT extension. Because
of this (I assume), it didn't strip out the HTML. So now I have weird
formatting on my results page.
Is there a way to fix this on the Nutch side so it doesn't happen again?
I'm working on using Shuyo's work to improve the language identification of
our search. Apparently, it's been moved from Nutch to Solr. Is there a
reason for this?
http://code.google.com/p/language-detection/issues/detail?id=34
I would prefer to have the processing done in Nutch as that has
I ended up using a shell script instead of doing it in Java, so I never
spent any more time investigating it.
On Wed, Feb 8, 2012 at 2:10 PM, webdev1977 webdev1...@gmail.com wrote:
I am curious as to what ever came of this? I am having the exact same issue
with nutch 1.3
segments, parse them, then merge
incrementally rather than attempting to merge several larger segments at
once?
Are you getting any IO problems when parsing the segments? If so this may
be an early warning light to attack the problem from another angle.
On Thu, Jan 12, 2012 at 4:41 PM, Bai Shen
I'm using nutch in distributed mode. I'm crawling large files(a bunch of
videos), and when the fetcher map job goes to merge the spill files in
order to send to the reduce I'm getting an OOM exception. It appears to be
because the merge is attempting to merge the data from all of the fetched
with bash and fetch and parse.
On Thursday 29 December 2011 21:29:02 Bai Shen wrote:
Currently, I'm using a shell script to run my nutch crawl. It seems to
work okay, but it only generates one segment at a time. Does anybody have
any suggestions for how to improve it, make it work
script instead since that doesn't have any of
the problems.
On Fri, Dec 23, 2011 at 5:08 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
On Thursday 22 December 2011 19:36:29 Bai Shen wrote:
How does the whole multiple segments work?
Use the generator to create multiple segments in one go
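In 1.x that is the -maxNumSegments option on the generate command; paths and numbers below are placeholders:
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 4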
schema field for title multiValued and deal with
it appropriately in your search front-end.
Cheers
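In schema.xml that is just the multiValued attribute on the field definition; the type name is whatever you already use:
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>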
On Wednesday 02 November 2011 15:02:11 Bai Shen wrote:
Found it right after I asked. :) BTW, the command is wrong on the wiki. I
need to get around to making
both just infinite
loops that call the various nutch parts in order.
On Mon, Dec 19, 2011 at 10:08 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
On Monday 19 December 2011 15:57:02 Bai Shen wrote:
AFAIK, mapred.map.child.java.opts is not set, but I'll double check.
When you say
I'm using the default db.fetch.retry.max value of 3, but I'm seeing retry
counts as high as 14 in the crawldb stats output. Any ideas why this is
and how to change it?
the parsing of these segments that is taking time... no?
On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen baishen.li...@gmail.com wrote:
On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
This is overwhelmingly weighted towards Hadoop configuration
I've tried running Nutch in local, pseudo, and full distributed mode, and I
keep getting OutOfMemoryErrors. I'm running Nutch using a slightly
modified version of the Crawler code that's included. Basically, I've
modified it to continuously crawl instead of stopping after a set number of
cycles.
where to find it
in Cloudera.
On Fri, Dec 16, 2011 at 11:38 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
What jobs exit with OOM? What is your heap size for the mapper and reducer?
On Friday 16 December 2011 17:13:45 Bai Shen wrote:
I've tried running Nutch in local, pseudo, and full
So I have Nutch running on a hadoop cluster with three data nodes. The
machines are all pretty beefy, but Nutch isn't performing any faster than
when I was running in pseudo mode on one machine.
How do I set Nutch in order to take full advantage of the cluster?
Thanks.
On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
This is overwhelmingly weighted towards Hadoop configuration.
There are some guidance notes on the Nutch wiki for performance issues
so you may wish to give them a try first.
--
Lewis
I'm assuming
I'm definitely interested in such a tool. We've been considering moving to
ElasticSearch, so if you need a tester, let me know.
On Mon, Dec 5, 2011 at 6:38 PM, Tim Pease tim.pe...@gmail.com wrote:
I am in the process of writing a new Nutch tool that will index documents
into the ElasticSearch
a
ClassNotFoundException.
On Mon, Nov 28, 2011 at 3:38 PM, Bai Shen baishen.li...@gmail.com wrote:
I've changed nutch to use the pseudo-distributed mode, but it keeps
erroring out that no agent is listed in the http.agent.name property. I
copied over my conf directory from local, but that didn't fix
On 28 November 2011 14:09, Bai Shen baishen.li...@gmail.com wrote:
We looked at the hadoop reporter and aren't sure how to access it with
nutch. Is there a certain way it works? Can you give me an example?
Thanks.
On Mon, Nov 21, 2011 at 3:11 PM, Markus Jelsma
markus.jel
On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Interesting. How do you tell if the segments have been fetched, etc?
after a job the shell script waits for its completion and return code. If it
returns 0 all is fine and we move it to another queue. If != 0
through the net.
On Mon, Oct 31, 2011 at 4:47 PM, Bai Shen baishen.li...@gmail.com
wrote:
We just did an ant clean and rebuilt nutch and we're still seeing the same
error in the logs.
On Fri, Oct 28, 2011 at 11:12 AM, Markus Jelsma
markus.jel
Can you give me an example of how I would set my URL filter to do this?
Right now I'm just using the default.
On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi
Write a regex URL filter and use it the next time you update the db; it will
disappear. Be sure
. The URLs will then be deleted.
On Thursday 10 November 2011 16:36:24 Bai Shen wrote:
Can you give me an example of how I would set my URL filter to do this?
Right now I'm just using the default.
On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi