Re: nutch with cassandra internal network usage

2013-03-04 Thread Julien Nioche
Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator also does not have

Re: nutch with cassandra internal network usage

2013-03-04 Thread Roland
as filters to select a subset of all hbase records? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013

Re: nutch with cassandra internal network usage

2013-03-03 Thread Roland
. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator

Re: nutch with cassandra internal network usage

2013-02-22 Thread Lewis John Mcgibbney
lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator also does not have filters. Its mapper goes over all records as far

Re: nutch with cassandra internal network usage

2013-02-22 Thread Julien Nioche
, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator also does not have filters. Its mapper goes over all records as far as I know. If you use hadoop you can see how many records go

Re: nutch with cassandra internal network usage

2013-02-22 Thread Roland
hbase or sent to hbase as filters to select a subset of all hbase records? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage

Re: nutch with cassandra internal network usage

2013-02-22 Thread Roland
Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator also does not have filters

Re: nutch with cassandra internal network usage

2013-02-21 Thread Lewis John Mcgibbney
records? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11

Re: nutch with cassandra internal network usage

2013-02-21 Thread Roland
Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator also does not have filters. Its mapper goes over all

nutch with cassandra internal network usage

2013-02-20 Thread Roland
Hi list, we're experimenting with nutch 2.1 and cassandra 1.2.1 (on different hosts). Our cassandra 'webpage' store has about 31GB right now on disk; we add URLs by 'injecting' them, about 100k-300k per cycle. When starting a 'fetch' run, it now needs about an hour before the queues are set up

Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
Hi Roland, You say you start a fetch run, does this mean the FetcherJob or GeneratorJob? What kind of settings do you run your Nutch server with? On Wednesday, February 20, 2013, Roland rol...@rvh-gmbh.de wrote: Hi list, we're experimenting with nutch 2.1 and cassandra 1.2.1 (on different hosts).

Re: nutch with cassandra internal network usage

2013-02-20 Thread Roland
Hi Lewis, the GeneratorJob takes only ~5 minutes. I'm running it in standalone mode, like this: ./bin/nutch fetch 1361367698-1708119958 -threads 40 It's configured to fetch and parse, but it makes no difference if it only fetches: FetcherJob: starting FetcherJob: batchId: 1361367698-1708119958

Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
I am assuming that your generate.max.count property value is set to the default -1? Have you tried configuring more, smaller batchIds (fetch lists)? I don't have an immediate answer as to why, overall, the FetcherJob is taking this amount of time and resources. On Wednesday, February 20, 2013,

Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss
. Alex. -Original Message- From: Roland rol...@rvh-gmbh.de To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 10:56 am Subject: Re: nutch with cassandra internal network usage Hi Lewis, the GeneratorJob takes only ~5 minutes. I'm running it in standalone mode, like this: ./bin

Re: nutch with cassandra internal network usage

2013-02-20 Thread Roland
Hi Alex, the GeneratorJob seems to have a solution for that, if not it would iterate over all records too, am I right? --Roland Am 20.02.2013 20:42, schrieb alx...@aim.com: Hi, This is because fetch's mapper goes over all records and selects those that have the given batchId. Currently
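
The point about the fetch mapper reading every record and keeping only those with the given batchId can be illustrated with a small sketch. This is not Nutch or Gora code: the record layout, the function names `scan_and_filter` and `build_batch_index`, and the sample data are all hypothetical, and the index is only one conceivable way a store could avoid the full read. It shows why the cost grows with the size of the whole store rather than with the size of the batch.

```python
from collections import defaultdict


def scan_and_filter(records, batch_id):
    """Read every record and keep only those marked with batch_id.

    Returns the selected URLs and the number of records that had to be read.
    """
    read = 0
    selected = []
    for rec in records:
        read += 1  # every record is read, whether it matches or not
        if rec["batch_id"] == batch_id:
            selected.append(rec["url"])
    return selected, read


def build_batch_index(records):
    """Hypothetical alternative: group URLs by batch id ahead of time."""
    index = defaultdict(list)
    for rec in records:
        index[rec["batch_id"]].append(rec["url"])
    return index


# Made-up store: 1000 records, of which every 100th belongs to batch "b1".
records = [
    {"url": f"http://example.com/{i}", "batch_id": "b1" if i % 100 == 0 else "b0"}
    for i in range(1000)
]

selected, read = scan_and_filter(records, "b1")
index = build_batch_index(records)
print(f"selected {len(selected)} URLs after reading {read} records")
```

With a filter applied on the store side, only the matching records would need to cross the network, which is the kind of saving the question about filters in this thread is asking for.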

Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss
-gmbh.de To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 11:47 am Subject: Re: nutch with cassandra internal network usage Hi Alex, the GeneratorJob seems to have a solution for that, if not it would iterate over all records too, am I right? --Roland Am 20.02.2013 20:42, schrieb alx

Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
Hi, Please head over to most recent thread on dev@ for potential improvements for the Generator* code. Thanks for invoking this discussion, it is well overdue. Lewis On Wed, Feb 20, 2013 at 12:55 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alex, On Wed, Feb 20, 2013 at

Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss
Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator also does not have filters. Its mapper goes over all records as far as I know. If you use hadoop you can see how many records go as input to mappers. Also

Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 20, 2013 12:56 pm Subject: Re: nutch with cassandra internal network usage Hi Alex, On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote: The generator also does not have filters