Re: get partition key from tombstone warnings?

2015-01-22 Thread Paulo Ricardo Motta Gomes
Yep, you may register and log into the Apache JIRA and click "Vote for this
issue" in the upper right side of the ticket.

On Wed, Jan 21, 2015 at 11:30 PM, Ian Rose ianr...@fullstory.com wrote:

 Ah, thanks for the pointer Philip.  Is there any kind of formal way to
 vote up issues?  I'm assuming that adding a comment of +1 or the like
 is more likely to be *counter*productive.

 - Ian


 On Wed, Jan 21, 2015 at 5:02 PM, Philip Thompson 
 philip.thomp...@datastax.com wrote:

 There is an open ticket for this improvement at
 https://issues.apache.org/jira/browse/CASSANDRA-8561

 On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose ianr...@fullstory.com wrote:

 When I see a warning like "Read 9 live and 5769 tombstoned cells in ..."
 etc., is there a way for me to see the partition key that this query was
 operating on?

 The description in the original JIRA ticket (
 https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
 exposing this information was one of the original goals, but it isn't
 obvious to me in the logs...

 Cheers!
 - Ian






-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br*
+55 48 3232.3200


Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-22 Thread Kai Wang
In last year's summit there was a presentation from Instaclustr -
https://www.instaclustr.com/meetups/presentation-by-ben-bromhead-at-cassandra-summit-2014-san-francisco/.
It could be the solution you are looking for. However I don't see the code
being checked in or JIRA being created. So for now you'd better plan the
capacity carefully.

On Wed, Jan 21, 2015 at 11:21 PM, Yatong Zhang bluefl...@gmail.com wrote:

 Yes, my cluster is almost full and there are lots of pending tasks. You
 helped me a lot and thank you Eric~

 On Thu, Jan 22, 2015 at 11:59 AM, Eric Stevens migh...@gmail.com wrote:

 Yes, bootstrapping a new node will cause read loads on your existing
 nodes - it is becoming the owner and replica of a whole new set of existing
 data.  To do that it needs to know what data it's now responsible for, and
 that's what bootstrapping is for.

 If you're at the point where bootstrapping a new node is placing a
 too-heavy burden on your existing nodes, you may be dangerously close to or
 even past the tipping point where you ought to have already grown your
 cluster.  You need to grow your cluster as soon as possible, and chances
 are you're close to no longer being able to keep up with compaction (see
 nodetool compactionstats, make sure pending tasks is < 5, preferably 0 or
 1).  Once you're falling behind on compaction, it becomes difficult to
 successfully bootstrap new nodes, and you're in a very tough spot.


 On Wed, Jan 21, 2015 at 7:43 PM, Yatong Zhang bluefl...@gmail.com
 wrote:

 Thanks for the reply. The bootstrap of new node put a heavy burden on
 the whole cluster and I don't know why. So that's the issue I want to fix
 actually.

 On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens migh...@gmail.com wrote:

 Yes, but it won't do what I suspect you're hoping for.  If you disable
 auto_bootstrap in cassandra.yaml the node will join the cluster and will
 not stream any old data from existing nodes.

 The cluster will now be in an inconsistent state.  If you bring enough
 nodes online this way to violate your read consistency level (eg RF=3,
 CL=Quorum, if you bring on 2 nodes this way), some of your queries will be
 missing data that they ought to have returned.

 There is no way to bring a new node online and have it be responsible
 just for new data, and have no responsibility for old data.  It *will* be
 responsible for old data, it just won't *know* about the old data it
 should be responsible for.  Executing a repair will fix this, but only
 because the existing nodes will stream all the missing data to the new
 node.  This will create more pressure on your cluster than just normal
 bootstrapping would have.
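
 For reference, a minimal sketch of the sequence described above (assuming you
 really wanted to do this despite the caveats):

     # cassandra.yaml on the new node, before starting it
     auto_bootstrap: false

     # later, stream the data the node is actually responsible for
     nodetool repair <keyspace>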

 I can't think of any reason you'd want to do that unless you needed to
 grow your cluster really quickly, and were ok with corrupting your old 
 data.

 On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang bluefl...@gmail.com
 wrote:

 Hi there,

 I am using C* 2.0.10 and I was trying to add a new node to a
 cluster (actually to replace a dead node). But after adding the new node, some
 other nodes in the cluster had a very high workload, which affected the whole
 performance of the cluster.
 So I am wondering: is there a way to add a new node such that this node only
 serves new data?








Re: Re: Dynamic Columns

2015-01-22 Thread Peter Lin
@jack thanks for taking time to respond. I agree I could totally redesign
and rewrite it to fit in the newer CQL3 model, but are you really
recommending I throw 4 years of work out and completely rewrite code that
works and has been tested?

Ignoring the practical aspects for now and exploring the topic a bit
further. Since not everyone has spent 5+ years designing and building
temporal databases, it's probably good to go over some fundamental theory
at the risk of boring the hell out of everyone.

1. a temporal record ideally should have 1 unique key for its entire life.
I've done other approaches in the past with composite keys in RDBMS and it
sucks. Could it work? Yes, but I already know from first-hand experience
how much pain that causes when temporal records need to be moved around, or
when you need to audit the data for lawsuits.
2. the life of a temporal record may have n versions and no two versions
are guaranteed to be identical in structure and definitely not in content
3. the temporal metadata about each entity like version, previous version,
branch, previous branch, create date, last transaction and version counter
are required for each record. Those are the only required static columns
and they are managed by the framework. Users aren't supposed to manually
screw with those metadata columns, but obviously they could by going into
cqlsh.
4. at any time, import and export of a single temporal record with all
versions and metadata could occur, so optimal storage and design are
critical. For example, someone might want to copy or move 100,000 records
from one cluster to another cluster and retain the history.
5. the user can query for all or any number of versions of a temporal
record by version number or branch. For example, it's common to get version
10 and do a comparison against version 12. It's just like doing a diff in
SVN, git, cvs, but for business data. Unlike text files, a diff on business
records is a bit more complex, since it's an object graph.
6. versioning and branching of temporal records relies on business logic,
which can be the stock algorithm or defined by the user
7. Saving and retrieving data has to be predictable and quick. This means
ideally all versions are in the same row and on the same node. Before CQL and
composite keys, storing data in different rows meant it could end up on
different nodes. Thankfully, with composite keys, Cassandra will use the
first column as the partition key.
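
As a rough illustration of point 7 (a sketch of the idea, not the actual
schema), a composite primary key keeps every version of a record in a single
partition:

    CREATE TABLE temporal_record (
        record_id   uuid,       -- one stable key for the life of the record
        version     int,        -- clustering column, managed by the framework
        branch      text,
        created_at  timestamp,
        payload     blob,
        PRIMARY KEY (record_id, version)
    );

    -- every version of one record is served from a single partition/node
    SELECT * FROM temporal_record WHERE record_id = ?;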

In terms of adding dynamic_column_name as part of the composite key, that
isn't ideal in my use case for several reasons.

1. a record might not have any dynamic columns at all. The user decides
this. The only thing the framework requires is a unique key that doesn't
collide. If the user chooses their own key instead of a UUID, the system
checks for collision before saving a new record.

2. we use dynamic columns to provide projections, aka views of a temporal
entity. This means we can extract fields nested deep in the graph and store
them as dynamic columns to avoid reading the entire object. Unlike other
use cases for dynamic columns, the column name and value type will vary.
I know it's popular to use dynamic columns to store time series data like
user click streams, but there every column has the same type.

3. we allow the user to index secondary columns, but on read we always use
the value in the object. We also integrated solr to give us more advanced
indexing features.

4. we provide an object API to make temporal queries easy. It's
modeled/inspired by JPA/Hibernate. We could have invented another text
query language or tried to use tsql2, but an object API feels more
intuitive to me.

Could I fit a square peg into a round hole? Yes, but does that make any
sense? If I were building a whole new temporal database from scratch, I
might do things differently. CQL3 didn't exist back in 2008/2009, so I
couldn't have used it then. Aside from all of that, an object API is more
natural for temporal databases. The model really is an object graph and not
separate database tables stitched together. Any change to any part of the
record requires versioning it and handling it correctly. Having built
temporal databases on RDBMS, using SQL meant building a general purpose
object API to make things easier. This is due to the need to be database
agnostic, so we couldn't use the object APIs that are available in some
databases. Hopefully that provides context and details. I don't expect
people to have a deep understanding of temporal databases from my ramblings,
given it took me over 8 years to learn all of this stuff.


On Thu, Jan 22, 2015 at 12:51 AM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Peter,

 At least from your description, the proposed use of the clustering column
 name seems at first blush to fully fit the bill. The point is not that the
 resulting clustered primary key is used to reference an object, but that a
 SELECT on the partition key references the entire object, which will be a
 sequence of CQL3 

Cassandra 2.1.2, Pig 0.14, Hadoop 2.6.0 does not work together

2015-01-22 Thread Pinak Pani
I am using Pig with Cassandra (Cassandra 2.1.2, Pig 0.14, Hadoop 2.6.0
combo).

When I use CqlStorage() I get

org.apache.pig.backend.executionengine.ExecException: ERROR 2118:
org.apache.cassandra.exceptions.ConfigurationException: Unable to find
inputformat class 'org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat'

When I use CqlNativeStorage() I get

java.lang.NoSuchMethodError:
com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;

Pig classpath looks like this:

» echo $PIG_CLASSPATH

/home/naishe/apps/apache-cassandra-2.1.2/lib/airline-0.6.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/antlr-runtime-3.5.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/apache-cassandra-2.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/apache-cassandra-clientutil-2.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/apache-cassandra-thrift-2.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/commons-cli-1.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/commons-codec-1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/commons-lang3-3.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/commons-math3-3.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/compress-lzf-0.8.4.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/concurrentlinkedhashmap-lru-1.4.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/disruptor-3.0.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/
*guava-16.0.jar*
:/home/naishe/apps/apache-cassandra-2.1.2/lib/high-scale-lib-1.0.6.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jackson-core-asl-1.9.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jackson-mapper-asl-1.9.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jamm-0.2.8.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/javax.inject.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jbcrypt-0.3m.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jline-1.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jna-4.0.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/json-simple-1.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/libthrift-0.9.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/logback-classic-1.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/logback-core-1.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/lz4-1.2.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/metrics-core-2.2.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/netty-all-4.0.23.Final.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/reporter-config-2.1.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/slf4j-api-1.7.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/snakeyaml-1.11.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/snappy-java-1.0.5.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/stream-2.5.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/stringtemplate-4.0.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/super-csv-2.1.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/thrift-server-0.3.7.jar::/home/naishe/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.2/cassandra-driver-core-2.1.2.jar:/home/naishe/.m2/repository/org/apache/cassandra/cassandra-all/2.1.2/cassandra-all-2.1.2.jar

I have read somewhere that it is due to version conflict with Guava
library. So, I tried using Guava 11.0.2, that did not help. (
http://stackoverflow.com/questions/27089126/nosuchmethoderror-sets-newconcurrenthashset-while-running-jar-using-hadoop#comment42687234_27089126
)

Here is the Pig latin that I was trying to execute.

grunt> alice = LOAD 'cql://hadoop_test/lines' USING CqlNativeStorage();
2015-01-22 09:28:54,133 [main] INFO
 org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is
deprecated. Instead, use fs.defaultFS
grunt> B = foreach alice generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B) as word_count, group as word;
grunt> dump D;
2015-01-22 09:29:06,808 [main] INFO
 org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: GROUP_BY
[ -- snip -- ]
2015-01-22 09:29:11,254 [LocalJobRunner Map Task Executor #0] INFO
 org.apache.hadoop.mapred.MapTask - Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2015-01-22 09:29:11,588 [LocalJobRunner Map Task Executor #0] INFO
 org.apache.hadoop.mapred.MapTask - Starting flush of map output
2015-01-22 09:29:11,600 [Thread-22] INFO
 org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2015-01-22 09:29:11,620 [Thread-22] WARN
 org.apache.hadoop.mapred.LocalJobRunner - job_local1857630817_0001
java.lang.Exception: java.lang.NoSuchMethodError:
com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
at
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NoSuchMethodError:
com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;
at

Cassandra 2.1.2 - How to get node repair progress

2015-01-22 Thread Di, Jieming
Hi,

I am using incremental repair in Cassandra 2.1.2 right now. I am wondering if 
there is any API from which I can get the progress of the current repair job? 
That would be a great help. Thanks.

Regards,
-Jieming-



Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread SEGALIS Morgan
what do you mean by operating correctly ?
I only use dynamic columns if that helps...

2015-01-22 19:10 GMT+01:00 Robert Coli rc...@eventbrite.com:

 On Thu, Jan 22, 2015 at 9:36 AM, SEGALIS Morgan msega...@gmail.com
 wrote:

 So I wondered, does a nodetool repair make the server stop serving
 requests, or does it just use a lot of ressources but still serves request ?


 In pathological cases, repair can cause a node to seriously degrade. If
 you are operating correctly, it just uses lots of resources but still
 serves requests.

 =Rob
 http://twitter.com/rcolidba




-- 
Morgan SEGALIS


Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread SEGALIS Morgan
If I change the network topology, do I have to run repair right before adding
a new cluster?

I know that I should add 2 more nodes. So far, I'm preparing myself to
create a new node in a new DC, but on the same network (ping really low), so at
least I would have a backup server if anything happens; I just need to wrap
my head around the network topology and understand what I'm doing before
starting a new node.

2015-01-22 19:15 GMT+01:00 Flavien Charlon flavien.char...@gmail.com:

 I don't think you can do nodetool repair on a single node cluster.

 Still, one day or another you'll have to reboot your server, at which
 point your cluster will be down. If you want high availability, you should
 use a 3 nodes cluster with RF = 3.

 On 22 January 2015 at 18:10, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Jan 22, 2015 at 9:36 AM, SEGALIS Morgan msega...@gmail.com
 wrote:

 So I wondered, does a nodetool repair make the server stop serving
 requests, or does it just use a lot of ressources but still serves request ?


 In pathological cases, repair can cause a node to seriously degrade. If
 you are operating correctly, it just uses lots of resources but still
 serves requests.

 =Rob
 http://twitter.com/rcolidba





-- 
Morgan SEGALIS


Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread Tim Heckman
On Thu, Jan 22, 2015 at 10:22 AM, Jan cne...@yahoo.com wrote:
 Running a  'nodetool repair'  will 'not'  bring the node down.

It's not something that happens during normal operation. If something
goes sideways, and the resource usage climbs, a repair can definitely
cripple a node.

 Your question:
 does a nodetool repair make the server stop serving requests, or does it
 just use a lot of ressources but still serves request

 Answer: NO, the server will not stop serving requests.
 It will use some resources but not enough to affect the server serving
 requests.

I don't think this is right. I've personally seen repair operations
cause real bad things to happen to an entire Cassandra cluster. The
only mitigation was to shut that misbehaving node down and then normal
operations continued within the cluster.

 hope this helps
 Jan

Cheers!
-Tim


Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread SEGALIS Morgan
Thanks, this is a straight forward answer, exactly what I needed !

2015-01-22 19:22 GMT+01:00 Jan cne...@yahoo.com:

 Running a  'nodetool repair'  will 'not'  bring the node down.

 Your question:
 does a nodetool repair make the server stop serving requests, or does it
 just use a lot of ressources but still serves request

 Answer: NO, the server will not stop serving requests.
 It will use some resources but not enough to affect the server serving
 requests.

 hope this helps
 Jan





-- 
Morgan SEGALIS


Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread Robert Coli
On Thu, Jan 22, 2015 at 10:53 AM, SEGALIS Morgan msega...@gmail.com wrote:

 what do you mean by operating correctly ?


I mean that if you are operating near failure, repair might trip a node
into failure. But if you are operating correctly, repair should not.

=Rob


UDF and DevCenter

2015-01-22 Thread Andrew Cobley (Staff)

I’m not sure where to send “faults” for the DataStax Devcenter so I’ll send 
them here.  If I define a UDT such as:

CREATE TYPE if not exists sensorsync.SensorReading (

fValue float,
sValue text,
iValue  int
);

and a table

Create table if not exists sensorsync.Sensors(
name uuid,
insertion_time timestamp,
reading map<text, frozen<SensorReading>>,
Primary Key (name,insertion_time)
)

If I now want to insert data  but not use all the fields in the UDT DevCenter 
flags it as a fault.  So:

insert into sensorsync.Sensors (name,insertion_time,reading) values 
(7500e917-04b0-4697-ae7e-dbcdbf7415cb,'2015-01-01 
02:10:05',{'sensor':{iValue:101},'sensor1':{fValue:30.5}});

Works ok (runs in DevCenter and cqlsh) but DevCenter flags the missing values 
with an error.  Minor, but may throw people a curve.

Andy



The University of Dundee is a registered Scottish Charity, No: SC015096


Re: get partition key from tombstone warnings?

2015-01-22 Thread Philip Thompson
Ian,

Leaving a comment explaining your situation and how, as an operator of a
Cassandra Cluster, this would be valuable, would probably help most.

On Thu, Jan 22, 2015 at 6:06 AM, Paulo Ricardo Motta Gomes 
paulo.mo...@chaordicsystems.com wrote:

 Yep, you may register and log into the Apache JIRA and click "Vote for
 this issue" in the upper right side of the ticket.

 On Wed, Jan 21, 2015 at 11:30 PM, Ian Rose ianr...@fullstory.com wrote:

 Ah, thanks for the pointer Philip.  Is there any kind of formal way to
 vote up issues?  I'm assuming that adding a comment of +1 or the like
 is more likely to be *counter*productive.

 - Ian


 On Wed, Jan 21, 2015 at 5:02 PM, Philip Thompson 
 philip.thomp...@datastax.com wrote:

 There is an open ticket for this improvement at
 https://issues.apache.org/jira/browse/CASSANDRA-8561

 On Wed, Jan 21, 2015 at 4:55 PM, Ian Rose ianr...@fullstory.com wrote:

 When I see a warning like "Read 9 live and 5769 tombstoned cells in ..."
 etc., is there a way for me to see the partition key that this query was
 operating on?

 The description in the original JIRA ticket (
 https://issues.apache.org/jira/browse/CASSANDRA-6042) reads as though
 exposing this information was one of the original goals, but it isn't
 obvious to me in the logs...

 Cheers!
 - Ian






 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br*
 +55 48 3232.3200



RE: Retrieving all row keys of a CF

2015-01-22 Thread Ravi Agrawal
Hi,
I increased the range timeout and read timeout first to 50 secs, then to 500 secs, and 
the Astyanax client timeout to 60 and 550 secs respectively. I still get a timeout exception.
I see the logic with the .withCheckpointManager() code; is that the only way it 
could work?


From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Saturday, January 17, 2015 9:55 AM
To: user@cassandra.apache.org
Subject: Re: Retrieving all row keys of a CF

If you're getting partial data back, then failing eventually, try setting 
.withCheckpointManager() - this will let you keep track of the token ranges 
you've successfully processed, and not attempt to reprocess them.  This will 
also let you set up tasks on bigger data sets that take hours or days to run, 
and reasonably safely interrupt it at any time without losing progress.

This is some *very* old code, but I dug this out of a git history.  We don't 
use Astyanax any longer, but maybe an example implementation will help you.  
This is Scala instead of Java, but hopefully you can get the gist.

https://gist.github.com/MightyE/83a79b74f3a69cfa3c4e

If you're timing out talking to your cluster, then I don't recommend using the 
cluster to track your checkpoints, but some other data store (maybe just a 
flatfile).  Again, this is just to give you a sense of what's involved.

On Fri, Jan 16, 2015 at 6:31 PM, Mohammed Guller 
moham...@glassbeam.com wrote:
Both total system memory and heap size can’t be 8GB?

The timeout on the Astyanax client should be greater than the timeouts on the 
C* nodes, otherwise your client will timeout prematurely.

Also, have you tried increasing the timeout for the range queries to a higher 
number? It is not recommended to set them very high, because a lot of other 
problems may start happening, but then reading 800,000 partitions is not a 
normal operation.

Just as an experimentation, can you set the range timeout to 45 seconds on each 
node and the timeout on the Astyanax client to 50 seconds? Restart the nodes 
after increasing the timeout and try again.
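
A sketch of where those node-side settings live (values here are for the
experiment only, not a recommendation):

    # cassandra.yaml, on each node
    read_request_timeout_in_ms: 45000
    range_request_timeout_in_ms: 45000

The Astyanax client socket timeout would then be set a bit higher (50 seconds)
so the client does not give up before the nodes do.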

Mohammed

From: Ravi Agrawal 
[mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 5:11 PM

To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF


1) What is the heap size and total memory on each node? 8GB, 8GB
2) How big is the cluster? 4
3) What are the read and range timeouts (in cassandra.yaml) on the C* nodes? 10 secs, 10 secs
4) What are the timeouts for the Astyanax client? 2 secs
5) Do you see GC pressure on the C* nodes? How long does GC for new gen and old gen take? Occurs every 5 secs; don't see huge GC pressure, 50ms
6) Does any node crash with OOM error when you try AllRowsReader? No

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, January 16, 2015 7:30 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF

A few questions:


1)  What is the heap size and total memory on each node?

2)  How big is the cluster?

3)  What are the read and range timeouts (in cassandra.yaml) on the C* 
nodes?

4)  What are the timeouts for the Astyanax client?

5)  Do you see GC pressure on the C* nodes? How long does GC for new gen 
and old gen take?

6)  Does any node crash with OOM error when you try AllRowsReader?

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 4:14 PM
To: user@cassandra.apache.org
Subject: Re: Retrieving all row keys of a CF

Hi,
I and Ruchir tried query using AllRowsReader recipe but had no luck. We are 
seeing PoolTimeoutException.
SEVERE: [Thread_1] Error reading RowKeys
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: 
PoolTimeoutException: [host=servername, latency=2003(2003), attempts=4]Timed 
out waiting for connection
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:198)
   at 
com.netflix.astyanax.connectionpool.impl.RoundRobinExecuteWithFailover.borrowConnection(RoundRobinExecuteWithFailover.java:84)
   at 
com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:117)
   at 
com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:338)
   at 
com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$2.execute(ThriftColumnFamilyQueryImpl.java:397)
   at 
com.netflix.astyanax.recipes.reader.AllRowsReader$1.call(AllRowsReader.java:447)
   at 
com.netflix.astyanax.recipes.reader.AllRowsReader$1.call(AllRowsReader.java:419)
   at 

Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-22 Thread Ryan Svihla
Usually this is about tuning, and this isn't an uncommon situation for new
users.

Potential steps to take

1) Reduce stream throughput to a point that your cluster can handle.
This is probably your most important tool. The default throughput, depending
on version, is 200 or 400 megabits/sec; go ahead and drop it down further and
further. I've had to use as low as 15 megabits on all nodes to get a single
node bootstrapped. Use nodetool for a runtime change of this configuration
(see the sketch after this list):
http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsSetStreamThroughput.html

2) Scale up. If you run out of disk space on nodes and can't compact
anymore, then add more disk and change where the data is stored (make sure
your new disk is fast enough to keep up). If it's load, add more CPU and RAM.
3) Do some root cause analysis. I can't tell you how many of these issues
are bad JVM tuning or bad Cassandra settings.
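
As an illustration of point 1, the runtime change looks like this (a sketch;
the right value depends on your hardware, and it is usually applied on every
node involved in the streaming):

    # throttle streaming to 15 megabits/sec on this node
    nodetool setstreamthroughput 15

    # the permanent equivalent in cassandra.yaml
    stream_throughput_outbound_megabits_per_sec: 15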

On Thu, Jan 22, 2015 at 7:50 AM, Kai Wang dep...@gmail.com wrote:

 In last year's summit there was a presentation from Instaclustr -
 https://www.instaclustr.com/meetups/presentation-by-ben-bromhead-at-cassandra-summit-2014-san-francisco/.
 It could be the solution you are looking for. However I don't see the code
 being checked in or JIRA being created. So for now you'd better plan the
 capacity carefully.


 On Wed, Jan 21, 2015 at 11:21 PM, Yatong Zhang bluefl...@gmail.com
 wrote:

 Yes, my cluster is almost full and there are lots of pending tasks. You
 helped me a lot and thank you Eric~

 On Thu, Jan 22, 2015 at 11:59 AM, Eric Stevens migh...@gmail.com wrote:

 Yes, bootstrapping a new node will cause read loads on your existing
 nodes - it is becoming the owner and replica of a whole new set of existing
 data.  To do that it needs to know what data it's now responsible for, and
 that's what bootstrapping is for.

 If you're at the point where bootstrapping a new node is placing a
 too-heavy burden on your existing nodes, you may be dangerously close to or
 even past the tipping point where you ought to have already grown your
 cluster.  You need to grow your cluster as soon as possible, and chances
 are you're close to no longer being able to keep up with compaction (see
 nodetool compactionstats, make sure pending tasks is < 5, preferably 0 or
 1).  Once you're falling behind on compaction, it becomes difficult to
 successfully bootstrap new nodes, and you're in a very tough spot.


 On Wed, Jan 21, 2015 at 7:43 PM, Yatong Zhang bluefl...@gmail.com
 wrote:

 Thanks for the reply. The bootstrap of new node put a heavy burden on
 the whole cluster and I don't know why. So that's the issue I want to fix
 actually.

 On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens migh...@gmail.com
 wrote:

 Yes, but it won't do what I suspect you're hoping for.  If you disable
 auto_bootstrap in cassandra.yaml the node will join the cluster and will
 not stream any old data from existing nodes.

 The cluster will now be in an inconsistent state.  If you bring enough
 nodes online this way to violate your read consistency level (eg RF=3,
 CL=Quorum, if you bring on 2 nodes this way), some of your queries will be
 missing data that they ought to have returned.

 There is no way to bring a new node online and have it be responsible
 just for new data, and have no responsibility for old data.  It *will* be
 responsible for old data, it just won't *know* about the old data it
 should be responsible for.  Executing a repair will fix this, but only
 because the existing nodes will stream all the missing data to the new
 node.  This will create more pressure on your cluster than just normal
 bootstrapping would have.

 I can't think of any reason you'd want to do that unless you needed to
 grow your cluster really quickly, and were ok with corrupting your old 
 data.

 On Sat, Jan 10, 2015 at 12:39 AM, Yatong Zhang bluefl...@gmail.com
 wrote:

 Hi there,

 I am using C* 2.0.10 and I was trying to add a new node to a
 cluster (actually to replace a dead node). But after adding the new node, some
 other nodes in the cluster had a very high workload, which affected the whole
 performance of the cluster.
 So I am wondering: is there a way to add a new node such that this node only
 serves new data?









-- 

Thanks,
Ryan Svihla


Re: Fwd: ReadTimeoutException in Cassandra 2.0.11

2015-01-22 Thread Neha Trivedi
Hello Everyone,
Thanks very much for the input.

Here is my System info.
1. I have a single node cluster. (For testing)
2. I have 4GB memory on the server and am trying to process 200MB. (1GB is
allocated to Tomcat7, 1GB to Cassandra and 1GB to ActiveMQ; an nltk
server is also running)
3. We are using the 2.0.3 driver (this is one thing I can change and try)
4. 64.4 GB HDD
5. Attached Memory and CPU information.

Regards
Neha

On Fri, Jan 23, 2015 at 6:50 AM, Steve Robenalt sroben...@highwire.org
wrote:

 I agree with Rob. You shouldn't need to change the read timeout.

 We had similar issues with intermittent ReadTimeoutExceptions for a while
 when we ran Cassandra on underpowered nodes on AWS. We've also seen them
 when executing unconstrained queries with very large ResultSets (because it
 takes longer than the timeout to return results). If you can share more
 details about the hardware environment you are running your cluster on,
 there are many on the list who can tell you if they are underpowered or not
 (CPUs, memory, and disk/storage config are all important factors).

 You might also try running a newer version of the Java Driver (the later
 2.0.x drivers should all work with Cassandra 2.0.3), and I would also
 suggest moving to a newer (2.0.x) version of Cassandra if you have the
 option to do so. We had to move to Cassandra 2.0.5 some time ago from 2.0.3
 for an issue unrelated to the read timeouts.

 Steve


 On Thu, Jan 22, 2015 at 4:48 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Jan 22, 2015 at 4:19 PM, Asit KAUSHIK asitkaushikno...@gmail.com
  wrote:

 There are some values for read timeout  in Cassandra.yaml file and the
 default value is 3 ms change to a bigger value and that resolved our
 issue.

 Having to increase this value is often a strong signal you are Doing It
 Wrong. FWIW!

 =Rob




cat /proc/meminfo 
MemTotal:3838832 kB
MemFree:  176128 kB
Buffers:  172680 kB
Cached:   829556 kB
SwapCached: 8164 kB
Active:  1327288 kB
Inactive: 806532 kB
Active(anon): 697764 kB
Inactive(anon):   458364 kB
Active(file): 629524 kB
Inactive(file):   348168 kB
Unevictable: 1396340 kB
Mlocked: 1396340 kB
SwapTotal:   4194300 kB
SwapFree:4033800 kB
Dirty:40 kB
Writeback: 0 kB
AnonPages:   2524080 kB
Mapped:38052 kB
Shmem:   296 kB
Slab:  83824 kB
SReclaimable:  70124 kB
SUnreclaim:13700 kB
KernelStack:3008 kB
PageTables:11532 kB
NFS_Unstable:  0 kB
Bounce:0 kB
WritebackTmp:  0 kB
CommitLimit: 6113716 kB
Committed_AS:4239512 kB
VmallocTotal:   34359738367 kB
VmallocUsed:8784 kB
VmallocChunk:   34359711939 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
HugePages_Total:   0
HugePages_Free:0
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB
DirectMap4k: 3940352 kB
DirectMap2M:   0 kB

cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping: 4
microcode   : 0x415
cpu MHz : 2500.060
cache size  : 25600 KB
physical id : 1
siblings: 1
core id : 1
cpu cores   : 1
apicid  : 35
initial apicid  : 35
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu de tsc msr pae cx8 apic sep cmov pat clflush mmx fxsr sse 
sse2 ss ht syscall nx lm constant_tsc rep_good nopl pni pclmulqdq ssse3 cx16 
sse4_1 sse4_2 popcnt tsc_deadline_timer aes rdrand hypervisor lahf_lm ida arat 
epb pln pts dtherm fsgsbase erms
bogomips: 5000.12
clflush size: 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


Tombstone gc after gc grace seconds

2015-01-22 Thread Ravi Agrawal
Hi,
I want to trigger just a tombstone compaction after gc_grace_seconds has passed, 
not a full 'nodetool compact <keyspace> <column family>'.
Is there any way I can do that?

Thanks




Re: Fwd: ReadTimeoutException in Cassandra 2.0.11

2015-01-22 Thread Asit KAUSHIK
There are some values for read timeout in the cassandra.yaml file and the
default value is 3 ms; changing it to a bigger value resolved our
issue.
Hope this helps
Regards
Asit
On Jan 22, 2015 8:36 AM, Neha Trivedi nehajtriv...@gmail.com wrote:


 Hello All,
 I am trying to process a 200MB file. I am getting the following error. We are
 using (apache-cassandra-2.0.3.jar):
 com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout
 during read query at consistency ONE (1 responses were required but only 0
 replica responded)

 1. Is it due to memory?
 2. Is it related to driver?

 Initially when I was trying 15MB and it was throwing the same Exception
 but after that it started working.


 thanks
 regards
 neha






RE: Retrieving all row keys of a CF

2015-01-22 Thread Mohammed Guller
What is the average and max # of CQL rows in each partition? Is 800,000 the 
number of CQL rows or Cassandra partitions (storage engine rows)?

Another option you could try is a CQL statement to fetch all partition keys. 
You could first try this in the cqlsh:

“SELECT DISTINCT pk1, pk2…pkn FROM CF”

You will need to specify all the composite columns if you are using a composite 
partition key.
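
For example, with a hypothetical table whose composite partition key is
(sensor_id, day):

    SELECT DISTINCT sensor_id, day FROM readings;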

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Thursday, January 22, 2015 1:57 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF

Hi,
I increased the range timeout and read timeout first to 50 secs, then to 500 secs, and 
the Astyanax client timeout to 60 and 550 secs respectively. I still get a timeout exception.
I see the logic with the .withCheckpointManager() code; is that the only way it 
could work?


From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Saturday, January 17, 2015 9:55 AM
To: user@cassandra.apache.org
Subject: Re: Retrieving all row keys of a CF

If you're getting partial data back, then failing eventually, try setting 
.withCheckpointManager() - this will let you keep track of the token ranges 
you've successfully processed, and not attempt to reprocess them.  This will 
also let you set up tasks on bigger data sets that take hours or days to run, 
and reasonably safely interrupt it at any time without losing progress.

This is some *very* old code, but I dug this out of a git history.  We don't 
use Astyanax any longer, but maybe an example implementation will help you.  
This is Scala instead of Java, but hopefully you can get the gist.

https://gist.github.com/MightyE/83a79b74f3a69cfa3c4e

If you're timing out talking to your cluster, then I don't recommend using the 
cluster to track your checkpoints, but some other data store (maybe just a 
flatfile).  Again, this is just to give you a sense of what's involved.

On Fri, Jan 16, 2015 at 6:31 PM, Mohammed Guller 
moham...@glassbeam.com wrote:
Both total system memory and heap size can’t be 8GB?

The timeout on the Astyanax client should be greater than the timeouts on the 
C* nodes, otherwise your client will timeout prematurely.

Also, have you tried increasing the timeout for the range queries to a higher 
number? It is not recommended to set them very high, because a lot of other 
problems may start happening, but then reading 800,000 partitions is not a 
normal operation.

Just as an experimentation, can you set the range timeout to 45 seconds on each 
node and the timeout on the Astyanax client to 50 seconds? Restart the nodes 
after increasing the timeout and try again.

Mohammed

From: Ravi Agrawal 
[mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 5:11 PM

To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF


1) What is the heap size and total memory on each node? 8GB, 8GB
2) How big is the cluster? 4
3) What are the read and range timeouts (in cassandra.yaml) on the C* nodes? 10 secs, 10 secs
4) What are the timeouts for the Astyanax client? 2 secs
5) Do you see GC pressure on the C* nodes? How long does GC for new gen and old gen take? Occurs every 5 secs; don't see huge GC pressure, 50ms
6) Does any node crash with OOM error when you try AllRowsReader? No

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, January 16, 2015 7:30 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF

A few questions:


1)  What is the heap size and total memory on each node?

2)  How big is the cluster?

3)  What are the read and range timeouts (in cassandra.yaml) on the C* 
nodes?

4)  What are the timeouts for the Astyanax client?

5)  Do you see GC pressure on the C* nodes? How long does GC for new gen 
and old gen take?

6)  Does any node crash with OOM error when you try AllRowsReader?

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 4:14 PM
To: user@cassandra.apache.org
Subject: Re: Retrieving all row keys of a CF

Hi,
I and Ruchir tried query using AllRowsReader recipe but had no luck. We are 
seeing PoolTimeoutException.
SEVERE: [Thread_1] Error reading RowKeys
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: 
PoolTimeoutException: [host=servername, latency=2003(2003), attempts=4]Timed 
out waiting for connection
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:231)
   at 
com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:198)
   at 

Re: Fwd: ReadTimeoutException in Cassandra 2.0.11

2015-01-22 Thread Robert Coli
On Thu, Jan 22, 2015 at 4:19 PM, Asit KAUSHIK asitkaushikno...@gmail.com
wrote:

 There are some values for read timeout  in Cassandra.yaml file and the
 default value is 3 ms change to a bigger value and that resolved our
 issue.

Having to increase this value is often a strong signal you are Doing It
Wrong. FWIW!

=Rob


Re: Fwd: ReadTimeoutException in Cassandra 2.0.11

2015-01-22 Thread Steve Robenalt
I agree with Rob. You shouldn't need to change the read timeout.

We had similar issues with intermittent ReadTimeoutExceptions for a while
when we ran Cassandra on underpowered nodes on AWS. We've also seen them
when executing unconstrained queries with very large ResultSets (because it
takes longer than the timeout to return results). If you can share more
details about the hardware environment you are running your cluster on,
there are many on the list who can tell you if they are underpowered or not
(CPUs, memory, and disk/storage config are all important factors).

You might also try running a newer version of the Java Driver (the later
2.0.x drivers should all work with Cassandra 2.0.3), and I would also
suggest moving to a newer (2.0.x) version of Cassandra if you have the
option to do so. We had to move to Cassandra 2.0.5 some time ago from 2.0.3
for an issue unrelated to the read timeouts.

Steve


On Thu, Jan 22, 2015 at 4:48 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Jan 22, 2015 at 4:19 PM, Asit KAUSHIK asitkaushikno...@gmail.com
 wrote:

 There are some values for read timeout  in Cassandra.yaml file and the
 default value is 3 ms change to a bigger value and that resolved our
 issue.

 Having to increase this value is often a strong signal you are Doing It
 Wrong. FWIW!

 =Rob




Re: Cassandra 2.1.2, Pig 0.14, Hadoop 2.6.0 does not work together

2015-01-22 Thread Dave Brosius

The method

com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;

should be available in guava from 15.0 on. So guava-16.0 should be fine.

It's possible guava is being picked up from somewhere else. Do you have a 
global classpath variable set?


you might want to do

URL u = YourClass.class.getResource("/com/google/common/collect/Sets.class");
System.out.println(u);

to see where you are loading guava from.
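
Another quick way to find which jar on the Pig classpath actually bundles that
class (a shell sketch, assuming the $PIG_CLASSPATH shown earlier):

    # print every jar on the classpath that contains com.google.common.collect.Sets
    for j in $(echo "$PIG_CLASSPATH" | tr ':' ' '); do
      unzip -l "$j" 2>/dev/null | grep -q 'com/google/common/collect/Sets.class' && echo "$j"
    done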


On 01/22/2015 04:12 AM, Pinak Pani wrote:
I am using Pig with Cassandra (Cassandra 2.1.2, Pig 0.14, Hadoop 2.6.0 
combo).


When I use CqlStorage() I get

org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
org.apache.cassandra.exceptions.ConfigurationException: Unable to find 
inputformat class 'org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat'


When I use CqlNativeStorage() I get

java.lang.NoSuchMethodError: 
com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;


Pig classpath looks like this:

» echo $PIG_CLASSPATH

/home/naishe/apps/apache-cassandra-2.1.2/lib/airline-0.6.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/antlr-runtime-3.5.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/apache-cassandra-2.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/apache-cassandra-clientutil-2.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/apache-cassandra-thrift-2.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/commons-cli-1.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/commons-codec-1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/commons-lang3-3.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/commons-math3-3.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/compress-lzf-0.8.4.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/concurrentlinkedhashmap-lru-1.4.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/disruptor-3.0.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/*guava-16.0.jar*:/home/naishe/apps/apache-cassandra-2.1.2/lib/high-scale-lib-1.0.6.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jackson-core-asl-1.9.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jackson-mapper-asl-1.9.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jamm-0.2.8.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/javax.inject.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jbcrypt-0.3m.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jline-1.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/jna-4.0.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/json-simple-1.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/libthrift-0.9.1.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/logback-classic-1.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/logback-core-1.1.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/lz4-1.2.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/metrics-core-2.2.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/netty-all-4.0.23.Final.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/reporter-config-2.1.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/slf4j-api-1.7.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/snakeyaml-1.11.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/snappy-java-1.0.5.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/stream-2.5.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/stringtemplate-4.0.2.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/super-csv-2.1.0.jar:/home/naishe/apps/apache-cassandra-2.1.2/lib/thrift-server-0.3.7.jar::/home/naishe/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.2/cassandra-driver-core-2.1.2.jar:/home/naishe/.m2/repository/org/apache/cassandra/cassandra-all/2.1.2/cassandra-all-2.1.2.jar

I have read somewhere that it is due to version conflict with Guava 
library. So, I tried using Guava 11.0.2, that did not help. 
(http://stackoverflow.com/questions/27089126/nosuchmethoderror-sets-newconcurrenthashset-while-running-jar-using-hadoop#comment42687234_27089126)


Here is the Pig latin that I was trying to execute.

grunt> alice = LOAD 'cql://hadoop_test/lines' USING CqlNativeStorage();
2015-01-22 09:28:54,133 [main] INFO
 org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is
deprecated. Instead, use fs.defaultFS
grunt> B = foreach alice generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B) as word_count, group as word;
grunt> dump D;
2015-01-22 09:29:06,808 [main] INFO 
 org.apache.pig.tools.pigstats.ScriptState - Pig features used in the 
script: GROUP_BY

[ -- snip -- ]
2015-01-22 09:29:11,254 [LocalJobRunner Map Task Executor #0] INFO 
 org.apache.hadoop.mapred.MapTask - Map output collector class = 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2015-01-22 09:29:11,588 [LocalJobRunner Map Task Executor #0] INFO 
 org.apache.hadoop.mapred.MapTask - Starting flush of map output
2015-01-22 09:29:11,600 [Thread-22] INFO 
 org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2015-01-22 09:29:11,620 [Thread-22] WARN 
 

Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread SEGALIS Morgan
I don't think it is near failure; it uses only 3% of the CPU and 40% of the
RAM, if that is what you meant.

2015-01-22 19:58 GMT+01:00 Robert Coli rc...@eventbrite.com:

 On Thu, Jan 22, 2015 at 10:53 AM, SEGALIS Morgan msega...@gmail.com
 wrote:

 what do you mean by operating correctly ?


 I mean that if you are operating near failure, repair might trip a node
 into failure. But if you are operating correctly, repair should not.

 =Rob





-- 
Morgan SEGALIS


Cassandra row ordering best practice Modeling

2015-01-22 Thread SEGALIS Morgan
I have a column family that stores articles. I'll need to get those articles
from the most recent to the oldest, filtered by country, and of course with
the ability to limit the number of fetched articles.

I thought about another ColumnFamily, ArticlesByDateAndCountry, with dynamic
columns.

The key would be a mix of the 2-char country code (ISO 3166-1) and the
article day's date, so something like US-20150118 or FR-20141230
(XX-YYYYMMDD).

In those rows, the column name would be the timeuuid of the article, and the
value the article's ID.

It would probably get a thousand articles per day for each country.

Let's say I want to show only the 100 newest articles: I'll get today's
articles, and if that does not fill the request (too few articles),
I'll check the day before that, etc...

Is that the best practice, or does someone have a better idea for this
purpose?
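
A minimal CQL sketch of that model (table and column names are illustrative
only):

    CREATE TABLE articles_by_date_and_country (
        bucket      text,      -- e.g. 'US-20150118' (country code + day)
        article_id  timeuuid,
        article_ref text,      -- the article's ID in the main table
        PRIMARY KEY (bucket, article_id)
    ) WITH CLUSTERING ORDER BY (article_id DESC);

    -- newest articles for one country/day bucket, newest first
    SELECT article_id, article_ref FROM articles_by_date_and_country
    WHERE bucket = 'US-20150118' LIMIT 100;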


Re: Cassandra row ordering best practice Modeling

2015-01-22 Thread DuyHai Doan
Hello Morgan

 The data model looks reasonable. Bucketing by day will help you to scale.
The only thing I can see is how to go back in time to fetch articles from
previous buckets (previous days). Is it possible to have 0 articles for a
country for a day?


On Thu, Jan 22, 2015 at 8:23 PM, SEGALIS Morgan msega...@gmail.com wrote:

 Sorry, I copied/pasted the question from another platform where you don't
 generally say hello,

 So : Hello everyone,


 2015-01-22 20:19 GMT+01:00 SEGALIS Morgan msega...@gmail.com:

 I have a column family that store articles. I'll need to get those
 articles from the most recent to the oldest, getting them from Country, and
 of course the ability to limit the number of fetched articles.

 I though about another ColumnFamily ArticlesByDateAndCountry with
 dynamic columns

 The Key would a mix from the 2 Char country Code (ISO 3166-1), and the
 articles day's date so something like : US-20150118 or FR-20141230 --
 (XX-MMDD)

 In those Row, the column name would be the timeuuid of the article, and
 the value is the article's ID.

 It would probably get a thousand of articles per day for each country.

 Let's say I want to show only 100 of the newer articles, I'll get the
 today's articles, and if it does not fill the request (too few articles),
 I'll check the day before that, etc...

 Is that the best practice, or does someone has a better idea for this
 purpose ?




 --
 Morgan SEGALIS



Re: Cassandra row ordering best practice Modeling

2015-01-22 Thread DuyHai Doan
Well, if the current day bucket does not contain enough articles, you may
need to search back in the previous day. If the previous day does not have
any articles, you may need to go back in time a day before that ... and so on ...

 Of course it's a corner case, but I've seen code that misses this
scenario and ends up in an infinite loop back in time ...
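
A sketch of a bounded lookback that avoids that corner case (Java-ish,
illustrative only; fetchBucket is a hypothetical helper that reads one
country/day bucket):

    // cap both the article count and how many day buckets we scan
    List<Article> newestArticles(String country, int wanted, int maxDaysBack) {
        List<Article> result = new ArrayList<>();
        LocalDate day = LocalDate.now();
        for (int i = 0; i < maxDaysBack && result.size() < wanted; i++) {
            result.addAll(fetchBucket(country, day, wanted - result.size()));
            day = day.minusDays(1);   // bounded: never loops back indefinitely
        }
        return result;
    }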

On Thu, Jan 22, 2015 at 8:41 PM, SEGALIS Morgan msega...@gmail.com wrote:

 Hi DuyHai,

 if there is 0 article, the row will obviously not exist I guess... (no
 article insertion will create the row)
 What is bugging you exactly ?

 2015-01-22 20:33 GMT+01:00 DuyHai Doan doanduy...@gmail.com:

 Hello Morgan

  The data model looks reasonable. Bucketing by day will help you to
 scale. The only thing I can see is how to go back in time to fetch articles
 from previous buckets (previous days). It is possible to have 0 article for
 a country for a day ?


 On Thu, Jan 22, 2015 at 8:23 PM, SEGALIS Morgan msega...@gmail.com
 wrote:

 Sorry, I copied/pasted the question from another platform where you
 don't generally say hello,

 So : Hello everyone,


 2015-01-22 20:19 GMT+01:00 SEGALIS Morgan msega...@gmail.com:

 I have a column family that store articles. I'll need to get those
 articles from the most recent to the oldest, getting them from Country, and
 of course the ability to limit the number of fetched articles.

 I though about another ColumnFamily ArticlesByDateAndCountry with
 dynamic columns

 The Key would a mix from the 2 Char country Code (ISO 3166-1), and the
 articles day's date so something like : US-20150118 or FR-20141230 --
 (XX-MMDD)

 In those Row, the column name would be the timeuuid of the article, and
 the value is the article's ID.

 It would probably get a thousand of articles per day for each country.

 Let's say I want to show only 100 of the newer articles, I'll get the
 today's articles, and if it does not fill the request (too few articles),
 I'll check the day before that, etc...

 Is that the best practice, or does someone has a better idea for this
 purpose ?




 --
 Morgan SEGALIS





 --
 Morgan SEGALIS



Re: Cassandra row ordering best practice Modeling

2015-01-22 Thread SEGALIS Morgan
Sorry, I copied/pasted the question from another platform where you don't
generally say hello,

So : Hello everyone,


2015-01-22 20:19 GMT+01:00 SEGALIS Morgan msega...@gmail.com:

 I have a column family that store articles. I'll need to get those
 articles from the most recent to the oldest, getting them from Country, and
 of course the ability to limit the number of fetched articles.

 I though about another ColumnFamily ArticlesByDateAndCountry with
 dynamic columns

 The Key would a mix from the 2 Char country Code (ISO 3166-1), and the
 articles day's date so something like : US-20150118 or FR-20141230 --
 (XX-MMDD)

 In those Row, the column name would be the timeuuid of the article, and
 the value is the article's ID.

 It would probably get a thousand of articles per day for each country.

 Let's say I want to show only 100 of the newer articles, I'll get the
 today's articles, and if it does not fill the request (too few articles),
 I'll check the day before that, etc...

 Is that the best practice, or does someone has a better idea for this
 purpose ?




-- 
Morgan SEGALIS


Re: Cassandra row ordering best practice Modeling

2015-01-22 Thread SEGALIS Morgan
Hi DuyHai,

if there are 0 articles, the row will obviously not exist, I guess... (with no
articles, no insertion will create the row)
What is bugging you exactly ?

2015-01-22 20:33 GMT+01:00 DuyHai Doan doanduy...@gmail.com:

 Hello Morgan

  The data model looks reasonable. Bucketing by day will help you to scale.
 The only thing I can see is how to go back in time to fetch articles from
 previous buckets (previous days). It is possible to have 0 article for a
 country for a day ?


 On Thu, Jan 22, 2015 at 8:23 PM, SEGALIS Morgan msega...@gmail.com
 wrote:

 Sorry, I copied/pasted the question from another platform where you don't
 generally say hello,

 So : Hello everyone,


 2015-01-22 20:19 GMT+01:00 SEGALIS Morgan msega...@gmail.com:

 I have a column family that store articles. I'll need to get those
 articles from the most recent to the oldest, getting them from Country, and
 of course the ability to limit the number of fetched articles.

 I though about another ColumnFamily ArticlesByDateAndCountry with
 dynamic columns

 The Key would a mix from the 2 Char country Code (ISO 3166-1), and the
 articles day's date so something like : US-20150118 or FR-20141230 --
 (XX-MMDD)

 In those Row, the column name would be the timeuuid of the article, and
 the value is the article's ID.

 It would probably get a thousand of articles per day for each country.

 Let's say I want to show only 100 of the newer articles, I'll get the
 today's articles, and if it does not fill the request (too few articles),
 I'll check the day before that, etc...

 Is that the best practice, or does someone has a better idea for this
 purpose ?




 --
 Morgan SEGALIS





-- 
Morgan SEGALIS


Re: UDF and DevCenter

2015-01-22 Thread Alex Popescu
Thanks for the feedback Andy. I'll forward this to the DevCenter team.

Currently we have an email for sending feedback our way:
devcenter-feedb...@datastax.com. And the good news is that in the next
release there will be an integrated feedback form directly in DevCenter.

On Thu, Jan 22, 2015 at 8:15 AM, Andrew Cobley (Staff) 
a.e.cob...@dundee.ac.uk wrote:


 I’m not sure where to send “faults” for the DataStax Devcenter so I’ll
 send them here.  If I define a UDT such as:

  CREATE TYPE if not exists sensorsync.SensorReading (

 fValue float,
 sValue text,
 iValue  int
 );

  and a table

  Create table if not exists sensorsync.Sensors(
 name uuid,
 insertion_time timestamp,
 reading map<text, frozen<SensorReading>>,
 Primary Key (name,insertion_time)
 )

  If I now want to insert data  but not use all the fields in the UDT
 DevCenter flags it as a fault.  So:

   insert into sensorsync.Sensors (name,insertion_time,reading) values (
 7500e917-04b0-4697-ae7e-dbcdbf7415cb,'2015-01-01 02:10:05',{'sensor':{
 iValue:101},'sensor1':{fValue:30.5}});

  Works ok (runs in DevCenter and cqlsh) but DevCenter flags the missing
 values with an error.  Minor, but may throw people a curve.

  Andy



 The University of Dundee is a registered Scottish Charity, No: SC015096




-- 

[:-a)

Alex Popescu
Sen. Product Manager @ DataStax
@al3xandru


Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread Robert Coli
On Thu, Jan 22, 2015 at 9:36 AM, SEGALIS Morgan msega...@gmail.com wrote:

 So I wondered, does a nodetool repair make the server stop serving
 requests, or does it just use a lot of resources but still serve requests?


In pathological cases, repair can cause a node to seriously degrade. If you
are operating correctly, it just uses lots of resources but still serves
requests.

=Rob
http://twitter.com/rcolidba


Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread Flavien Charlon
I don't think you can do nodetool repair on a single-node cluster.

Still, one day or another you'll have to reboot your server, at which point
your cluster will be down. If you want high availability, you should use a
3-node cluster with RF = 3.
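
As a rough illustration of that setup (not from the thread; the keyspace name
and SimpleStrategy are assumptions, and NetworkTopologyStrategy is the usual
choice once more than one data center is involved):

import com.datastax.driver.core.Session;

public class ReplicationSketch {
    // With 3 nodes and RF = 3, every node holds a full replica, so any single
    // node can be rebooted or repaired while the other two keep serving requests.
    public static void createKeyspace(Session session) {
        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS articles_ks WITH replication = " +
            "{'class': 'SimpleStrategy', 'replication_factor': 3}");
    }
}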

On 22 January 2015 at 18:10, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Jan 22, 2015 at 9:36 AM, SEGALIS Morgan msega...@gmail.com
 wrote:

 So I wondered, does a nodetool repair make the server stop serving
 requests, or does it just use a lot of resources but still serve requests?


 In pathological cases, repair can cause a node to seriously degrade. If
 you are operating correctly, it just uses lots of resources but still
 serves requests.

 =Rob
 http://twitter.com/rcolidba



Re: Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread Jan
Running a 'nodetool repair' will 'not' bring the node down.

Your question: does a nodetool repair make the server stop serving requests, or
does it just use a lot of resources but still serve requests?

Answer: No, the server will not stop serving requests. It will use
some resources, but not enough to affect the server serving requests.

Hope this helps,
Jan



Re: Cassandra row ordering best practice Modeling

2015-01-22 Thread SEGALIS Morgan
Oh yeah, I thought about it; I even raised the point in the first mail:

Let's say I want to show only the 100 newest articles: I'll get
today's articles, and if that does not fill the request (too few articles),
I'll check the day before that, etc...

But your answer raised another issue I had not thought of before:
- going back over previous days, let's say I want the 100 newest articles
- if there is at most 1 article per day, and some days have 0, I will have to
do more than 100 queries to get all the posts; won't that be a little too much?

2015-01-22 20:47 GMT+01:00 DuyHai Doan doanduy...@gmail.com:

 well, if the current day bucket does not contain enough articles, you may
 need to search back in the previous day. If the previous day does not have
 any articles, you may need to go back in time a day further ... and so on ...

  Of course it's a corner case, but I've seen some code that misses this
 scenario and ends up in an infinite loop back in time ...

 On Thu, Jan 22, 2015 at 8:41 PM, SEGALIS Morgan msega...@gmail.com
 wrote:

 Hi DuyHai,

 if there are 0 articles, the row simply won't exist, I guess... (no
 article insertion means the row never gets created)
 What is bugging you exactly?

 2015-01-22 20:33 GMT+01:00 DuyHai Doan doanduy...@gmail.com:

 Hello Morgan

  The data model looks reasonable. Bucketing by day will help you to
 scale. The only thing I can see is how to go back in time to fetch articles
 from previous buckets (previous days). Is it possible to have 0 articles for
 a country on a given day?


 On Thu, Jan 22, 2015 at 8:23 PM, SEGALIS Morgan msega...@gmail.com
 wrote:

 Sorry, I copied/pasted the question from another platform where you
 don't generally say hello,

 So : Hello everyone,


 2015-01-22 20:19 GMT+01:00 SEGALIS Morgan msega...@gmail.com:

 I have a column family that stores articles. I'll need to get those
 articles from the most recent to the oldest, filtered by country, and
 of course be able to limit the number of fetched articles.

 I thought about another column family, ArticlesByDateAndCountry, with
 dynamic columns.

 The key would be a mix of the 2-char country code (ISO 3166-1) and the
 article's day date, so something like US-20150118 or FR-20141230
 (XX-YYYYMMDD).

 In those rows, the column name would be the timeuuid of the article, and
 the value the article's ID.

 It would probably get a thousand articles per day for each country.

 Let's say I want to show only the 100 newest articles: I'll get
 today's articles, and if that does not fill the request (too few articles),
 I'll check the day before that, etc...

 Is that the best practice, or does someone have a better idea for this
 purpose?




 --
 Morgan SEGALIS





 --
 Morgan SEGALIS





-- 
Morgan SEGALIS


Re: Cassandra row ordering best practice Modeling

2015-01-22 Thread DuyHai Doan
You get it :D

 This is the real issue. However, it's quite an extreme case. If you can
guarantee that there will be a minimum of X articles per day and per country,
the maximum number of requests to fetch 100 articles will be bounded.

 Furthermore, do not forget that a SELECT statement using a partition key
will leverage bloom filters, so in the case of a true negative (no article for
a day) Cassandra will not touch disk.
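
A minimal sketch of that bounded walk-back (not from the thread): the DataStax
Java driver stands in for Astyanax, the table and column names match the
earlier sketch, and the 30-day cap is an assumed guard against the
infinite-loop corner case mentioned above:

import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class NewestArticlesSketch {
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd
    private static final int MAX_LOOKBACK_DAYS = 30; // assumed bound: give up even if 'wanted' is never reached

    public static List<Long> newest(Session session, String country, int wanted) {
        List<Long> ids = new ArrayList<Long>();
        LocalDate day = LocalDate.now();
        for (int back = 0; back < MAX_LOOKBACK_DAYS && ids.size() < wanted; back++) {
            String bucket = country + "-" + day.format(DAY); // e.g. "FR-20141230"
            // Single-partition read; an empty bucket is rejected cheaply (bloom filter).
            for (Row row : session.execute(
                    "SELECT article_id FROM articles_by_date_and_country" +
                    " WHERE bucket = ? LIMIT " + (wanted - ids.size()),
                    bucket)) {
                ids.add(row.getLong("article_id"));
            }
            day = day.minusDays(1);
        }
        return ids;
    }
}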

On Thu, Jan 22, 2015 at 9:30 PM, SEGALIS Morgan msega...@gmail.com wrote:

 Oh yeah, I thought about it; I even raised the point in the first mail:

 Let's say I want to show only the 100 newest articles: I'll get
 today's articles, and if that does not fill the request (too few articles),
 I'll check the day before that, etc...

 But your answer raised another issue I had not thought of before:
 - going back over previous days, let's say I want the 100 newest articles
 - if there is at most 1 article per day, and some days have 0, I will have to
 do more than 100 queries to get all the posts; won't that be a little too much?

 2015-01-22 20:47 GMT+01:00 DuyHai Doan doanduy...@gmail.com:

 well, if the current day bucket does not contain enough articles, you may
 need to search back in the previous day. If the previous day does not have
 any articles, you may need to go back in time a day further ... and so on ...

  Of course it's a corner case, but I've seen some code that misses this
 scenario and ends up in an infinite loop back in time ...

 On Thu, Jan 22, 2015 at 8:41 PM, SEGALIS Morgan msega...@gmail.com
 wrote:

 Hi DuyHai,

 if there are 0 articles, the row simply won't exist, I guess... (no
 article insertion means the row never gets created)
 What is bugging you exactly?

 2015-01-22 20:33 GMT+01:00 DuyHai Doan doanduy...@gmail.com:

 Hello Morgan

  The data model looks reasonable. Bucketing by day will help you to
 scale. The only thing I can see is how to go back in time to fetch articles
 from previous buckets (previous days). Is it possible to have 0 articles for
 a country on a given day?


 On Thu, Jan 22, 2015 at 8:23 PM, SEGALIS Morgan msega...@gmail.com
 wrote:

 Sorry, I copied/pasted the question from another platform where you
 don't generally say hello,

 So : Hello everyone,


 2015-01-22 20:19 GMT+01:00 SEGALIS Morgan msega...@gmail.com:

 I have a column family that stores articles. I'll need to get those
 articles from the most recent to the oldest, filtered by country, and
 of course be able to limit the number of fetched articles.

 I thought about another column family, ArticlesByDateAndCountry, with
 dynamic columns.

 The key would be a mix of the 2-char country code (ISO 3166-1) and the
 article's day date, so something like US-20150118 or FR-20141230
 (XX-YYYYMMDD).

 In those rows, the column name would be the timeuuid of the article, and
 the value the article's ID.

 It would probably get a thousand articles per day for each country.

 Let's say I want to show only the 100 newest articles: I'll get
 today's articles, and if that does not fill the request (too few articles),
 I'll check the day before that, etc...

 Is that the best practice, or does someone have a better idea for this
 purpose?




 --
 Morgan SEGALIS





 --
 Morgan SEGALIS





 --
 Morgan SEGALIS



Does nodetool repair stop the node to answer requests ?

2015-01-22 Thread SEGALIS Morgan
I have been searching all over the documentation but could not find a straight
answer.

For a project I'm using a single-node Cassandra database (so far)... It has
always worked well, but I'm reading everywhere that I should do a nodetool
repair at least every week, especially if I delete rows, which I do.

The issue is, of course, that bringing down the single node means the service
will be down, and that's unfortunately not an option.

So I wondered, does a nodetool repair make the server stop serving
requests, or does it just use a lot of resources but still serve requests?

-- 
Morgan SEGALIS


RE: Retrieving all row keys of a CF

2015-01-22 Thread Ravi Agrawal
In each partition, the number of CQL rows is 200K on average; the max is 3M.
800K is the number of Cassandra partitions.


From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Thursday, January 22, 2015 7:43 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF

What is the average and max # of CQL rows in each partition? Is 800,000 the 
number of CQL rows or Cassandra partitions (storage engine rows)?

Another option you could try is a CQL statement to fetch all partition keys. 
You could first try this in the cqlsh:

“SELECT DISTINCT pk1, pk2…pkn FROM CF”

You will need to specify all the composite columns if you are using a composite 
partition key.
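
As an illustration only (not from the thread): the same DISTINCT query issued
through the DataStax Java driver with a modest fetch size, so the driver pages
through the result instead of pulling all 800K keys back in one response. The
table and key column names are assumptions.

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class AllPartitionKeysSketch {
    public static void printPartitionKeys(Session session) {
        // Assumed composite partition key (pk1, pk2), both text; list every component here.
        SimpleStatement stmt =
            new SimpleStatement("SELECT DISTINCT pk1, pk2 FROM my_ks.my_cf");
        stmt.setFetchSize(1000); // page size per round trip, not a limit on total rows
        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {
            System.out.println(row.getString("pk1") + " / " + row.getString("pk2"));
        }
    }
}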

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Thursday, January 22, 2015 1:57 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF

Hi,
I increased the range timeout and read timeout first to 50 secs, then to 500 secs,
and the Astyanax client timeouts to 60 and 550 secs respectively. I still get a
timeout exception.
I see the logic with the .withCheckpointManager() code; is that the only way it
could work?


From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Saturday, January 17, 2015 9:55 AM
To: user@cassandra.apache.org
Subject: Re: Retrieving all row keys of a CF

If you're getting partial data back, then failing eventually, try setting 
.withCheckpointManager() - this will let you keep track of the token ranges 
you've successfully processed, and not attempt to reprocess them.  This will 
also let you set up tasks on bigger data sets that take hours or days to run, 
and reasonably safely interrupt it at any time without losing progress.

This is some *very* old code, but I dug this out of a git history.  We don't 
use Astyanax any longer, but maybe an example implementation will help you.  
This is Scala instead of Java, but hopefully you can get the gist.

https://gist.github.com/MightyE/83a79b74f3a69cfa3c4e

If you're timing out talking to your cluster, then I don't recommend using the 
cluster to track your checkpoints, but some other data store (maybe just a 
flatfile).  Again, this is just to give you a sense of what's involved.

On Fri, Jan 16, 2015 at 6:31 PM, Mohammed Guller 
moham...@glassbeam.com wrote:
Both total system memory and heap size can’t be 8GB?

The timeout on the Astyanax client should be greater than the timeouts on the 
C* nodes, otherwise your client will timeout prematurely.

Also, have you tried increasing the timeout for the range queries to a higher 
number? It is not recommended to set them very high, because a lot of other 
problems may start happening, but then reading 800,000 partitions is not a 
normal operation.

Just as an experimentation, can you set the range timeout to 45 seconds on each 
node and the timeout on the Astyanax client to 50 seconds? Restart the nodes 
after increasing the timeout and try again.

Mohammed

From: Ravi Agrawal 
[mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 5:11 PM

To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF


1) What is the heap size and total memory on each node? 8GB, 8GB
2) How big is the cluster? 4 nodes
3) What are the read and range timeouts (in cassandra.yaml) on the C* nodes? 10 secs, 10 secs
4) What are the timeouts for the Astyanax client? 2 secs
5) Do you see GC pressure on the C* nodes? How long does GC for new gen and old gen take? GC occurs every 5 secs; we don't see huge GC pressure; ~50ms
6) Does any node crash with an OOM error when you try AllRowsReader? No

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Friday, January 16, 2015 7:30 PM
To: user@cassandra.apache.org
Subject: RE: Retrieving all row keys of a CF

A few questions:


1)  What is the heap size and total memory on each node?

2)  How big is the cluster?

3)  What are the read and range timeouts (in cassandra.yaml) on the C* 
nodes?

4)  What are the timeouts for the Astyanax client?

5)  Do you see GC pressure on the C* nodes? How long does GC for new gen 
and old gen take?

6)  Does any node crash with OOM error when you try AllRowsReader?

Mohammed

From: Ravi Agrawal [mailto:ragra...@clearpoolgroup.com]
Sent: Friday, January 16, 2015 4:14 PM
To: user@cassandra.apache.org
Subject: Re: Retrieving all row keys of a CF

Hi,
Ruchir and I tried the query using the AllRowsReader recipe but had no luck. We are
seeing a PoolTimeoutException.
SEVERE: [Thread_1] Error reading RowKeys
com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: 
PoolTimeoutException: [host=servername, latency=2003(2003), attempts=4]Timed 
out waiting for connection
   at