Re: How to speed up SELECT * query in Cassandra

2015-02-16 Thread mck

> Could you please share how much data you store on the cluster and what
> is the HW configuration of the nodes? 


These nodes are dedicated HW: 24 CPUs and 50GB RAM.
Each node has a few TBs of data (you don't want to go over this) in
RAID 50 (we're migrating over to JBOD).
Each c* node is running 2.0.11 on jdk1.7.0_55 and is configured with an
8GB heap and a 2GB new generation.

Hadoop (2.2.0) tasktrackers and dfs run on these nodes as well; all up
they use up to 12GB RAM, leaving ~30GB for the kernel and page cache.
Data-locality is an important goal: in the worst-case scenarios we've
seen, it can mean a fourfold throughput benefit.

HDFS, which for us is a volatile hadoop-internals space, is on SSDs,
providing strong m/r performance.
 (The commitlog of course is also on SSD. We made the mistake of putting
 it on the same SSD to begin with; don't do that, the commitlog should
 get its own SSD.)


> I am really impressed that you are
> able to read 100M records in ~4 minutes on 4 nodes. That works out to
> something like 100k reads per second per node, which is something we are
> quite far away from.


These are not individual reads, and not the number of partition keys,
but m/r records (i.e. CQL rows).
But yes, the performance of spark against cassandra is impressive.


> It leads me to question whether reading from Spark goes through
> Cassandra's JVM, and thus the normal read path, or whether it reads the
> sstables directly from disk sequentially and possibly filters out
> old/tombstoned values by itself?


Both the Hadoop-Cassandra integration and the Spark-Cassandra connector
go through the normal read path, like all CQL read queries.

With our m/r jobs each task works with just one partition key, doing
repeated column-slice reads through that partition key according to the
ConfigHelper.rangeBatchSize setting, which we have set to 100. These
hadoop jobs use a custom-written CqlInputFormat due to the poor
performance CqlInputFormat has today against a vnodes setup; the
customisation we have is pretty much the same as the patch on offer in
CASSANDRA-6091.
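
For reference, a minimal sketch of that part of the job configuration,
assuming Cassandra 2.0's Hadoop ConfigHelper API (the contact point,
keyspace, and table names here are hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.cassandra.hadoop.ConfigHelper

val conf = new Configuration()
ConfigHelper.setInputInitialAddress(conf, "10.0.0.1")
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner")
ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "raw_events")
// rows fetched per column-slice read within the current partition key
ConfigHelper.setRangeBatchSize(conf, 100)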

We haven't experienced this vnodes problem with the spark
connector.
I presume that, like the hadoop integration, spark also bulk reads
(column slices) from each partition key.
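
As a rough illustration, a full-table scan with the spark connector looks
something like this (a sketch assuming the connector's RDD API; host and
table names are made up):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val sparkConf = new SparkConf()
  .setAppName("full-table-scan")
  .set("spark.cassandra.connection.host", "10.0.0.1")
val sc = new SparkContext(sparkConf)

// the connector splits the scan by token range, so each task reads the
// partition keys local to the node it runs on
val rows = sc.cassandraTable("my_keyspace", "raw_events")
println(rows.count())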

Otherwise this is useful reading
http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting


> This is also a cluster that serves requests to web applications that
> need low latency.

Let it be said this isn't something i'd recommend, just the path we had
to take because of our small initial dedicated-HW cluster.
(You really want to separate online and offline datacenters, so that you
can maximise the offline clusters for the heavy batch reads).

~mck


Re: How to speed up SELECT * query in Cassandra

2015-02-16 Thread Jiri Horky
Hi,

thanks for the reference, I really appreciate that you shared your
experience.

Could you please share how much data you store on the cluster and what
is the HW configuration of the nodes? I am really impressed that you are
able to read 100M records in ~4 minutes on 4 nodes. That works out to
something like 100k reads per second per node, which is something we are
quite far away from.

It leads me to question whether reading from Spark goes through
Cassandra's JVM, and thus the normal read path, or whether it reads the
sstables directly from disk sequentially and possibly filters out
old/tombstoned values by itself? If it is the latter, then I understand
why it can perform that well.

Thank you.

Regards
Jirka H.

On 02/14/2015 09:17 PM, mck wrote:
> Jirka,
>
>> But I am really interested in how it can work well with Spark/Hadoop where
>> you basically need to read all the data as well (as far as I understand
>> it).
>
> I can't give you any benchmarking between technologies (nor am i
> particularly interested in getting involved in such a discussion) but i
> can share our experiences with Cassandra, Hadoop, and Spark, over the
> past 4+ years, and hopefully assure you that Cassandra+Spark is a smart
> choice.
>
> On a four-node cluster we were running 5000+ small hadoop jobs each day,
> each finishing within two minutes, often within one, resulting in
> (give or take) a billion records read from and 150 million records
> written to c*.
> These small jobs incrementally process limited partition key sets each
> time. They are primarily reading data from a "raw events store" that
> has a TTL of 3 months and 22+GB of tombstones a day (reads over old
> partition keys are rare).
>
> We also run full-table-scan jobs and have never come across any issues
> particular to that. There are hadoop map/reduce settings to increase
> durability if you have tables with troublesome partition keys.
>
> This is also a cluster that serves requests to web applications that
> need low latency.
>
> We recently wrote a spark job that does full table scans over 100
> million+ rows, involves a handful of stages (two tables, 9 maps, 4
> reduces, and 2 joins), and writes 5 million rows back to a new table.
> This job runs in ~260 seconds.
>
> Spark is becoming a natural complement to schema evolution for
> cassandra, something you'll want to do to keep your schema optimised
> against your read request patterns, even for little things like
> switching clustering keys around.
>
> With any new technology, hitting some hurdles (especially if you go
> wandering outside recommended practices) will of course be part of the
> game, but that said I've only had positive experiences with this
> community's ability to help out (and do so quickly).
>
> Starting from scratch i'd use Spark (in scala) over Hadoop, no questions
> asked.
> Otherwise Cassandra has always been our 'big data' platform,
> hadoop/spark is just an extra tool on top.
> We've never kept data in hdfs and are very grateful for having made that
> choice.
>
> ~mck
>
> ref
> https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/



Re: How to speed up SELECT * query in Cassandra

2015-02-14 Thread mck
Jirka,

> But I am really interested in how it can work well with Spark/Hadoop where
> you basically need to read all the data as well (as far as I understand
> it).


I can't give you any benchmarking between technologies (nor am i
particularly interested in getting involved in such a discussion) but i
can share our experiences with Cassandra, Hadoop, and Spark, over the
past 4+ years, and hopefully assure you that Cassandra+Spark is a smart
choice.

On a four-node cluster we were running 5000+ small hadoop jobs each day,
each finishing within two minutes, often within one, resulting in
(give or take) a billion records read from and 150 million records
written to c*.
These small jobs incrementally process limited partition key sets each
time. They are primarily reading data from a "raw events store" that
has a TTL of 3 months and 22+GB of tombstones a day (reads over old
partition keys are rare).
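
As an illustration, a minimal sketch of what such a TTL'd table could
look like (hypothetical schema; 3 months ≈ 7776000 seconds; assumes the
DataStax java driver):

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
val session = cluster.connect()

// every write inherits a 3-month TTL; expired cells surface as tombstones
session.execute("""
  CREATE TABLE IF NOT EXISTS my_keyspace.raw_events (
    day text,
    ts timeuuid,
    payload blob,
    PRIMARY KEY (day, ts)
  ) WITH default_time_to_live = 7776000""")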

We also run full-table-scan jobs and have never come across any issues
particular to that. There are hadoop map/reduce settings to increase
durability if you have tables with troublesome partition keys.

This is also a cluster that serves requests to web applications that
need low latency.

We recently wrote a spark job that does full table scans over 100
million+ rows, involves a handful of stages (two tables, 9 maps, 4
reduces, and 2 joins), and writes 5 million rows back to a new table.
This job runs in ~260 seconds.

Spark is becoming a natural complement to schema evolution for
cassandra, something you'll want to do to keep your schema optimised
against your read request patterns, even for little things like
switching clustering keys around.
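
A sketch of that kind of migration with the spark connector (table and
column names are hypothetical; assumes an existing SparkContext sc):

import com.datastax.spark.connector._

// copy events into a table whose clustering keys are ordered differently,
// so the same data can serve a different read pattern
sc.cassandraTable[(String, String, String)]("my_keyspace", "events_by_user")
  .saveToCassandra("my_keyspace", "events_by_time",
    SomeColumns("user_id", "event_time", "payload"))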

With any new technology, hitting some hurdles (especially if you go
wandering outside recommended practices) will of course be part of the
game, but that said I've only had positive experiences with this
community's ability to help out (and do so quickly).

Starting from scratch i'd use Spark (in scala) over Hadoop, no questions
asked.
Otherwise Cassandra has always been our 'big data' platform,
hadoop/spark is just an extra tool on top.
We've never kept data in hdfs and are very grateful for having made that
choice.

~mck

ref
https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/


Re: How to speed up SELECT * query in Cassandra

2015-02-13 Thread Jens Rantil
If you are using Spark you need to be _really_ careful about your
tombstones. In our experience a single partition with too many tombstones
can take down the whole batch job (until something like
https://issues.apache.org/jira/browse/CASSANDRA-8574 is fixed). This was a
major obstacle for us to overcome when using Spark.

Cheers,
Jens

On Wed, Feb 11, 2015 at 5:12 PM, Jiri Horky  wrote:

>  Well, I always wondered how Cassandra can be used in a Hadoop-like
> environment where you basically need to do full table scans.
>
> I need to say that our experience is that cassandra is perfect for
> writing and for reading specific values by key, but definitely not for
> reading all of the data out of it. Some of our projects found out that
> doing that with a non-trivial amount of data in a timely manner is close
> to impossible in many situations. We are slowly moving to storing the
> data in HDFS and possibly reprocessing it on a daily basis for such use
> cases (statistics).
>
> This is nothing against Cassandra; it cannot be perfect for everything.
> But I am really interested in how it can work well with Spark/Hadoop where
> you basically need to read all the data as well (as far as I understand it).
>
> Jirka H.
>
>
> On 02/11/2015 01:51 PM, DuyHai Doan wrote:
>
> "The very nature of cassandra's distributed nature vs partitioning data
> on hadoop makes spark on hdfs actually fasted than on cassandra"
>
>  Prove it. Did you ever have a look into the source code of the
> Spark/Cassandra connector to see how data locality is achieved before
> throwing out such statement ?
>
> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <
> mvallemil...@bloomberg.net> wrote:
>
>>  > cassandra makes a very poor datawarehouse or long term time series
>> store
>>
>>  Really? This is not the impression I have... I think Cassandra is good
>> for storing large amounts of data and historical information; it's only
>> not good for storing temporary data.
>> Netflix has a large amount of data and it's all stored in Cassandra,
>> AFAIK.
>>
>>  > The very nature of cassandra's distributed nature vs partitioning
>> data on hadoop makes spark on hdfs actually faster than on cassandra.
>>
>>  I am not sure about the current state of Spark support for Cassandra,
>> but I guess if you create a map/reduce job, the intermediate map results
>> will still be stored in HDFS, as happens with hadoop, is this right? I
>> think the problem with Spark + Cassandra or with Hadoop + Cassandra is that
>> the hard part spark or hadoop does, the shuffling, could be done out of the
>> box with Cassandra, but no one takes advantage of that. What if a map /
>> reduce job used a temporary CF in Cassandra to store intermediate results?
>>
>>   From: user@cassandra.apache.org
>> Subject: Re: How to speed up SELECT * query in Cassandra
>>
>> I use spark with cassandra, and you don't need DSE.
>>
>>  I see a lot of people ask this same question below (how do I get a lot
>> of data out of cassandra?), and my question is always, why aren't you
>> updating both places at once?
>>
>>  For example, we use hadoop and cassandra in conjunction with each
>> other; we use a message bus to store every event in both, aggregate in
>> both, but only keep current data in cassandra (cassandra makes a very poor
>> datawarehouse or long term time series store) and then use services to
>> process queries that merge data from hadoop and cassandra.
>>
>>  Also, spark on hdfs gives more flexibility in terms of large datasets
>> and performance.  The very nature of cassandra's distributed nature vs
>> partitioning data on hadoop makes spark on hdfs actually faster than on
>> cassandra
>>
>>
>>
>> --
>> Colin Clark
>> +1 612 859 6129
>> Skype colin.p.clark
>>
>> On Feb 11, 2015, at 4:49 AM, Jens Rantil  wrote:
>>
>>
>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <
>> mvallemil...@bloomberg.net> wrote:
>>
>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>
>>
>> Even better, you can use Spark/Shark with DSE.
>>
>>  Cheers,
>> Jens
>>
>>
>>  --
>>  Jens Rantil
>> Backend engineer
>> Tink AB
>>
>>  Email: jens.ran...@tink.se
>> Phone: +46 708 84 18 32
>> Web: www.tink.se
>>
>>
>>
>>
>
>


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se



Re: How to speed up SELECT * query in Cassandra

2015-02-12 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks Jirka!

From: user@cassandra.apache.org 
Subject: Re: How to speed up SELECT * query in Cassandra

Hi,

here are some snippets of code in scala which should get you started.

Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None      => initialQuery(lowerLimit)
  }
  session.execute(query).all
}


private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}


private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: {}", "<" + start.underlying.toPlainString +
        ", " + end.underlying.toPlainString + ">")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName,
    columnKeyName,
    columnValueName,
    columnValueName,
    columnFamily,
    "%s", // template slot, filled in later with the token condition
    whereCondition,
    pageSize,
    if (cqlAllowFiltering) " allow filtering" else ""
  )
}


case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}

On 02/11/2015 02:21 PM, Ja Sam wrote:

Your answer looks very promising.

How do you calculate start and stop?

On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky wrote:

The fastest way I am aware of is to do the queries in parallel to
multiple cassandra nodes and make sure that you only ask them for keys
they are responsible for. Otherwise, the node needs to resend your query,
which is much slower and creates unnecessary objects (and thus GC
pressure).

You can manually take advantage of the token range information, if the
driver does not take this into account for you. Then, you can play with
concurrency and batch size of a single query against one node.
Basically, what you/the driver should do is to transform the query into a
series of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)" queries.

I will need to look up the actual code, but the idea should be clear :)

Jirka H.

On 02/11/2015 11:26 AM, Ja Sam wrote:
> Is there a simple way (or even a complicated one) how I can speed up
> SELECT * FROM [table] query?
> I need to get all rows from one table every day. I split tables, and
> create one for each day, but still the query is quite slow (200 million
> records)

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Hi,

here are some snippets of code in scala which should get you started.

Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None      => initialQuery(lowerLimit)
  }
  session.execute(query).all
}


private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}


private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: {}", "<" + start.underlying.toPlainString +
        ", " + end.underlying.toPlainString + ">")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName,
    columnKeyName,
    columnValueName,
    columnValueName,
    columnFamily,
    "%s", // template slot, filled in later with the token condition
    whereCondition,
    pageSize,
    if (cqlAllowFiltering) " allow filtering" else ""
  )
}


case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}
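
A hedged sketch of how these pieces could be driven together (it assumes
the helpers above plus a connected java-driver session; each page's rows
would be consumed where the execute result is read):

val (_, _, ranges) = calculateRanges
// scan each token range independently; a parallel collection is the
// simplest way to fan out, one worker per range
ranges.par.foreach { case (lower, upper) =>
  var lastRow: Option[Row] = None
  var morePages = true
  while (morePages) {
    val query = lastRow match {
      case Some(row) => nextPageQuery(row, upper.toBigInt.toString)
      case None      => initialQuery(lower.toBigInt.toString)
    }
    val page = session.execute(query).all
    if (page.isEmpty) morePages = false
    else lastRow = Some(page.get(page.size - 1))
  }
}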


On 02/11/2015 02:21 PM, Ja Sam wrote:
> Your answer looks very promising.
>
> How do you calculate start and stop?
>
> On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky wrote:
>
> The fastest way I am aware of is to do the queries in parallel to
> multiple cassandra nodes and make sure that you only ask them for keys
> they are responsible for. Otherwise, the node needs to resend your
> query, which is much slower and creates unnecessary objects (and thus
> GC pressure).
>
> You can manually take advantage of the token range information, if the
> driver does not take this into account for you. Then, you can play with
> concurrency and batch size of a single query against one node.
> Basically, what you/the driver should do is to transform the query into
> a series of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)" queries.
>
> I will need to look up the actual code, but the idea should be
> clear :)
>
> Jirka H.
>
>
> On 02/11/2015 11:26 AM, Ja Sam wrote:
> > Is there a simple way (or even a complicated one) how I can speed up
> > SELECT * FROM [table] query?
> > I need to get all rows from one table every day. I split tables, and
> > create one for each day, but still the query is quite slow (200
> > million records)
> >
> > I was thinking about running this query in parallel, but I don't know
> > if it is possible
>
>



Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Well, I always wondered how Cassandra can be used in a Hadoop-like
environment where you basically need to do full table scans.

I need to say that our experience is that cassandra is perfect for
writing and for reading specific values by key, but definitely not for
reading all of the data out of it. Some of our projects found out that
doing that with a non-trivial amount of data in a timely manner is close
to impossible in many situations. We are slowly moving to storing the
data in HDFS and possibly reprocessing it on a daily basis for such use
cases (statistics).

This is nothing against Cassandra; it cannot be perfect for everything.
But I am really interested in how it can work well with Spark/Hadoop where
you basically need to read all the data as well (as far as I understand
it).

Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:
> "The very nature of cassandra's distributed nature vs partitioning
> data on hadoop makes spark on hdfs actually fasted than on cassandra"
>
> Prove it. Did you ever have a look into the source code of the
> Spark/Cassandra connector to see how data locality is achieved before
> throwing out such statement ?
>
> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON)
> <mvallemil...@bloomberg.net> wrote:
>
> > cassandra makes a very poor datawarehouse or long term time series store
>
> Really? This is not the impression I have... I think Cassandra is
> good for storing large amounts of data and historical information;
> it's only not good for storing temporary data.
> Netflix has a large amount of data and it's all stored in
> Cassandra, AFAIK.
>
> > The very nature of cassandra's distributed nature vs partitioning data
> > on hadoop makes spark on hdfs actually faster than on cassandra.
>
> I am not sure about the current state of Spark support for
> Cassandra, but I guess if you create a map/reduce job, the
> intermediate map results will still be stored in HDFS, as happens
> with hadoop, is this right? I think the problem with Spark +
> Cassandra or with Hadoop + Cassandra is that the hard part spark
> or hadoop does, the shuffling, could be done out of the box with
> Cassandra, but no one takes advantage of that. What if a map /
> reduce job used a temporary CF in Cassandra to store intermediate
> results?
>
> From: user@cassandra.apache.org
> Subject: Re: How to speed up SELECT * query in Cassandra
>
> I use spark with cassandra, and you don't need DSE.
>
> I see a lot of people ask this same question below (how do I
> get a lot of data out of cassandra?), and my question is
> always, why aren't you updating both places at once?
>
> For example, we use hadoop and cassandra in conjunction with
> each other; we use a message bus to store every event in both,
> aggregate in both, but only keep current data in cassandra
> (cassandra makes a very poor datawarehouse or long term time
> series store) and then use services to process queries that
> merge data from hadoop and cassandra.
>
> Also, spark on hdfs gives more flexibility in terms of large
> datasets and performance.  The very nature of cassandra's
> distributed nature vs partitioning data on hadoop makes spark
> on hdfs actually faster than on cassandra
>
>
>
> --
> Colin Clark
> +1 612 859 6129
> Skype colin.p.clark
>
> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>
>>
>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/
>> LONDON) <mvallemil...@bloomberg.net> wrote:
>>
>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>
>>
>> Even better, you can use Spark/Shark with DSE.
>>
>> Cheers,
>> Jens
>>
>>
>> -- 
>> Jens Rantil
>> Backend engineer
>> Tink AB
>>
>> Email: jens.ran...@tink.se
>> Phone: +46 708 84 18 32
>> Web: www.tink.se
>
>
>



Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
No, the question isn't closed.  You don't get to decide that.

I don't run a website making claims regarding cassandra and spark; your
employer does.

Again, where are your benchmarks?

I will publish mine, then we'll see what you've got.

--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

> On Feb 11, 2015, at 8:39 AM, DuyHai Doan  wrote:
> 
> For your information Colin: http://en.wikipedia.org/wiki/List_of_fallacies. 
> Look at "Burden of proof"
> 
> You stated "The very nature of cassandra's distributed nature vs partitioning 
> data on hadoop makes spark on hdfs actually faster than on cassandra"
> 
> It's up to YOU to prove it right, not up to me to prove it wrong.
> 
> All other bla bla is troll.
> 
> Come back to me once you get some decent benchmarks supporting your 
> statement, until then, the question is closed.
> 
> 
> 
>> On Wed, Feb 11, 2015 at 3:17 PM, Colin  wrote:
>> Did you want me to include specific examples from my employment at datastax 
>> or start from the ground up? 
>> 
>> All Spark on Cassandra is, is better than the previous use of Hive. 
>> 
>> The fact that datastax hasn't provided any benchmarks themselves other than 
>> glossy marketing statements pretty much says it all: where are your 
>> benchmarks?  Maybe you could combine it with the in-memory option to really 
>> boogie...
>> 
>> :)
>> 
>> (If I find time, I might just write a blog post about exactly how to do 
>> this; it involves the use of parquet and partitioning with clustering, it 
>> doesn't cost anything to do it, and it's in production at my company)
>> --
>> Colin Clark 
>> +1 612 859 6129
>> Skype colin.p.clark
>> 
>>> On Feb 11, 2015, at 6:51 AM, DuyHai Doan  wrote:
>>> 
>>> "The very nature of cassandra's distributed nature vs partitioning data on 
>>> hadoop makes spark on hdfs actually fasted than on cassandra"
>>> 
>>> Prove it. Did you ever have a look into the source code of the 
>>> Spark/Cassandra connector to see how data locality is achieved before 
>>> throwing out such statement ?
>>> 
>>>> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
>>>>  wrote:
>>>> > cassandra makes a very poor datawarehouse or long term time series store
>>>> 
>>>> Really? This is not the impression I have... I think Cassandra is good
>>>> for storing large amounts of data and historical information; it's only
>>>> not good for storing temporary data.
>>>> Netflix has a large amount of data and it's all stored in Cassandra, 
>>>> AFAIK. 
>>>> 
>>>> > The very nature of cassandra's distributed nature vs partitioning data 
>>>> > on hadoop makes spark on hdfs actually faster than on cassandra.
>>>> 
>>>> I am not sure about the current state of Spark support for Cassandra, but 
>>>> I guess if you create a map/reduce job, the intermediate map results will 
>>>> still be stored in HDFS, as happens with hadoop, is this right? I think 
>>>> the problem with Spark + Cassandra or with Hadoop + Cassandra is that the 
>>>> hard part spark or hadoop does, the shuffling, could be done out of the 
>>>> box with Cassandra, but no one takes advantage of that. What if a map / 
>>>> reduce job used a temporary CF in Cassandra to store intermediate results?
>>>> 
>>>> From: user@cassandra.apache.org 
>>>> Subject: Re: How to speed up SELECT * query in Cassandra
>>>> I use spark with cassandra, and you don't need DSE.
>>>> 
>>>> I see a lot of people ask this same question below (how do I get a lot of 
>>>> data out of cassandra?), and my question is always, why aren't you 
>>>> updating both places at once?
>>>> 
>>>> For example, we use hadoop and cassandra in conjunction with each other; 
>>>> we use a message bus to store every event in both, aggregate in both, but 
>>>> only keep current data in cassandra (cassandra makes a very poor 
>>>> datawarehouse or long term time series store) and then use services to 
>>>> process queries that merge data from hadoop and cassandra.  
>>>> 
>>>> Also, spark on hdfs gives more flexibility in terms of large datasets and 
>>>> performance.  The very nature of cassandra's distributed nature vs 
>>>> partitioning data on hadoop makes spark on hdfs actually faster than on 
>>>> cassandra
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Colin Clark 
>>>> +1 612 859 6129
>>>> Skype colin.p.clark
>>>> 
>>>>> On Feb 11, 2015, at 4:49 AM, Jens Rantil  wrote:
>>>>> 
>>>>> 
>>>>>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
>>>>>>  wrote:
>>>>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>>>> 
>>>>> Even better, you can use Spark/Shark with DSE.
>>>>> 
>>>>> Cheers,
>>>>> Jens
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Jens Rantil
>>>>> Backend engineer
>>>>> Tink AB
>>>>> 
>>>>> Email: jens.ran...@tink.se
>>>>> Phone: +46 708 84 18 32
>>>>> Web: www.tink.se
>>>>> 
> 


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread DuyHai Doan
For your information Colin: http://en.wikipedia.org/wiki/List_of_fallacies.
Look at "Burden of proof
<http://en.wikipedia.org/wiki/Philosophic_burden_of_proof>"

You stated "The very nature of cassandra's distributed nature vs
partitioning data on hadoop makes spark on hdfs actually faster than on
cassandra"

It's up to YOU to prove it right, not up to me to prove it wrong.

All other bla bla is troll.

Come back to me once you get some decent benchmarks supporting your
statement, until then, the question is closed.



On Wed, Feb 11, 2015 at 3:17 PM, Colin  wrote:

> Did you want me to include specific examples from my employment at
> datastax or start from the ground up?
>
> All Spark on Cassandra is, is better than the previous use of Hive.
>
> The fact that datastax hasn't provided any benchmarks themselves other than
> glossy marketing statements pretty much says it all: where are your
> benchmarks?  Maybe you could combine it with the in-memory option to really
> boogie...
>
> :)
>
> (If I find time, I might just write a blog post about exactly how to do
> this; it involves the use of parquet and partitioning with clustering, it
> doesn't cost anything to do it, and it's in production at my company)
> --
> Colin Clark
> +1 612 859 6129
> Skype colin.p.clark
>
> On Feb 11, 2015, at 6:51 AM, DuyHai Doan  wrote:
>
> "The very nature of cassandra's distributed nature vs partitioning data
> on hadoop makes spark on hdfs actually fasted than on cassandra"
>
> Prove it. Did you ever have a look into the source code of the
> Spark/Cassandra connector to see how data locality is achieved before
> throwing out such statement ?
>
> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <
> mvallemil...@bloomberg.net> wrote:
>
>> > cassandra makes a very poor datawarehouse or long term time series store
>>
>> Really? This is not the impression I have... I think Cassandra is good
>> for storing large amounts of data and historical information; it's only
>> not good for storing temporary data.
>> Netflix has a large amount of data and it's all stored in Cassandra,
>> AFAIK.
>>
>> > The very nature of cassandra's distributed nature vs partitioning data
>> on hadoop makes spark on hdfs actually faster than on cassandra.
>>
>> I am not sure about the current state of Spark support for Cassandra, but
>> I guess if you create a map/reduce job, the intermediate map results will
>> still be stored in HDFS, as happens with hadoop, is this right? I think
>> the problem with Spark + Cassandra or with Hadoop + Cassandra is that the
>> hard part spark or hadoop does, the shuffling, could be done out of the box
>> with Cassandra, but no one takes advantage of that. What if a map / reduce
>> job used a temporary CF in Cassandra to store intermediate results?
>>
>> From: user@cassandra.apache.org
>> Subject: Re: How to speed up SELECT * query in Cassandra
>>
>> I use spark with cassandra, and you don't need DSE.
>>
>> I see a lot of people ask this same question below (how do I get a lot of
>> data out of cassandra?), and my question is always, why aren't you updating
>> both places at once?
>>
>> For example, we use hadoop and cassandra in conjunction with each other;
>> we use a message bus to store every event in both, aggregate in both, but
>> only keep current data in cassandra (cassandra makes a very poor
>> datawarehouse or long term time series store) and then use services to
>> process queries that merge data from hadoop and cassandra.
>>
>> Also, spark on hdfs gives more flexibility in terms of large datasets and
>> performance.  The very nature of cassandra's distributed nature vs
>> partitioning data on hadoop makes spark on hdfs actually faster than on
>> cassandra
>>
>>
>>
>> --
>> Colin Clark
>> +1 612 859 6129
>> Skype colin.p.clark
>>
>> On Feb 11, 2015, at 4:49 AM, Jens Rantil  wrote:
>>
>>
>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <
>> mvallemil...@bloomberg.net> wrote:
>>
>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>
>>
>> Even better, you can use Spark/Shark with DSE.
>>
>> Cheers,
>> Jens
>>
>>
>> --
>> Jens Rantil
>> Backend engineer
>> Tink AB
>>
>> Email: jens.ran...@tink.se
>> Phone: +46 708 84 18 32
>> Web: www.tink.se
>>
>>
>>
>>
>


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
Did you want me to include specific examples from my employment at datastax or 
start from the ground up? 

All Spark on Cassandra is, is better than the previous use of Hive. 

The fact that datastax hasn't provided any benchmarks themselves other than 
glossy marketing statements pretty much says it all: where are your benchmarks?  
Maybe you could combine it with the in-memory option to really boogie...

:)

(If I find time, I might just write a blog post about exactly how to do this; 
it involves the use of parquet and partitioning with clustering, it doesn't 
cost anything to do it, and it's in production at my company)
--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

> On Feb 11, 2015, at 6:51 AM, DuyHai Doan  wrote:
> 
> "The very nature of cassandra's distributed nature vs partitioning data on 
> hadoop makes spark on hdfs actually fasted than on cassandra"
> 
> Prove it. Did you ever have a look into the source code of the 
> Spark/Cassandra connector to see how data locality is achieved before 
> throwing out such statement ?
> 
>> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
>>  wrote:
>> > cassandra makes a very poor datawarehouse or long term time series store
>> 
>> Really? This is not the impression I have... I think Cassandra is good
>> for storing large amounts of data and historical information; it's only
>> not good for storing temporary data.
>> Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. 
>> 
>> > The very nature of cassandra's distributed nature vs partitioning data on 
>> > hadoop makes spark on hdfs actually faster than on cassandra.
>> 
>> I am not sure about the current state of Spark support for Cassandra, but I 
>> guess if you create a map/reduce job, the intermediate map results will 
>> still be stored in HDFS, as happens with hadoop, is this right? I think the 
>> problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard 
>> part spark or hadoop does, the shuffling, could be done out of the box with 
>> Cassandra, but no one takes advantage of that. What if a map / reduce job 
>> used a temporary CF in Cassandra to store intermediate results?
>> 
>> From: user@cassandra.apache.org 
>> Subject: Re: How to speed up SELECT * query in Cassandra
>> I use spark with cassandra, and you don't need DSE.
>> 
>> I see a lot of people ask this same question below (how do I get a lot of 
>> data out of cassandra?), and my question is always, why aren't you updating 
>> both places at once?
>> 
>> For example, we use hadoop and cassandra in conjunction with each other; we 
>> use a message bus to store every event in both, aggregate in both, but only 
>> keep current data in cassandra (cassandra makes a very poor datawarehouse or 
>> long term time series store) and then use services to process queries that 
>> merge data from hadoop and cassandra.  
>> 
>> Also, spark on hdfs gives more flexibility in terms of large datasets and 
>> performance.  The very nature of cassandra's distributed nature vs 
>> partitioning data on hadoop makes spark on hdfs actually faster than on 
>> cassandra
>> 
>> 
>> 
>> --
>> Colin Clark 
>> +1 612 859 6129
>> Skype colin.p.clark
>> 
>>> On Feb 11, 2015, at 4:49 AM, Jens Rantil  wrote:
>>> 
>>> 
>>>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
>>>>  wrote:
>>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>> 
>>> Even better, you can use Spark/Shark with DSE.
>>> 
>>> Cheers,
>>> Jens
>>> 
>>> 
>>> -- 
>>> Jens Rantil
>>> Backend engineer
>>> Tink AB
>>> 
>>> Email: jens.ran...@tink.se
>>> Phone: +46 708 84 18 32
>>> Web: www.tink.se
>>> 
>> 
> 


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Ja Sam
Your answer looks very promising.

How do you calculate start and stop?

On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky  wrote:

> The fastest way I am aware of is to do the queries in parallel to
> multiple cassandra nodes and make sure that you only ask them for keys
> they are responsible for. Otherwise, the node needs to resend your query,
> which is much slower and creates unnecessary objects (and thus GC
> pressure).
>
> You can manually take advantage of the token range information, if the
> driver does not take this into account for you. Then, you can play with
> concurrency and batch size of a single query against one node.
> Basically, what you/the driver should do is to transform the query into a
> series of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)" queries.
>
> I will need to look up the actual code, but the idea should be clear :)
>
> Jirka H.
>
>
> On 02/11/2015 11:26 AM, Ja Sam wrote:
> > Is there a simple way (or even a complicated one) how I can speed up
> > SELECT * FROM [table] query?
> > I need to get all rows from one table every day. I split tables, and
> > create one for each day, but still the query is quite slow (200
> > million records)
> >
> > I was thinking about running this query in parallel, but I don't know
> > if it is possible
>
>


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread DuyHai Doan
"The very nature of cassandra's distributed nature vs partitioning data on
hadoop makes spark on hdfs actually fasted than on cassandra"

Prove it. Did you ever have a look into the source code of the
Spark/Cassandra connector to see how data locality is achieved before
throwing out such statement ?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) <
mvallemil...@bloomberg.net> wrote:

> > cassandra makes a very poor datawarehouse or long term time series store
>
> Really? This is not the impression I have... I think Cassandra is good
> for storing large amounts of data and historical information; it's only
> not good for storing temporary data.
> Netflix has a large amount of data and it's all stored in Cassandra,
> AFAIK.
>
> > The very nature of cassandra's distributed nature vs partitioning data
> on hadoop makes spark on hdfs actually faster than on cassandra.
>
> I am not sure about the current state of Spark support for Cassandra, but
> I guess if you create a map/reduce job, the intermediate map results will
> still be stored in HDFS, as happens with hadoop, is this right? I think
> the problem with Spark + Cassandra or with Hadoop + Cassandra is that the
> hard part spark or hadoop does, the shuffling, could be done out of the box
> with Cassandra, but no one takes advantage of that. What if a map / reduce
> job used a temporary CF in Cassandra to store intermediate results?
>
> From: user@cassandra.apache.org
> Subject: Re: How to speed up SELECT * query in Cassandra
>
> I use spark with cassandra, and you don't need DSE.
>
> I see a lot of people ask this same question below (how do I get a lot of
> data out of cassandra?), and my question is always, why aren't you updating
> both places at once?
>
> For example, we use hadoop and cassandra in conjunction with each other;
> we use a message bus to store every event in both, aggregate in both, but
> only keep current data in cassandra (cassandra makes a very poor
> datawarehouse or long term time series store) and then use services to
> process queries that merge data from hadoop and cassandra.
>
> Also, spark on hdfs gives more flexibility in terms of large datasets and
> performance.  The very nature of cassandra's distributed nature vs
> partitioning data on hadoop makes spark on hdfs actually faster than on
> cassandra
>
>
>
> --
> Colin Clark
> +1 612 859 6129
> Skype colin.p.clark
>
> On Feb 11, 2015, at 4:49 AM, Jens Rantil  wrote:
>
>
> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <
> mvallemil...@bloomberg.net> wrote:
>
>> If you use Cassandra enterprise, you can use hive, AFAIK.
>
>
> Even better, you can use Spark/Shark with DSE.
>
> Cheers,
> Jens
>
>
> --
> Jens Rantil
> Backend engineer
> Tink AB
>
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
>
>
>
>


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
> cassandra makes a very poor datawarehouse or long term time series store

Really? This is not the impression I have... I think Cassandra is good for
storing large amounts of data and historical information; it's only not good
for storing temporary data.
Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. 

> The very nature of cassandra's distributed nature vs partitioning data on 
> hadoop makes spark on hdfs actually faster than on cassandra.

I am not sure about the current state of Spark support for Cassandra, but I 
guess if you create a map/reduce job, the intermediate map results will still 
be stored in HDFS, as happens with hadoop, is this right? I think the problem 
with Spark + Cassandra or with Hadoop + Cassandra is that the hard part spark 
or hadoop does, the shuffling, could be done out of the box with Cassandra, 
but no one takes advantage of that. What if a map / reduce job used a 
temporary CF in Cassandra to store intermediate results?
From: user@cassandra.apache.org 
Subject: Re: How to speed up SELECT * query in Cassandra

I use spark with cassandra, and you don't need DSE.

I see a lot of people ask this same question below (how do I get a lot of data 
out of cassandra?), and my question is always, why aren't you updating both 
places at once?

For example, we use hadoop and cassandra in conjunction with each other; we use 
a message bus to store every event in both, aggregate in both, but only keep 
current data in cassandra (cassandra makes a very poor datawarehouse or long 
term time series store) and then use services to process queries that merge 
data from hadoop and cassandra.  

Also, spark on hdfs gives more flexibility in terms of large datasets and 
performance.  The very nature of cassandra's distributed nature vs partitioning 
data on hadoop makes spark on hdfs actually faster than on cassandra


--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil  wrote:


On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
 wrote:

If you use Cassandra enterprise, you can use hive, AFAIK.

Even better, you can use Spark/Shark with DSE.

Cheers,
Jens


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se





Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
The fastest way I am aware of is to do the queries in parallel to
multiple cassandra nodes and make sure that you only ask them for keys
they are responsible for. Otherwise, the node needs to resend your query,
which is much slower and creates unnecessary objects (and thus GC pressure).

You can manually take advantage of the token range information, if the
driver does not take this into account for you. Then, you can play with
concurrency and batch size of a single query against one node.
Basically, what you/the driver should do is to transform the query into a
series of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)" queries.

I will need to look up the actual code, but the idea should be clear :)
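
In the meantime, a minimal sketch of the idea in scala (it assumes the
DataStax java driver and a Murmur3 cluster; contact point, table, and key
names are made up, and the real CQL bounds token() explicitly rather than
using a literal TOKEN IN):

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
val session = cluster.connect("my_keyspace")

// split the Murmur3 token space into `concurrency` contiguous ranges
val min = BigInt(-2).pow(63)
val max = BigInt(2).pow(63) - 1
val concurrency = 8
val step = (max - min) / concurrency

(0 until concurrency).par.foreach { i =>
  val (start, stop) = (min + i * step, min + (i + 1) * step)
  // one range per worker; "(start, stop)" becomes explicit token() bounds
  session.execute(
    s"SELECT * FROM my_table WHERE token(pk) > $start AND token(pk) <= $stop")
}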

Jirka H.


On 02/11/2015 11:26 AM, Ja Sam wrote:
> Is there a simple way (or even a complicated one) how I can speed up
> SELECT * FROM [table] query?
> I need to get all rows from one table every day. I split tables, and
> create one for each day, but still the query is quite slow (200
> million records)
>
> I was thinking about running this query in parallel, but I don't know
> if it is possible



Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
I use spark with cassandra, and you don't need DSE.

I see a lot of people ask this same question below (how do I get a lot of data 
out of cassandra?), and my question is always, why aren't you updating both 
places at once?

For example, we use hadoop and cassandra in conjunction with each other; we use 
a message bus to store every event in both, aggregate in both, but only keep 
current data in cassandra (cassandra makes a very poor datawarehouse or long 
term time series store) and then use services to process queries that merge 
data from hadoop and cassandra.  

Also, spark on hdfs gives more flexibility in terms of large datasets and 
performance.  The very nature of cassandra's distributed nature vs partitioning 
data on hadoop makes spark on hdfs actually faster than on cassandra



--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

> On Feb 11, 2015, at 4:49 AM, Jens Rantil  wrote:
> 
> 
>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
>>  wrote:
>> If you use Cassandra enterprise, you can use hive, AFAIK.
> 
> Even better, you can use Spark/Shark with DSE.
> 
> Cheers,
> Jens
> 
> 
> -- 
> Jens Rantil
> Backend engineer
> Tink AB
> 
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
> 


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jens Rantil
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) <
mvallemil...@bloomberg.net> wrote:

> If you use Cassandra enterprise, you can use hive, AFAIK.


Even better, you can use Spark/Shark with DSE.

Cheers,
Jens


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
