Re: How to speed up SELECT * query in Cassandra

2015-02-16 Thread mck

 Could you please share how much data you store on the cluster and what
 the HW configuration of the nodes is?


These nodes are dedicated HW: 24 CPU cores and 50GB RAM each.
Each node has a few TBs of data (you don't want to go over this) in
RAID 50 (we're migrating over to JBOD).
Each c* node runs 2.0.11 and is configured with an 8GB heap, a 2GB new
gen, and jdk1.7.0_55.

Hadoop (2.2.0) tasktrackers and dfs run on these nodes as well; all up
they use up to 12GB RAM, leaving ~30GB for the kernel and page cache.
Data-locality is an important goal: in the worst cases we've seen it
make a fourfold difference in throughput.

HDFS, which for us is a volatile hadoop-internals space, is on SSDs,
providing strong m/r performance.
 (The commitlog of course is also on SSD – we made the mistake of putting
 it on the same SSD to begin with. Don't do that; the commitlog gets its
 own SSD.)
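
For reference, the relevant cassandra.yaml settings look something like
the following (a sketch only; the paths are illustrative, not our actual
layout):

    # cassandra.yaml - illustrative disk layout
    data_file_directories:
        - /srv/cassandra/data              # big data disks (one entry per disk for JBOD)
    commitlog_directory: /ssd2/commitlog   # a dedicated SSD, shared with nothing else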


 I am really impressed that you are
 able to read 100M records in ~4 minutes on 4 nodes. That makes something
 like 100k reads per second per node, which is something we are quite far
 away from.


These are not individual reads, and not the number of partition keys, but
m/r records (i.e. CQL rows).
But yes, the performance of spark against cassandra is impressive.


 It leads me to question, whether reading from Spark goes through
 Cassandra's JVM and thus go through normal read path, or if it reads the
 sstables directly from disks sequentially and possibly filters out
 old/tombstone values by itself?


Both the Hadoop-Cassandra integration and the Spark-Cassandra connector go
through the normal read path, like all CQL read queries.

With our m/r jobs each task works with just one partition key, doing
repeated column-slice reads through that partition key according to the
ConfigHelper.rangeBatchSize setting, which we have set to 100. These
hadoop jobs use a custom-written CqlInputFormat because of the poor
performance CqlInputFormat has today against a vnodes setup; the
customisation we have is pretty much the same as the patch on offer in
CASSANDRA-6091.
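
For what it's worth, a minimal sketch of that part of the job setup
(assuming Cassandra's org.apache.cassandra.hadoop.ConfigHelper; the
keyspace and table names here are made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.cassandra.hadoop.ConfigHelper

    val conf = new Configuration()
    // read 100 cql rows per column-slice request against each partition key
    ConfigHelper.setRangeBatchSize(conf, 100)
    // hypothetical keyspace/table names
    ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "raw_events")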

We haven't experienced this vnodes problem with the spark connector.
I presume that, like the hadoop integration, spark also bulk-reads
(column slices) from each partition key, as in the sketch below.
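
A minimal read with the Spark-Cassandra connector looks something like
this (a sketch, assuming a connector of the current 1.x line; keyspace
and table names are hypothetical):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf()
      .setAppName("full-scan")
      .set("spark.cassandra.connection.host", "127.0.0.1"))

    // spark partitions are built from groups of token ranges,
    // so each executor reads mostly-local data in bulk
    val rows = sc.cassandraTable("my_keyspace", "raw_events")
    println(rows.count())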

Otherwise, this is useful reading:
http://wiki.apache.org/cassandra/HadoopSupport#Troubleshooting


 This is also a cluster that serves requests to web applications that
 need low latency.

Let it be said this isn't something I'd recommend, just the path we had
to take because of our small initial dedicated-HW cluster.
(You really want to separate online and offline datacenters, so that you
can maximise the offline cluster for the heavy batch reads.)

~mck


Re: How to speed up SELECT * query in Cassandra

2015-02-14 Thread mck
Jirka,

 But I am really interested in how it can work well with Spark/Hadoop,
 where you basically need to read all the data as well (as far as I
 understand it).


I can't give you any benchmarking between technologies (nor am I
particularly interested in getting involved in such a discussion) but I
can share our experiences with Cassandra, Hadoop, and Spark over the
past 4+ years, and hopefully assure you that Cassandra+Spark is a smart
choice.

On a four-node cluster we were running 5000+ small hadoop jobs each day,
each finishing within two minutes, often within one, resulting in (give
or take) a billion records read from and 150 million records written to
c* daily.
These small jobs incrementally process limited partition key sets each
time. They primarily read from a raw events store that has a TTL of 3
months and accrues 22+GB of tombstones a day (reads over old partition
keys are rare).

We also run full-table-scan jobs and have never come across any issues
particular to that. There are hadoop map/reduce settings to make jobs
more resilient if you have tables with troublesome partition keys, as
sketched below.
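
For example (a sketch against the old mapred API; the exact knobs depend
on your Hadoop version), you can retry maps harder and tolerate a
percentage of outright map failures:

    import org.apache.hadoop.mapred.JobConf

    val job = new JobConf()
    job.setMaxMapAttempts(8)              // retry each map task up to 8 times
    job.setMaxMapTaskFailuresPercent(5)   // let 5% of maps fail without failing the job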

This is also a cluster that serves requests to web applications that
need low latency.

We recently wrote a spark job that does full table scans over 100
million+ rows, involves a handful of stages (two tables, 9 maps, 4
reduces, and 2 joins), and writes 5 million rows back to a new table.
This job runs in ~260 seconds.

Spark is becoming a natural complement to schema evolution for
cassandra, something you'll want to do to keep your schema optimised
against your read request patterns, even for little things like
switching clustering keys around.
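
As a hedged sketch of what such a migration can look like with the
connector (given a SparkContext sc configured for the cluster; the table
and column names are invented, and the target table must already exist
with its new primary key definition):

    import com.datastax.spark.connector._

    // copy events into a table whose clustering order fits the read pattern
    sc.cassandraTable("my_keyspace", "events")
      .saveToCassandra("my_keyspace", "events_by_type",
        SomeColumns("event_type", "event_time", "payload"))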

With any new technology, hitting some hurdles (especially if you go
wandering outside recommended practices) will of course be part of the
game, but that said I've only had positive experiences with this
community's ability to help out (and do so quickly).

Starting from scratch I'd use Spark (on Scala) over Hadoop, no
questions asked.
Otherwise Cassandra has always been our 'big data' platform;
hadoop/spark is just an extra tool on top.
We've never kept data in hdfs and are very grateful for having made that
choice.

~mck

ref
https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/


Re: How to speed up SELECT * query in Cassandra

2015-02-13 Thread Jens Rantil
If you are using Spark you need to be _really_ careful about your
tombstones. In our experience a single partition with too many tombstones
can take down the whole batch job (until something like
https://issues.apache.org/jira/browse/CASSANDRA-8574 is fixed). This was a
major obstacle for us to overcome when using Spark.
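
The tombstone guardrails in cassandra.yaml are worth watching here; a
sketch, with what I believe are the 2.0/2.1-era defaults:

    tombstone_warn_threshold: 1000       # warn when one slice query touches this many tombstones
    tombstone_failure_threshold: 100000  # abort the read beyond this many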

Cheers,
Jens

On Wed, Feb 11, 2015 at 5:12 PM, Jiri Horky ho...@avast.com wrote:

  Well, I always wondered how Cassandra can be used in a Hadoop-like
 environment where you basically need to do full table scans.

 I need to say that our experience is that cassandra is perfect for
 writing and for reading specific values by key, but definitely not for
 reading all of the data out of it. Some of our projects found out that
 doing that with a non-trivial amount of data in a timely manner is close
 to impossible in many situations. We are slowly moving to storing the
 data in HDFS and possibly reprocessing it on a daily basis for such
 usecases (statistics).

 This is nothing against Cassandra; it cannot be perfect for everything.
 But I am really interested in how it can work well with Spark/Hadoop,
 where you basically need to read all the data as well (as far as I
 understand it).

 Jirka H.




Re: How to speed up SELECT * query in Cassandra

2015-02-12 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks Jirka!


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Well, I always wondered how Cassandra can be used in a Hadoop-like
environment where you basically need to do full table scans.

I need to say that our experience is that cassandra is perfect for
writing and for reading specific values by key, but definitely not for
reading all of the data out of it. Some of our projects found out that
doing that with a non-trivial amount of data in a timely manner is close
to impossible in many situations. We are slowly moving to storing the
data in HDFS and possibly reprocessing it on a daily basis for such
usecases (statistics).

This is nothing against Cassandra; it cannot be perfect for everything.
But I am really interested in how it can work well with Spark/Hadoop,
where you basically need to read all the data as well (as far as I
understand it).

Jirka H.


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread DuyHai Doan
For your information Colin: http://en.wikipedia.org/wiki/List_of_fallacies.
Look at "Burden of proof":
http://en.wikipedia.org/wiki/Philosophic_burden_of_proof

You stated "The very nature of cassandra's distributed nature vs
partitioning data on hadoop makes spark on hdfs actually faster than on
cassandra".

It's up to YOU to prove it right, not up to me to prove it wrong.

All other bla bla is troll.

Come back to me once you get some decent benchmarks supporting your
statement, until then, the question is closed.



On Wed, Feb 11, 2015 at 3:17 PM, Colin co...@clark.ws wrote:

 Did you want me to include specific examples from my employment at
 datastax or start from the ground up?

 All spark on cassandra is, is a better alternative to the previous use
 of hive.

 The fact that datastax hasn't provided any benchmarks themselves, other
 than glossy marketing statements, pretty much says it all: where are your
 benchmarks? Maybe you could combine it with the in-memory option to really
 boogie...

 :)

 (If I find time, I might just write a blog post about exactly how to do
 this; it involves the use of parquet and partitioning with clustering, it
 doesn't cost anything to do it, and it's in production at my company)


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
Did you want me to include specific examples from my employment at datastax or
start from the ground up?

All spark on cassandra is, is a better alternative to the previous use of hive.

The fact that datastax hasn't provided any benchmarks themselves, other than
glossy marketing statements, pretty much says it all: where are your
benchmarks? Maybe you could combine it with the in-memory option to really
boogie...

:)

(If I find time, I might just write a blog post about exactly how to do this;
it involves the use of parquet and partitioning with clustering, it doesn't
cost anything to do it, and it's in production at my company)
--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

 On Feb 11, 2015, at 6:51 AM, DuyHai Doan doanduy...@gmail.com wrote:
 
 The very nature of cassandra's distributed nature vs partitioning data on
 hadoop makes spark on hdfs actually faster than on cassandra

 Prove it. Did you ever have a look into the source code of the
 Spark/Cassandra connector to see how data locality is achieved before
 throwing out such a statement?
 


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
No, the question isn't closed. You don't get to decide that.

I don't run a website making claims regarding cassandra and spark - your
employer does.

Again, where are your benchmarks?

I will publish mine, then we'll see what you've got.

--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

 On Feb 11, 2015, at 8:39 AM, DuyHai Doan doanduy...@gmail.com wrote:
 
 For your information Colin: http://en.wikipedia.org/wiki/List_of_fallacies.
 Look at "Burden of proof".

 You stated "The very nature of cassandra's distributed nature vs partitioning
 data on hadoop makes spark on hdfs actually faster than on cassandra"

 It's up to YOU to prove it right, not up to me to prove it wrong.
 
 All other bla bla is troll.
 
 Come back to me once you get some decent benchmarks supporting your 
 statement, until then, the question is closed.
 
 
 

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Hi,

here are some snippets of code in scala which should get you started.

Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None      => initialQuery(lowerLimit)
  }
  session.execute(query).all
}

private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}

private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: " + start.underlying.toPlainString
        + ", " + end.underlying.toPlainString)
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName,
    columnKeyName,
    columnValueName,
    columnValueName,
    columnFamily,
    "%s", // template slot for the token condition
    whereCondition,
    pageSize,
    if (cqlAllowFiltering) " allow filtering" else "")
}

case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}
case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}
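
To tie the snippets together, a hedged usage sketch (scanRange is a
made-up driver that runs the paging loop above over one token range):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val (_, _, ranges) = calculateRanges
    // scan all token ranges concurrently, one paging loop per range
    val done = Future.sequence(ranges.map { case (start, end) =>
      Future(scanRange(start, end)) // hypothetical helper
    })
    Await.result(done, 1.hour)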


On 02/11/2015 02:21 PM, Ja Sam wrote:
 Your answer looks very promising

  How do you calculate start and stop?





How to speed up SELECT * query in Cassandra

2015-02-11 Thread Ja Sam
Is there a simple way (or even a complicated one) to speed up a SELECT
* FROM [table] query?
I need to get all rows from one table every day. I split the tables,
creating one per day, but the query is still quite slow (200 million
records).

I was thinking about running this query in parallel, but I don't know if
it is possible.


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
I use spark with cassandra, and you don't need DSE.

I see a lot of people ask this same question below (how do I get a lot of data
out of cassandra?), and my question is always: why aren't you updating both
places at once?

For example, we use hadoop and cassandra in conjunction with each other: we use
a message bus to store every event in both and aggregate in both, but only keep
current data in cassandra (cassandra makes a very poor data warehouse or
long-term time-series store), and then use services to process queries that
merge data from hadoop and cassandra.

Also, spark on hdfs gives more flexibility in terms of large datasets and
performance. The very nature of cassandra's distributed nature vs partitioning
data on hadoop makes spark on hdfs actually faster than on cassandra.



--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

 On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:
 
 
 On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
 mvallemil...@bloomberg.net wrote:
 If you use Cassandra enterprise, you can use hive, AFAIK.
 
 Even better, you can use Spark/Shark with DSE.
 
 Cheers,
 Jens
 
 


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jens Rantil
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

 If you use Cassandra enterprise, you can use hive, AFAIK.


Even better, you can use Spark/Shark with DSE.

Cheers,
Jens


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se

Facebook: https://www.facebook.com/#!/tink.se
LinkedIn: http://www.linkedin.com/company/2735919
Twitter: https://twitter.com/tink


Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
The fastest way I am aware of is to do the queries in parallel against
multiple cassandra nodes and make sure that you only ask them for keys
they are responsible for. Otherwise, the node needs to forward your query,
which is much slower and creates unnecessary objects (and thus GC pressure).

You can manually take advantage of the token range information, if the
driver does not take this into account for you. Then, you can play with
the concurrency and the batch size of a single query against one node.
Basically, what you/the driver should do is transform the query into a
series of queries of the form SELECT * FROM table WHERE token(key) > start
AND token(key) <= stop.
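
In code, one such range query could look like this (a sketch with the
DataStax Java driver; keyspace, table, and key names are invented):

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")

    // one slice of the Murmur3 token space; issue many of these in
    // parallel, one per range, each against a replica owning the range
    val (start, stop) = (-4611686018427387904L, 0L)
    val rows = session.execute(
      s"SELECT * FROM my_table WHERE token(id) > $start AND token(id) <= $stop").all()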

I will need to look up the actual code, but the idea should be clear :)

Jirka H.


On 02/11/2015 11:26 AM, Ja Sam wrote:
 Is there a simple way (or even a complicated one) how can I speed up
 SELECT * FROM [table] query?
 I need to get all rows form one table every day. I split tables, and
 create one for each day, but still query is quite slow (200 millions
 of records)

 I was thinking about run this query in parallel, but I don't know if
 it is possible



Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread DuyHai Doan
The very nature of cassandra's distributed nature vs partitioning data on
hadoop makes spark on hdfs actually faster than on cassandra

Prove it. Did you ever have a look into the source code of the
Spark/Cassandra connector to see how data locality is achieved before
throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

  cassandra makes a very poor data warehouse or long-term time-series store

 Really? This is not the impression I have... I think Cassandra is good
 for storing large amounts of data and historical information; it's only
 not good for storing temporary data.
 Netflix has a large amount of data and it's all stored in Cassandra,
 AFAIK.

  The very nature of cassandra's distributed nature vs partitioning data
 on hadoop makes spark on hdfs actually faster than on cassandra.

 I am not sure about the current state of Spark support for Cassandra, but
 I guess if you create a map reduce job, the intermediate map results will
 still be stored in HDFS, as happens with hadoop - is this right? I think
 the problem with Spark + Cassandra or with Hadoop + Cassandra is that the
 hard part spark or hadoop does, the shuffling, could be done out of the
 box with Cassandra, but no one takes advantage of that. What if a
 map/reduce job used a temporary CF in Cassandra to store intermediate
 results?






Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Ja Sam
Your answer looks very promising

 How do you calculate start and stop?

On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote:

 The fastest way I am aware of is to do the queries in parallel against
 multiple cassandra nodes and make sure that you only ask them for keys
 they are responsible for. Otherwise, the node needs to forward your
 query, which is much slower and creates unnecessary objects (and thus GC
 pressure).

 You can manually take advantage of the token range information, if the
 driver does not take this into account for you. Then, you can play with
 the concurrency and the batch size of a single query against one node.
 Basically, what you/the driver should do is transform the query into a
 series of queries of the form SELECT * FROM table WHERE token(key) >
 start AND token(key) <= stop.

 I will need to look up the actual code, but the idea should be clear :)

 Jirka H.


 On 02/11/2015 11:26 AM, Ja Sam wrote:
  Is there a simple way (or even a complicated one) how can I speed up
  SELECT * FROM [table] query?
  I need to get all rows form one table every day. I split tables, and
  create one for each day, but still query is quite slow (200 millions
  of records)
 
  I was thinking about run this query in parallel, but I don't know if
  it is possible