Re: Different number of records from COPY command

2015-06-03 Thread Josef Lindman Hörnlund

I ran into that issue a while ago and it was because I hit the tombstone limit
on one of the nodes. Try running `nodetool compact adlog adclicklog20150528`
and see if that helps.
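A minimal sketch of the relevant commands, assuming the keyspace is adlog and
the table is adclicklog20150528 (run on each node):

    # flush memtables so every export run reads the same on-disk data
    nodetool flush adlog adclicklog20150528
    # bring the replicas back in sync with each other
    nodetool repair adlog adclicklog20150528
    # force a major compaction, which also drops expired tombstones
    nodetool compact adlog adclicklog20150528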

Josef Lindman Hörnlund

 On 02 Jun 2015, at 17:48, Saurabh Chandolia s.chando...@gmail.com wrote:
 
 Still getting inconsistent number of records on consistency ALL and QUORUM. 
 Following is the output of consistency ALL and QUORUM.
 
 cqlsh:adlog> CONSISTENCY ALL;
 Consistency level set to ALL.
 cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
 Processed 58000 rows; Write: 3065.60 rows/s
 58463 rows exported in 21.353 seconds.
 cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
 Processed 63000 rows; Write: 3517.03 rows/s
 63972 rows exported in 22.885 seconds.
 
 cqlsh:adlog> CONSISTENCY QUORUM;
 Consistency level set to QUORUM.
 cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
 Processed 63000 rows; Write: 3443.37 rows/s
 63440 rows exported in 21.987 seconds.
 cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
 Processed 65000 rows; Write: 3405.90 rows/s
 65524 rows exported in 24.053 seconds.
 
 
 - Saurabh
 
 On Tue, Jun 2, 2015 at 9:09 PM, Anuj Wadehra anujw_2...@yahoo.co.in wrote:
 I have never exported data myself but can u just try setting 'consistency 
 ALL' on cqlsh before executing command?
 
 Thanks
 Anuj Wadehra
 
 Sent from Yahoo Mail on Android
 From: Saurabh Chandolia s.chando...@gmail.com
 Date: Tue, 2 Jun, 2015 at 8:47 pm
 Subject: Different number of records from COPY command
 
 I am seeing different number of records each time I export a particular 
 table. There were no writes/reads in this table while exporting the data. I 
 am not able to understand why it is happening.
 Am I missing something here?
 
 Cassandra version: 2.1.4
 Java driver version: 2.1.5
 Cluster Size: 4 Nodes in same DC
 Keyspace Replication factor: 2
 
 Following commands were issued:
 cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
 Processed 68000 rows; Write: 3025.93 rows/s
 68682 rows exported in 27.737 seconds.
 
 cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
 Processed 65000 rows; Write: 2821.06 rows/s
 65535 rows exported in 26.667 seconds.
 
 cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
 Processed 66000 rows; Write: 3285.07 rows/s
 66055 rows exported in 26.269 seconds.
 
 
 cfstats for adlog.adclicklog20150528:
 ---
 $ nodetool cfstats adlog.adclicklog20150528
 Keyspace: adlog
   Read Count: 217
   Read Latency: 2.773073732718894 ms.
   Write Count: 103191
   Write Latency: 0.10233075558915021 ms.
   Pending Flushes: 0
   Table: adclicklog20150528
   SSTable count: 11
   Space used (live): 37981202
   Space used (total): 37981202
   Space used by snapshots (total): 13407843
   Off heap memory used (total): 25580
   SSTable Compression Ratio: 0.26684147550494164
   Number of keys (estimate): 5627
   Memtable cell count: 94620
   Memtable data size: 13459445
   Memtable off heap memory used: 0
   Memtable switch count: 19
   Local read count: 217
   Local read latency: 2.774 ms
   Local write count: 103191
   Local write latency: 0.103 ms
   Pending flushes: 0
   Bloom filter false positives: 0
   Bloom filter false ratio: 0.0
   Bloom filter space used: 7192
   Bloom filter off heap memory used: 7104
   Index summary off heap memory used: 980
   Compression metadata off heap memory used: 17496
   Compacted partition minimum bytes: 1110
   Compacted partition maximum bytes: 182785
   Compacted partition mean bytes: 27808
   Average live cells per slice (last five minutes): 44.663594470046085
   Maximum live cells per slice (last five minutes): 86.0
   Average tombstones per slice (last five minutes): 0.0
   Maximum tombstones per slice (last five minutes): 0.0
 
 
 
 - Saurabh
 



duplicate create table event

2015-06-03 Thread 高健峰
Hi,
 I am using Cassandra 2.1.3 and the DataStax driver 2.1.3. I have four
nodes in my datacenter. In the client I have just one Cluster instance object.
Now when a table is created, my client receives 236 CREATE TABLE events in
one second. Does anybody know why this happens?
 Thanks!

-- 
--
Joseph Gao
PhoneNum:15210513582
QQ: 409343351


Re: How to set datastax-agent to connect with jmx authentication

2015-06-03 Thread 贺伟平
Yes, I use the same username/pw in jvisualvm, and it works.

Is the username/pw configured in the OpsCenter UI or in the agent?



When I add the cluster in the OpsCenter UI and fill in the JMX username/pw ...





From: Jason Wee
Sent: Wednesday, 3 June 2015 12:41
To: user@cassandra.apache.org





The error in the log output looks similar to this one:
http://serverfault.com/questions/614810/opscenter-4-1-4-authentication-failing
In OpsCenter 5.1.2, did you configure the same username/password for the
agent and the Cassandra node too?



jason



On Wed, Jun 3, 2015 at 11:13 AM, 贺伟平 wolai...@hotmail.com wrote:










I am using opscenter 5.1.2 and just enabled JMX username/password 
authentication on my Cassandra cluster. I think I've updated all my opscenter 
configs correctly to force the agents to use JMX auth, but it is not working.

I've updated the config under /etc/opscenter/Clusters/[cluster-name].conf with 
the following jmx properties
[jmx]
username=username
password=password
port=7199


I then restarted opscenter and opscenter agents, but see the following error in 
the opscenter agent logs:


INFO [main] 2015-06-03 10:55:53,910 Loading conf files: ./conf/address.yaml
  INFO [main] 2015-06-03 10:55:53,953 Java vendor/version: Java HotSpot(TM) 
64-Bit Server VM/1.7.0_51
  INFO [main] 2015-06-03 10:55:53,953 DataStax Agent version: 5.1.2
  INFO [main] 2015-06-03 10:55:54,010 Default config values: {:cassandra_port 
9042, :rollups300_ttl 2419200, :settings_cf settings, :agent_rpc_interface 
localhost, :restore_req_update_period 60, :my_channel_prefix /agent, 
:poll_period 60, :jmx_username heweiping, :thrift_conn_timeout 1, 
:rollups60_ttl 604800, :stomp_port 61620, :shorttime_interval 10, 
:longtime_interval 300, :max-seconds-to-sleep 25, :private-conf-props 
[initial_token listen_address broadcast_address rpc_address 
broadcast_rpc_address], :thrift_port 9160, :async_retry_timeout 5, 
:agent-conf-group global-cluster-agent-group, :jmx_host 127.0.0.1, 
:ec2_metadata_api_host 169.254.169.254, :metrics_enabled 1, :async_queue_size 
5000, :backup_staging_dir nil, :read-buffer-size 1000, :remote_verify_max 
30, :disk_usage_update_period 60, :throttle-bytes-per-second 50, 
:rollups7200_ttl 31536000, :agent_rpc_broadcast_address localhost, 
:remote_backup_retries 3, :ssl_keystore nil, :rollup_snapshot_period 300, 
:is_package false, :monitor_command 
/usr/share/datastax-agent/bin/datastax_agent_monitor, :thrift_socket_timeout 
5000, :remote_verify_initial_delay 1000, :cassandra_log_location 
/var/log/cassandra/system.log, :max-pending-repairs 5, :remote_backup_region 
us-west-1, :restore_on_transfer_failure false, :tmp_dir 
/var/lib/datastax-agent/tmp/, :config_md5 nil, :jmx_port 7299, 
:write-buffer-size 10, :jmx_metrics_threadpool_size 4, :use_ssl 0, 
:rollups86400_ttl 0, :nodedetails_threadpool_size 3, :api_port 61621, 
:kerberos_service nil, :backup_file_queue_max 1, :jmx_thread_pool_size 5, 
:production 1, :runs_sudo 1, :max_file_transfer_attempts 30, :jmx_password 
eefung, :stomp_interface 172.19.104.123, :storage_keyspace OpsCenter, 
:hosts [127.0.0.1], :rollup_snapshot_threshold 300, :jmx_retry_timeout 30, 
:unthrottled-default 100, :remote_backup_retry_delay 5000, 
:remote_backup_timeout 1000, :seconds-to-read-kill-channel 0.005, 
:realtime_interval 5, :pdps_ttl 259200}
  INFO [main] 2015-06-03 10:55:54,174 Waiting for the config from OpsCenter
  INFO [main] 2015-06-03 10:55:54,175 Attempting to determine Cassandra's 
broadcast address through JMX
  INFO [main] 2015-06-03 10:55:54,176 Starting Stomp
  INFO [main] 2015-06-03 10:55:54,176 Starting up agent communcation with 
OpsCenter.
  INFO [Initialization] 2015-06-03 10:55:54,180 New JMX connection 
(127.0.0.1:7299)
  WARN [Initialization] 2015-06-03 10:55:54,409 Error when trying to match our 
local token: java.lang.SecurityException: Authentication failed! Credentials 
required
  INFO [main] 2015-06-03 10:55:59,412 Reconnecting to a backup OpsCenter 
instance
  INFO [main] 2015-06-03 10:55:59,413 SSL communication is disabled
  INFO [main] 2015-06-03 10:55:59,413 Creating stomp connection to 
172.19.104.123:61620
  INFO [Initialization] 2015-06-03 10:55:59,418 Sleeping for 2s before trying 
to determine IP over JMX again
  WARN [clojure-agent-send-off-pool-0] 2015-06-03 10:55:59,422 Tried to send 
message while not connected: /conf-request 
[[172.19.104.123,0:0:0:0:0:0:0:1%1,fe80:0:0:0:225:90ff:fe6a:d35c%2,127.0.0.1],[5.1.2,\/437054467\/conf]]
  INFO [StompConnection receiver] 2015-06-03 10:55:59,423 Reconnecting in 0s.
  INFO [StompConnection receiver] 2015-06-03 10:55:59,424 Connected to 
172.19.104.123:61620
  INFO [main] 2015-06-03 10:55:59,432 Starting Jetty server: {:join? false, 
:ssl? false, :host localhost, :port 61621}





Checks with other jmx based tools (nodetool, jmxtrans) confirm that the jmx
setup is correct.
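For reference, the agent reads its JMX credentials from ./conf/address.yaml; a
sketch with placeholder values (the key names are taken from the config dump in
the log above, so treat the exact spelling and file location as assumptions):

    # conf/address.yaml on each node running the DataStax agent
    stomp_interface: 172.19.104.123
    jmx_port: 7299
    jmx_username: username
    jmx_password: password

Whatever goes here has to match an entry in the jmxremote.password /
jmxremote.access files configured on the Cassandra node itself.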




Any ideas?




Thank you very much!








Sent from Windows Mail

How to store denormalized data

2015-06-03 Thread Matthew Johnson
Hi all,



I am trying to store some data (user actions in our application) for future
analysis (probably using Spark). I understand best practice is to store it
in denormalized form, and this will definitely make some of our future
queries much easier. But I have a problem with denormalizing the data.



For example, let’s say one of my queries is “the number of reports
generated by user type”. In the part of the application that the user
connects to to generate reports, we only have access to the user id. In a
traditional RDBMS, this is fine, because at query time you join the user id
onto the users table and get all the user data associated with that user.
But how do I populate extra fields like user type on the fly?



My ideas so far:

1.   I try and maintain an in-memory cache of data such as “user”, and
do a lookup to this cache for every user action and store the user data
with it. #PROS: fast #CONS: not scalable, will run out of memory if data
sets grow

2.   For each user action, I do a call to RDBMS and look up the data
for the user in question, then store the user action plus the user data as
a single row. #PROS easy to scale #CONS slow

3.   I write only the user id and the action straight away, and have a
separate batch process that periodically goes through my table looking for
rows without user data, and looks up the user data from RDBMS and populates
it





None of these solutions seem ideal to me. Does Cassandra have something
like ‘triggers’, where I can set up a table to automatically populate some
rows based on a lookup from another table? Or perhaps Spark or some other
library has built-in functionality that solves exactly this problem?



Any suggestions much appreciated.



Thanks,

Matthew


Re: How to store denormalized data

2015-06-03 Thread Shahab Yunus
Suggestion or rather food for thought

Do you expect to read/analyze the written data right away? Or will it be a
batch process, kicked off later in time? What I am trying to say is that if
the 'read/analysis' part is a) batch process and b) kicked off later in
time, then #3 is a fine solution? What harm in it? Also, you can slightly
change it, (if applicable) and not populate as a separate batch process but
in fact make part of  your analysis job? Kind of a pre-process/prep step?

Regards,
Shahab

On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson matt.john...@algomi.com
wrote:

 Hi all,



 I am trying to store some data (user actions in our application) for
 future analysis (probably using Spark). I understand best practice is to
 store it in denormalized form, and this will definitely make some of our
 future queries much easier. But I have a problem with denormalizing the
 data.



 For example, let’s say one of my queries is “the number of reports
 generated by user type”. In the part of the application that the user
 connects to to generate reports, we only have access to the user id. In a
 traditional RDBMS, this is fine, because at query time you join the user id
 onto the users table and get all the user data associated with that user.
 But how do I populate extra fields like user type on the fly?



 My ideas so far:

 1.   I try and maintain an in-memory cache of data such as “user”,
 and do a lookup to this cache for every user action and store the user data
 with it. #PROS: fast #CONS: not scalable, will run out of memory if data
 sets grow

 2.   For each user action, I do a call to RDBMS and look up the data
 for the user in question, then store the user action plus the user data as
 a single row. #PROS easy to scale #CONS slow

 3.   I write only the user id and the action straight away, and have
 a separate batch process that periodically goes through my table looking
 for rows without user data, and looks up the user data from RDBMS and
 populates it





 None of these solutions seem ideal to me. Does Cassandra have something
 like ‘triggers’, where I can set up a table to automatically populate some
 rows based on a lookup from another table? Or perhaps Spark or some other
 library has built-in functionality that solves exactly this problem?



 Any suggestions much appreciated.



 Thanks,

 Matthew





Re: How to interpret some GC logs

2015-06-03 Thread Carlos Rolo
GC logs are a weird science. I use a couple of resources to get through
them. Regarding your question, my 1.8.0_40 logs always have the first part
(the value before the "->"). I grepped through 2h of logs on a test environment.

I use the following set of options:

-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
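
If it helps, this is roughly how they end up in cassandra-env.sh (the JVM_OPTS
pattern from a stock install; the exact file path depends on your package
layout, so treat it as an assumption):

    # e.g. conf/cassandra-env.sh or /etc/cassandra/cassandra-env.sh
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
    JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"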

I don't know if the following could help but I use this tool to visualize
my GC behaviour: https://github.com/chewiebug/GCViewer
If I need insight into the GC as it goes I use Java VisualVM.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Tue, Jun 2, 2015 at 9:17 AM, Michał Łowicki mlowi...@gmail.com wrote:



 On Tue, Jun 2, 2015 at 9:06 AM, Sebastian Martinka 
 sebastian.marti...@mercateo.com wrote:

  this should help you:

 https://blogs.oracle.com/poonam/entry/understanding_cms_gc_logs


 I don't see such a format there. The options passed that relate to GC are:

 -XX:+PrintGCDateStamps -Xloggc:/var/log/cassandra/gc.log




 Best Regards,
 Sebastian Martinka



 *From:* Michał Łowicki [mailto:mlowi...@gmail.com]
 *Sent:* Monday, 1 June 2015 11:47
 *To:* user@cassandra.apache.org
 *Subject:* How to interpret some GC logs



 Hi,



 Normally I get logs like:



 2015-06-01T09:19:50.610+0000: 4736.314: [GC 6505591K->4895804K(8178944K),
 0.0494560 secs]



 which is fine and understandable but occasionalIy I see something like:

 2015-06-01T09:19:50.661+0000: 4736.365: [GC 4901600K(8178944K), 0.0049600
 secs]

 How should I interpret it? Is it just missing the part before the "->", i.e.
 the memory occupied before the GC cycle?

 --

 BR,
 Michał Łowicki




 --
 BR,
 Michał Łowicki







Re: Spark SQL JDBC Server + DSE

2015-06-03 Thread Brian O'Neill

Kudos Ben.  We've been tracking Zeppelin, and considered doing the same
thing.
You beat us to it.  Well done.

-brian

---
Brian O'Neill 
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42




From:  Ben Bromhead b...@instaclustr.com
Reply-To:  user@cassandra.apache.org
Date:  Tuesday, June 2, 2015 at 5:05 PM
To:  user@cassandra.apache.org
Subject:  Re: Spark SQL JDBC Server + DSE

If you want a web based notebook style approach (similar to ipython) check
out https://github.com/apache/incubator-zeppelin

And https://github.com/apache/incubator-zeppelin/pull/86

Bonus free pretty graphs!

On 1 June 2015 at 11:41, Sebastian Estevez sebastian.este...@datastax.com
wrote:
 Have you looked at job server?
 
 https://github.com/spark-jobserver/spark-jobserver
 https://www.youtube.com/watch?v=8k9ToZ4m6os
 http://planetcassandra.org/blog/post/fast-spark-queries-on-in-memory-datasets/
 
 All the best,
 
 
 Sebastián Estévez
 Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
 
 On Mon, Jun 1, 2015 at 8:13 AM, Mohammed Guller moham...@glassbeam.com
 wrote:
 Brian,
 We haven't open-sourced the REST server, but are not opposed to doing it. Just
 need to carve out some time to clean up the code and carve it out from all
 the other stuff that we do in that REST server.  Will try to do it in the
 next few weeks. If you need it sooner, let me know.
  
 I did consider the option of writing our own Spark SQL JDBC driver for C*,
 but it is lower on the priority list right now.
  
 
 Mohammed
  
 
 From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
 Sent: Saturday, May 30, 2015 3:12 AM
 
 
 To: user@cassandra.apache.org
 Subject: Re: Spark SQL JDBC Server + DSE
  
 
  
 
 Any chance you open-sourced, or could open-source the REST server? ;)
 
  
 
 In thinking about it…
 
 It doesn't feel like it would be that hard to write a Spark SQL JDBC driver
 against Cassandra, akin to what they have for hive:
 
 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
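 
 For anyone experimenting down that path, a rough sketch of pointing Spark's
 stock Thrift JDBC server at a Cassandra cluster via the open-source connector
 (the connector coordinates, version and host below are assumptions; DSE ships
 its own packaging of this):
 
   # from the Spark distribution's root directory
   ./sbin/start-thriftserver.sh \
     --packages com.datastax.spark:spark-cassandra-connector_2.10:1.3.0 \
     --conf spark.cassandra.connection.host=10.0.0.1
 
   # then connect with the bundled JDBC client
   ./bin/beeline -u jdbc:hive2://localhost:10000
 
 Cassandra tables still have to be registered with the SQL context (e.g. via
 CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.cassandra) before they
 show up over JDBC.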
 
  
 
 I wouldn't mind collaborating on that, if you are headed in that direction.
 
 (and then I could write the REST server on top of that)
 
  
 
 LMK,
 
  
 
 -brian
 
  
 
 ---
 Brian O'Neill 
 Chief Technology Officer
 Health Market Science, a LexisNexis Company
 215.588.6024 Mobile • @boneill42 http://www.twitter.com/boneill42
  
 
  
 
 From: Mohammed Guller moham...@glassbeam.com
 Reply-To: user@cassandra.apache.org
 Date: Friday, May 29, 2015 at 2:15 PM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: RE: Spark SQL JDBC Server + DSE
 
  
 
 Brian,
 I implemented a similar REST server last year and it works great. Now we have
 a requirement to support JDBC connectivity in addition to the REST API. We
 want to allow users to use tools like Tableau to connect to C* through the
 Spark SQL JDBC/Thrift server.
  
 
 Mohammed
  
 
 From: Brian O'Neill 

RE: How to store denormalized data

2015-06-03 Thread Matthew Johnson
Thanks Shahab,



That was my initial thought. The downside I can think of with that approach
is if/when we decide to use this data to serve suggestions in real time
back to the users (a sort of “if you clicked on this you might also like
to click on this”), where the algorithms would need to be driven off the
extra columns. Having said that, we are nowhere near that stage with
our application, so I could opt for the batch approach for now and cross
that bridge when we come to it! Just wondering if anyone else has already
solved this in a really elegant way :)



Cheers,

Matthew



*From:* Shahab Yunus [mailto:shahab.yu...@gmail.com]
*Sent:* 03 June 2015 15:55
*To:* user@cassandra.apache.org
*Subject:* Re: How to store denormalized data



Suggestion or rather food for thought



Do you expect to read/analyze the written data right away? Or will it be a
batch process, kicked off later in time? What I am trying to say is that if
the 'read/analysis' part is a) batch process and b) kicked off later in
time, then #3 is a fine solution? What harm in it? Also, you can slightly
change it, (if applicable) and not populate as a separate batch process but
in fact make part of  your analysis job? Kind of a pre-process/prep step?



Regards,

Shahab



On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson matt.john...@algomi.com
wrote:

Hi all,



I am trying to store some data (user actions in our application) for future
analysis (probably using Spark). I understand best practice is to store it
in denormalized form, and this will definitely make some of our future
queries much easier. But I have a problem with denormalizing the data.



For example, let’s say one of my queries is “the number of reports
generated by user type”. In the part of the application that the user
connects to to generate reports, we only have access to the user id. In a
traditional RDBMS, this is fine, because at query time you join the user id
onto the users table and get all the user data associated with that user.
But how do I populate extra fields like user type on the fly?



My ideas so far:

1.   I try and maintain an in-memory cache of data such as “user”, and
do a lookup to this cache for every user action and store the user data
with it. #PROS: fast #CONS: not scalable, will run out of memory if data
sets grow

2.   For each user action, I do a call to RDBMS and look up the data
for the user in question, then store the user action plus the user data as
a single row. #PROS easy to scale #CONS slow

3.   I write only the user id and the action straight away, and have a
separate batch process that periodically goes through my table looking for
rows without user data, and looks up the user data from RDBMS and populates
it





None of these solutions seem ideal to me. Does Cassandra have something
like ‘triggers’, where I can set up a table to automatically populate some
rows based on a lookup from another table? Or perhaps Spark or some other
library has built-in functionality that solves exactly this problem?



Any suggestions much appreciated.



Thanks,

Matthew


Re: How to store denormalized data

2015-06-03 Thread Jack Krupansky
Your requirement is still not quite clear - are you counting users or
reports, or reports of a type for each user, or...?

You can have a separate table, with the partition key being the user type,
and using the user id as a clustering column - provided that the number of
users is only thousands or no more than low millions. Then write a row
whenever a report is generated for a given type and user ID. Do you need to
count multiple instances of the same report for a given user? If so, you
can use a time stamp as an additional clustering column.
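
A sketch of that table in CQL (the names and types are illustrative, not from
the original post):

    CREATE TABLE reports_by_user_type (
        user_type  text,
        user_id    uuid,
        created_at timestamp,
        report_id  uuid,
        PRIMARY KEY (user_type, user_id, created_at)
    );

    -- "number of reports generated by user type" for one type
    SELECT COUNT(*) FROM reports_by_user_type WHERE user_type = 'admin';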


-- Jack Krupansky

On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson matt.john...@algomi.com
wrote:

 Hi all,



 I am trying to store some data (user actions in our application) for
 future analysis (probably using Spark). I understand best practice is to
 store it in denormalized form, and this will definitely make some of our
 future queries much easier. But I have a problem with denormalizing the
 data.



 For example, let’s say one of my queries is “the number of reports
 generated by user type”. In the part of the application that the user
 connects to to generate reports, we only have access to the user id. In a
 traditional RDBMS, this is fine, because at query time you join the user id
 onto the users table and get all the user data associated with that user.
 But how do I populate extra fields like user type on the fly?



 My ideas so far:

 1.   I try and maintain an in-memory cache of data such as “user”,
 and do a lookup to this cache for every user action and store the user data
 with it. #PROS: fast #CONS: not scalable, will run out of memory if data
 sets grow

 2.   For each user action, I do a call to RDBMS and look up the data
 for the user in question, then store the user action plus the user data as
 a single row. #PROS easy to scale #CONS slow

 3.   I write only the user id and the action straight away, and have
 a separate batch process that periodically goes through my table looking
 for rows without user data, and looks up the user data from RDBMS and
 populates it





 None of these solutions seem ideal to me. Does Cassandra have something
 like ‘triggers’, where I can set up a table to automatically populate some
 rows based on a lookup from another table? Or perhaps Spark or some other
 library has built-in functionality that solves exactly this problem?



 Any suggestions much appreciated.



 Thanks,

 Matthew