Re: Different number of records from COPY command
I ran into that issue a while ago and it was because I hit the tombstone limit on one of the nodes. Try running `nodetool compact adlog adclicklog20150528` and see if that helps.

Josef Lindman Hörnlund

On 02 Jun 2015, at 17:48, Saurabh Chandolia s.chando...@gmail.com wrote:

Still getting an inconsistent number of records at consistency ALL and QUORUM. Following is the output for consistency ALL and QUORUM.

cqlsh:adlog> CONSISTENCY ALL;
Consistency level set to ALL.
cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
Processed 58000 rows; Write: 3065.60 rows/s
58463 rows exported in 21.353 seconds.
cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
Processed 63000 rows; Write: 3517.03 rows/s
63972 rows exported in 22.885 seconds.
cqlsh:adlog> CONSISTENCY QUORUM;
Consistency level set to QUORUM.
cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
Processed 63000 rows; Write: 3443.37 rows/s
63440 rows exported in 21.987 seconds.
cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
Processed 65000 rows; Write: 3405.90 rows/s
65524 rows exported in 24.053 seconds.

- Saurabh

On Tue, Jun 2, 2015 at 9:09 PM, Anuj Wadehra anujw_2...@yahoo.co.in wrote:

I have never exported data myself, but can you just try setting 'CONSISTENCY ALL' in cqlsh before executing the command?

Thanks
Anuj Wadehra

Sent from Yahoo Mail on Android

From: Saurabh Chandolia s.chando...@gmail.com
Date: Tue, 2 Jun, 2015 at 8:47 pm
Subject: Different number of records from COPY command

I am seeing a different number of records each time I export a particular table. There were no writes or reads on this table while exporting the data. I am not able to understand why this is happening. Am I missing something here?

Cassandra version: 2.1.4
Java driver version: 2.1.5
Cluster size: 4 nodes in the same DC
Keyspace replication factor: 2

The following commands were issued:

cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
Processed 68000 rows; Write: 3025.93 rows/s
68682 rows exported in 27.737 seconds.
cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
Processed 65000 rows; Write: 2821.06 rows/s
65535 rows exported in 26.667 seconds.
cqlsh:adlog> copy adclicklog20150528 (imprid) TO 'adclicklog20150528.csv';
Processed 66000 rows; Write: 3285.07 rows/s
66055 rows exported in 26.269 seconds.

cfstats for adlog.adclicklog20150528:
---
$ nodetool cfstats adlog.adclicklog20150528
Keyspace: adlog
  Read Count: 217
  Read Latency: 2.773073732718894 ms.
  Write Count: 103191
  Write Latency: 0.10233075558915021 ms.
  Pending Flushes: 0
  Table: adclicklog20150528
    SSTable count: 11
    Space used (live): 37981202
    Space used (total): 37981202
    Space used by snapshots (total): 13407843
    Off heap memory used (total): 25580
    SSTable Compression Ratio: 0.26684147550494164
    Number of keys (estimate): 5627
    Memtable cell count: 94620
    Memtable data size: 13459445
    Memtable off heap memory used: 0
    Memtable switch count: 19
    Local read count: 217
    Local read latency: 2.774 ms
    Local write count: 103191
    Local write latency: 0.103 ms
    Pending flushes: 0
    Bloom filter false positives: 0
    Bloom filter false ratio: 0.0
    Bloom filter space used: 7192
    Bloom filter off heap memory used: 7104
    Index summary off heap memory used: 980
    Compression metadata off heap memory used: 17496
    Compacted partition minimum bytes: 1110
    Compacted partition maximum bytes: 182785
    Compacted partition mean bytes: 27808
    Average live cells per slice (last five minutes): 44.663594470046085
    Maximum live cells per slice (last five minutes): 86.0
    Average tombstones per slice (last five minutes): 0.0
    Maximum tombstones per slice (last five minutes): 0.0

- Saurabh
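With a replication factor of 2, row counts that change from run to run usually mean the two replicas have drifted apart. A targeted repair is one way to rule that out before re-running the export at CONSISTENCY ALL; a minimal sketch, reusing the keyspace and table names from this thread:

    nodetool repair adlog adclicklog20150528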
duplicate create table event
Hi,

I am using Cassandra 2.1.3 and the DataStax driver 2.1.3. I have four nodes in my datacenter, and in the client I have just one Cluster instance object. Now, when a table is created, my client receives 236 CREATE TABLE events within one second. Does anybody know why this happens? Thanks!

--
Joseph Gao
PhoneNum: 15210513582
QQ: 409343351
Re: How to set datastax-agent connect with JMX auth
Yes, I use the same username/pw in jvisualvm and it works. Is the username/pw configured in the OpsCenter UI or in the agent? When I add the cluster in the OpsCenter UI and fill in the JMX username/pw ...

From: Jason Wee
Sent: Wednesday, 3 June 2015, 12:41
To: user@cassandra.apache.org

The error in the log output looks similar to this: http://serverfault.com/questions/614810/opscenter-4-1-4-authentication-failing . In OpsCenter 5.1.2, do you configure the username/password to be the same as on the agent and the Cassandra node too?

jason

On Wed, Jun 3, 2015 at 11:13 AM, 贺伟平 wolai...@hotmail.com wrote:

I am using OpsCenter 5.1.2 and just enabled JMX username/password authentication on my Cassandra cluster. I think I've updated all my OpsCenter configs correctly to force the agents to use JMX auth, but it is not working. I've updated the config under /etc/opscenter/Clusters/[cluster-name].conf with the following jmx properties:

[jmx]
username=username
password=password
port=7199

I then restarted OpsCenter and the OpsCenter agents, but see the following error in the OpsCenter agent logs:

INFO [main] 2015-06-03 10:55:53,910 Loading conf files: ./conf/address.yaml
INFO [main] 2015-06-03 10:55:53,953 Java vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.7.0_51
INFO [main] 2015-06-03 10:55:53,953 DataStax Agent version: 5.1.2
INFO [main] 2015-06-03 10:55:54,010 Default config values: {:cassandra_port 9042, :rollups300_ttl 2419200, :settings_cf settings, :agent_rpc_interface localhost, :restore_req_update_period 60, :my_channel_prefix /agent, :poll_period 60, :jmx_username heweiping, :thrift_conn_timeout 1, :rollups60_ttl 604800, :stomp_port 61620, :shorttime_interval 10, :longtime_interval 300, :max-seconds-to-sleep 25, :private-conf-props [initial_token listen_address broadcast_address rpc_address broadcast_rpc_address], :thrift_port 9160, :async_retry_timeout 5, :agent-conf-group global-cluster-agent-group, :jmx_host 127.0.0.1, :ec2_metadata_api_host 169.254.169.254, :metrics_enabled 1, :async_queue_size 5000, :backup_staging_dir nil, :read-buffer-size 1000, :remote_verify_max 30, :disk_usage_update_period 60, :throttle-bytes-per-second 50, :rollups7200_ttl 31536000, :agent_rpc_broadcast_address localhost, :remote_backup_retries 3, :ssl_keystore nil, :rollup_snapshot_period 300, :is_package false, :monitor_command /usr/share/datastax-agent/bin/datastax_agent_monitor, :thrift_socket_timeout 5000, :remote_verify_initial_delay 1000, :cassandra_log_location /var/log/cassandra/system.log, :max-pending-repairs 5, :remote_backup_region us-west-1, :restore_on_transfer_failure false, :tmp_dir /var/lib/datastax-agent/tmp/, :config_md5 nil, :jmx_port 7299, :write-buffer-size 10, :jmx_metrics_threadpool_size 4, :use_ssl 0, :rollups86400_ttl 0, :nodedetails_threadpool_size 3, :api_port 61621, :kerberos_service nil, :backup_file_queue_max 1, :jmx_thread_pool_size 5, :production 1, :runs_sudo 1, :max_file_transfer_attempts 30, :jmx_password eefung, :stomp_interface 172.19.104.123, :storage_keyspace OpsCenter, :hosts [127.0.0.1], :rollup_snapshot_threshold 300, :jmx_retry_timeout 30, :unthrottled-default 100, :remote_backup_retry_delay 5000, :remote_backup_timeout 1000, :seconds-to-read-kill-channel 0.005, :realtime_interval 5, :pdps_ttl 259200}
INFO [main] 2015-06-03 10:55:54,174 Waiting for the config from OpsCenter
INFO [main] 2015-06-03 10:55:54,175 Attempting to determine Cassandra's broadcast address through JMX
INFO [main] 2015-06-03 10:55:54,176 Starting Stomp
INFO [main] 2015-06-03 10:55:54,176 Starting up agent communcation with OpsCenter.
INFO [Initialization] 2015-06-03 10:55:54,180 New JMX connection (127.0.0.1:7299)
WARN [Initialization] 2015-06-03 10:55:54,409 Error when trying to match our local token: java.lang.SecurityException: Authentication failed! Credentials required
INFO [main] 2015-06-03 10:55:59,412 Reconnecting to a backup OpsCenter instance
INFO [main] 2015-06-03 10:55:59,413 SSL communication is disabled
INFO [main] 2015-06-03 10:55:59,413 Creating stomp connection to 172.19.104.123:61620
INFO [Initialization] 2015-06-03 10:55:59,418 Sleeping for 2s before trying to determine IP over JMX again
WARN [clojure-agent-send-off-pool-0] 2015-06-03 10:55:59,422 Tried to send message while not connected: /conf-request [[172.19.104.123,0:0:0:0:0:0:0:1%1,fe80:0:0:0:225:90ff:fe6a:d35c%2,127.0.0.1],[5.1.2,\/437054467\/conf]]
INFO [StompConnection receiver] 2015-06-03 10:55:59,423 Reconnecting in 0s.
INFO [StompConnection receiver] 2015-06-03 10:55:59,424 Connected to 172.19.104.123:61620
INFO [main] 2015-06-03 10:55:59,432 Starting Jetty server: {:join? false, :ssl? false, :host localhost, :port 61621}

Checks with other JMX-based tools (nodetool, jmxtrans) confirm that the JMX setup is correct. Any ideas?

Thank you very much!

Sent from Windows Mail
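One way to confirm that the credentials themselves are accepted, independently of OpsCenter, is to point nodetool at the same JMX endpoint with the same username/password. A minimal check, with placeholder credentials and the port from the cluster conf above:

    nodetool -h 127.0.0.1 -p 7199 -u <jmx_username> -pw <jmx_password> status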
How to store denormalized data
Hi all,

I am trying to store some data (user actions in our application) for future analysis (probably using Spark). I understand best practice is to store it in denormalized form, and this will definitely make some of our future queries much easier. But I have a problem with denormalizing the data.

For example, let's say one of my queries is "the number of reports generated by user type". In the part of the application that the user connects to to generate reports, we only have access to the user id. In a traditional RDBMS this is fine, because at query time you join the user id onto the users table and get all the user data associated with that user. But how do I populate extra fields like user type on the fly? My ideas so far:

1. I try and maintain an in-memory cache of data such as "user", and do a lookup to this cache for every user action and store the user data with it.
   PROS: fast
   CONS: not scalable, will run out of memory if data sets grow

2. For each user action, I do a call to the RDBMS and look up the data for the user in question, then store the user action plus the user data as a single row.
   PROS: easy to scale
   CONS: slow

3. I write only the user id and the action straight away, and have a separate batch process that periodically goes through my table looking for rows without user data, and looks up the user data from the RDBMS and populates it.

None of these solutions seem ideal to me. Does Cassandra have something like 'triggers', where I can set up a table to automatically populate some rows based on a lookup from another table? Or perhaps Spark or some other library has built-in functionality that solves exactly this problem?

Any suggestions much appreciated.

Thanks,
Matthew
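On the 'triggers' question: Cassandra does have an experimental trigger API, where a Java class implementing ITrigger is attached to a table and runs on the write path. A minimal sketch of the CQL side only, with hypothetical keyspace, table, and class names:

    CREATE TRIGGER user_enrichment ON myks.user_actions
        USING 'com.example.UserEnrichmentTrigger';

The trigger class ships as a jar in each node's triggers directory, and because it runs on the write path it is generally discouraged for this kind of enrichment, which is why the batch or Spark-side approaches discussed in the replies tend to win out.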
Re: How to store denormalized data
Suggestion, or rather food for thought: do you expect to read/analyze the written data right away, or will it be a batch process kicked off later in time?

What I am trying to say is that if the 'read/analysis' part is (a) a batch process and (b) kicked off later in time, then #3 is a fine solution. What harm is in it? Also, you can slightly change it (if applicable) and not populate as a separate batch process, but in fact make it part of your analysis job, as a kind of pre-process/prep step.

Regards,
Shahab

On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson matt.john...@algomi.com wrote:
[snip]
Re: How to interpret some GC logs
GC logs are a weird science. I use a couple of resources to get through them. Regarding your question, my 1.8.0_40 logs always have the part before the '->' (I grepped through 2h of logs on a test environment). I use the following set of options:

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1

I don't know if the following could help, but I use this tool to visualize my GC behaviour: https://github.com/chewiebug/GCViewer If I need insight into the GC as it goes, I use Java VisualVM.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Tue, Jun 2, 2015 at 9:17 AM, Michał Łowicki mlowi...@gmail.com wrote:

On Tue, Jun 2, 2015 at 9:06 AM, Sebastian Martinka sebastian.marti...@mercateo.com wrote:

this should help you:
https://blogs.oracle.com/poonam/entry/understanding_cms_gc_logs

I don't see such a format there. The options passed that relate to GC are: -XX:+PrintGCDateStamps -Xloggc:/var/log/cassandra/gc.log

Best Regards,
Sebastian Martinka

From: Michał Łowicki [mailto:mlowi...@gmail.com]
Sent: Monday, 1 June 2015 11:47
To: user@cassandra.apache.org
Subject: How to interpret some GC logs

Hi,

Normally I get logs like:

2015-06-01T09:19:50.610+: 4736.314: [GC 6505591K->4895804K(8178944K), 0.0494560 secs]

which is fine and understandable, but occasionally I see something like:

2015-06-01T09:19:50.661+: 4736.365: [GC 4901600K(8178944K), 0.0049600 secs]

How should I interpret it? Does it just omit the part before the '->', i.e. the memory occupied before the GC cycle?

--
BR,
Michał Łowicki
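If it helps to reproduce Carlos's setup, the flags he lists are normally appended to the JVM options in cassandra-env.sh; a sketch, assuming the stock JVM_OPTS variable used by the 2.1 packages:

    # conf/cassandra-env.sh (sketch; adjust to your install)
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure"
    JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"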
Re: Spark SQL JDBC Server + DSE
Kudos Ben. We've been tracking Zeppelin, and considered doing the same thing. You beat us to it. Well done.

-brian

---
Brian O'Neill
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile
@boneill42 http://www.twitter.com/boneill42

From: Ben Bromhead b...@instaclustr.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, June 2, 2015 at 5:05 PM
To: user@cassandra.apache.org
Subject: Re: Spark SQL JDBC Server + DSE

If you want a web based notebook style approach (similar to ipython), check out https://github.com/apache/incubator-zeppelin and https://github.com/apache/incubator-zeppelin/pull/86

Bonus: free pretty graphs!

On 1 June 2015 at 11:41, Sebastian Estevez sebastian.este...@datastax.com wrote:

Have you looked at job server?

https://github.com/spark-jobserver/spark-jobserver
https://www.youtube.com/watch?v=8k9ToZ4m6os
http://planetcassandra.org/blog/post/fast-spark-queries-on-in-memory-datasets/

All the best,

Sebastián Estévez
Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
http://www.datastax.com/

On Mon, Jun 1, 2015 at 8:13 AM, Mohammed Guller moham...@glassbeam.com wrote:

Brian,

We haven't open sourced the REST server, but we are not opposed to doing it. We just need to carve out some time to clean up the code and carve it out from all the other stuff that we do in that REST server. I will try to do it in the next few weeks. If you need it sooner, let me know.

I did consider the option of writing our own Spark SQL JDBC driver for C*, but it is lower on the priority list right now.

Mohammed

From: Brian O'Neill [mailto:boneil...@gmail.com] On Behalf Of Brian O'Neill
Sent: Saturday, May 30, 2015 3:12 AM
To: user@cassandra.apache.org
Subject: Re: Spark SQL JDBC Server + DSE

Any chance you open-sourced, or could open-source, the REST server? ;)

Thinking about it, it doesn't feel like it would be that hard to write a Spark SQL JDBC driver against Cassandra, akin to what they have for Hive:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server

I wouldn't mind collaborating on that, if you are headed in that direction.
(And then I could write the REST server on top of that.)

LMK,
-brian

---
Brian O'Neill
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile
@boneill42 http://www.twitter.com/boneill42

From: Mohammed Guller moham...@glassbeam.com
Reply-To: user@cassandra.apache.org
Date: Friday, May 29, 2015 at 2:15 PM
To: user@cassandra.apache.org
Subject: RE: Spark SQL JDBC Server + DSE

Brian,

I implemented a similar REST server last year and it works great. Now we have a requirement to support JDBC connectivity in addition to the REST API. We want to allow users to use tools like Tableau to connect to C* through the Spark SQL JDBC/Thrift server.

Mohammed

From: Brian O'Neill
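For reference, the Spark documentation linked above starts the Thrift JDBC/ODBC server and connects to it with beeline roughly as follows; the master URL and port are assumptions, not something taken from this thread:

    ./sbin/start-thriftserver.sh --master <master-url>
    ./bin/beeline -u jdbc:hive2://localhost:10000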
RE: How to store denormalized data
Thanks Shahab,

That was my initial thought. The downside I can think of for that approach is if/when we decide to use this data to serve suggestions in real time back to the users (in a sort of "if you clicked on this you might also like to click on this"), the algorithms for that would need to be driven off the extra columns. Having said that, we are nowhere near that stage with our application, so I could opt for the batch approach for now and cross that bridge when we come to it! Just wondering if anyone else has already solved this in a really elegant way :)

Cheers,
Matthew

From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
Sent: 03 June 2015 15:55
To: user@cassandra.apache.org
Subject: Re: How to store denormalized data

[snip]
Re: How to store denormalized data
Your requirement is still not quite clear: are you counting users, or reports, or reports of a type for each user, or...?

You can have a separate table, with the partition key being the user type and the user id as a clustering column, provided that the number of users is only thousands or no more than low millions. Then write a row whenever a report is generated for a given type and user ID. Do you need to count multiple instances of the same report for a given user? If so, you can use a timestamp as an additional clustering column.

-- Jack Krupansky

On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson matt.john...@algomi.com wrote:
[snip]
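A minimal CQL sketch of the table Jack describes; the table and column names here are made up for illustration, not taken from the thread:

    -- partition by user type, cluster by user id and report time
    CREATE TABLE reports_by_user_type (
        user_type    text,
        user_id      uuid,
        generated_at timestamp,
        report_id    uuid,
        PRIMARY KEY (user_type, user_id, generated_at)
    );

    -- "number of reports generated by user type" becomes a single-partition count
    SELECT COUNT(*) FROM reports_by_user_type WHERE user_type = 'analyst';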