Cassandra DSC installation fails due to some Python dependencies. How to rectify?
I am trying to install Cassandra dsc20, but the installation fails due to some Python dependencies. How can I make this work?

root@server1:~# sudo apt-get install dsc20
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  cassandra libjna-java libopts25 ntp python python-minimal python-support python2.7 python2.7-minimal
Suggested packages:
  libjna-java-doc ntp-doc apparmor python-doc python-tk python2.7-doc binutils binfmt-support
Recommended packages:
  perl
The following NEW packages will be installed:
  cassandra dsc20 libjna-java libopts25 ntp python python-minimal python-support python2.7 python2.7-minimal
0 upgraded, 10 newly installed, 0 to remove and 0 not upgraded.
Need to get 17.1 MB of archives.
After this operation, 23.2 MB of additional disk space will be used.
Do you want to continue [Y/n]? y
Get:1 http://debian.datastax.com/community/ stable/main cassandra all 2.0.5 [14.3 MB]
Get:2 http://us.archive.ubuntu.com/ubuntu/ raring/main libopts25 amd64 1:5.17.1-1ubuntu2 [62.2 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu/ raring/main ntp amd64 1:4.2.6.p5+dfsg-2ubuntu1 [614 kB]
Get:4 http://us.archive.ubuntu.com/ubuntu/ raring/universe libjna-java amd64 3.2.7-4 [416 kB]
Get:5 http://us.archive.ubuntu.com/ubuntu/ raring-security/main python2.7-minimal amd64 2.7.4-2ubuntu3.2 [1223 kB]
Get:6 http://debian.datastax.com/community/ stable/main dsc20 all 2.0.5-1 [1302 B]
Get:7 http://us.archive.ubuntu.com/ubuntu/ raring-security/main python2.7 amd64 2.7.4-2ubuntu3.2 [263 kB]
Get:8 http://us.archive.ubuntu.com/ubuntu/ raring/main python-minimal amd64 2.7.4-0ubuntu1 [30.8 kB]
Get:9 http://us.archive.ubuntu.com/ubuntu/ raring/main python amd64 2.7.4-0ubuntu1 [169 kB]
Get:10 http://us.archive.ubuntu.com/ubuntu/ raring/universe python-support all 1.0.15 [26.7 kB]
Fetched 17.1 MB in 3s (4842 kB/s)
Selecting previously unselected package libopts25.
(Reading database ... 27688 files and directories currently installed.)
Unpacking libopts25 (from .../libopts25_1%3a5.17.1-1ubuntu2_amd64.deb) ...
Selecting previously unselected package ntp.
Unpacking ntp (from .../ntp_1%3a4.2.6.p5+dfsg-2ubuntu1_amd64.deb) ...
Selecting previously unselected package libjna-java.
Unpacking libjna-java (from .../libjna-java_3.2.7-4_amd64.deb) ...
Selecting previously unselected package python2.7-minimal.
Unpacking python2.7-minimal (from .../python2.7-minimal_2.7.4-2ubuntu3.2_amd64.deb) ...
Selecting previously unselected package python2.7.
Unpacking python2.7 (from .../python2.7_2.7.4-2ubuntu3.2_amd64.deb) ...
Selecting previously unselected package python-minimal.
Unpacking python-minimal (from .../python-minimal_2.7.4-0ubuntu1_amd64.deb) ...
Selecting previously unselected package python.
Unpacking python (from .../python_2.7.4-0ubuntu1_amd64.deb) ...
Selecting previously unselected package python-support.
Unpacking python-support (from .../python-support_1.0.15_all.deb) ...
Selecting previously unselected package cassandra.
Unpacking cassandra (from .../cassandra_2.0.5_all.deb) ...
Selecting previously unselected package dsc20.
Unpacking dsc20 (from .../archives/dsc20_2.0.5-1_all.deb) ...
Processing triggers for man-db ...
Processing triggers for desktop-file-utils ...
Setting up libopts25 (1:5.17.1-1ubuntu2) ...
Setting up ntp (1:4.2.6.p5+dfsg-2ubuntu1) ...
 * Starting NTP server ntpd   [ OK ]
Setting up libjna-java (3.2.7-4) ...
Setting up python2.7-minimal (2.7.4-2ubuntu3.2) ...
# Empty sitecustomize.py to avoid a dangling symlink
Traceback (most recent call last):
  File "/usr/lib/python2.7/py_compile.py", line 170, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/py_compile.py", line 162, in main
    compile(filename, doraise=True)
  File "/usr/lib/python2.7/py_compile.py", line 106, in compile
    with open(file, 'U') as f:
IOError: [Errno 2] No such file or directory: '/usr/lib/python2.7/sitecustomize.py'
dpkg: error processing python2.7-minimal (--configure):
 subprocess installed post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of python2.7:
 python2.7 depends on python2.7-minimal (= 2.7.4-2ubuntu3.2); however:
  Package python2.7-minimal is not configured yet.
dpkg: error processing python2.7 (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of python-minimal:
 python-minimal depends on python2.7-minimal (= 2.7.4-1~); however:
  Package python2.7-minimal is not configured yet.
dpkg: error processing python-minimal (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of python:
 python depends on python2.7 (= 2.7.4-1~); however:
  Package python2.7 is not configured yet.
 python depends on python-minimal (= 2.7.4-0ubuntu1); however:
  Package python-minimal is not configured yet.
dpkg: error processing python (--configure):
 dependency problems
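The traceback shows the python2.7-minimal post-install script failing because /usr/lib/python2.7/sitecustomize.py is a dangling symlink, so this is an Ubuntu Python packaging problem rather than anything Cassandra-specific. A commonly suggested workaround is to create the missing file by hand and then let dpkg finish configuring the half-installed packages (a sketch; back up the existing symlink first if you want to be safe):

```shell
# Replace the dangling symlink with an empty file so py_compile can open it
sudo rm -f /usr/lib/python2.7/sitecustomize.py
sudo touch /usr/lib/python2.7/sitecustomize.py

# Re-run configuration of the packages left unconfigured
sudo dpkg --configure -a
sudo apt-get -f install    # pull in anything still missing, then retry installing dsc20
```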
Re: How do I upgrade a single Cassandra node in production to a 3-node cluster?
I just mean increasing the cluster size, not upgrading the Cassandra version.

On Mon, Feb 17, 2014 at 2:29 AM, spa...@gmail.com wrote:
By upgrade do you mean only adding nodes, or also moving up the version of C*?

On Mon, Feb 17, 2014 at 2:23 AM, Erick Ramirez er...@ramirez.com.au wrote:
Ertio, It's not so much upgrading as simply adding more nodes to your existing setup. Cheers, Erick

On Sun, Feb 16, 2014 at 2:13 PM, Ertio Lew ertio...@gmail.com wrote:
I started off with a single Cassandra node on my 2 GB Digital Ocean VPS, but now I'm planning to upgrade it to a 3-node cluster. My single node contains around 10 GB of data spread across 10-12 column families. What should be the strategy for upgrading to a 3-node cluster, bearing in mind that my data must remain safe on this production server?

--
http://spawgi.wordpress.com
We can do it and do it better.
How do I upgrade a single Cassandra node in production to a 3-node cluster?
I started off with a single Cassandra node on my 2 GB Digital Ocean VPS, but now I'm planning to upgrade it to a 3-node cluster. My single node contains around 10 GB of data spread across 10-12 column families. What should be the strategy for upgrading to a 3-node cluster, bearing in mind that my data must remain safe on this production server?
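For reference, the usual shape of the procedure is to bootstrap the new nodes into the existing cluster one at a time. A rough sketch, assuming a packaged install and using 10.0.0.1 as a placeholder for the existing node's IP (verify the exact steps against the DataStax documentation for your Cassandra version):

```shell
# On each NEW node, after installing the same Cassandra version,
# edit /etc/cassandra/cassandra.yaml before the first start:
#   cluster_name:   must match the existing node exactly
#   - seeds:        "10.0.0.1"        # the existing node (placeholder IP)
#   listen_address: this node's own IP
# auto_bootstrap defaults to true, so the node streams its share of data on join.

sudo service cassandra start   # start ONE new node and wait for it to finish joining
nodetool status                # new node should show UN (Up/Normal) before adding the next

# Once all three nodes are in the ring, on the ORIGINAL node:
nodetool cleanup               # discard data the first node no longer owns
```

Starting the new nodes one at a time (rather than together) avoids them contending for the same bootstrap streams and keeps the existing node serving traffic throughout.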
Cassandra consuming too much memory in Ubuntu compared to Windows on the same machine
I run a development Cassandra single-node server under both Ubuntu and Windows 8 on my dual-boot 4 GB (RAM) machine. Cassandra runs fine under Windows without any crashes or OOMs; however, under Ubuntu on the same machine it always gives an OOM message.

$ sudo service cassandra start
xss = -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms4G -Xmx4G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -Xss256k

Here is the memory usage for an empty Cassandra server in Ubuntu:

PID 1169  USER cassandr  PR 20  NI 0  VIRT 2639m  RES 1.3g  SHR 17m  S  %CPU 1  %MEM 33.9  TIME 0:53.80  COMMAND java

The memory usage while running under Windows is very low relative to this. What is the reason behind this? Also, how can I prevent these OOMs within Ubuntu? I am running DataStax's DSC version 2.0.3.
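Worth noting: the startup line above shows the Linux init script launching the JVM with a fixed 4 GB heap (-Xms4G -Xmx4G) on a 4 GB machine, which leaves no room for the OS and will struggle regardless of platform differences. One way to rule this out is to cap the heap explicitly in cassandra-env.sh; the values below are illustrative for a 4 GB dev box, not tuned recommendations:

```shell
# /etc/cassandra/cassandra-env.sh (packaged-install path; may differ on your system)
# Setting both variables overrides the script's auto-sizing based on total RAM.
MAX_HEAP_SIZE="1G"     # total JVM heap for a small dev node (illustrative value)
HEAP_NEWSIZE="200M"    # young generation; the docs suggest roughly 100 MB per core
```

Then restart with `sudo service cassandra restart` and re-check the resident size in top.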
Re: Why does Solandra store Solr data in Cassandra? Isn't Solr a complete solution?
Yes, so what is Solr Cloud for, then? That already provides clustering support, so what's the need for Cassandra?

On Tue, Oct 1, 2013 at 2:06 AM, Sávio Teles savio.te...@lupa.inf.ufg.br wrote:
Solr's index sitting on a single machine, even if that single machine can vertically scale, is a single point of failure. And what about Cloud Solr?

2013/9/30 Ken Hancock ken.hanc...@schange.com
Yes.

On Mon, Sep 30, 2013 at 1:57 PM, Andrey Ilinykh ailin...@gmail.com wrote:
Also, be aware that while Cassandra has knobs to allow you to get consistent read results (CL=QUORUM), DSE Search does not. If a node drops messages for whatever reason (outage, mutation, etc.), its Solr indexes will be inconsistent with the other nodes in its replication group.

Will repair fix it?

--
Ken Hancock | System Architect, Advanced Advertising
SeaChange International
50 Nagog Park, Acton, Massachusetts 01720
ken.hanc...@schange.com | www.schange.com | NASDAQ:SEAC
Office: +1 (978) 889-3329 | Google Talk: ken.hanc...@schange.com | Skype: hancockks | Yahoo IM: hancockks
This e-mail and any attachments may contain information which is SeaChange International confidential. The information enclosed is intended only for the addressees herein and may not be copied or forwarded without permission from SeaChange International.

--
Best regards,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Master's student in Computer Science - UFG
Software Architect
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
Re: What is the best way to install and upgrade Cassandra on Ubuntu?
Thanks for the clarifications! By the way, DSC installs OpenJDK when Java is not present on your system. I don't know why it doesn't instead include the preferred Oracle JRE installation and take care of later updates to it as well; that could be a reason to choose DSC over the official Apache Debian package (as it would then be a complete package to run Cassandra), but otherwise I can't see any strong reason to prefer it!?

On Fri, Oct 4, 2013 at 4:34 AM, Daniel Chia danc...@coursera.org wrote:
Opscenter is a separate package: http://www.datastax.com/documentation/opscenter/3.2/webhelp/index.html?pagename=docsversion=opscenterfile=index#opsc/install/opscInstallDeb_t.html
Thanks, Daniel

On Tue, Oct 1, 2013 at 8:11 PM, Aaron Morton aa...@thelastpickle.com wrote:
Does DSC include other things like Opscenter by default?
Not sure, I've normally installed it with an existing cluster.
Would it be possible to remove any of these installations but keep the data intact and easily switch to the other, I mean switching from the DSC package to the Apache one or vice versa?
Yes. Same code, same data.
-
Aaron Morton
New Zealand
@aaronmorton
Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 30/09/2013, at 9:58 PM, Ertio Lew ertio...@gmail.com wrote:
Thanks Aaron! Does DSC include other things like Opscenter by default? I installed DSC on Linux, but Opscenter wasn't installed there; however, when I tried on Windows it was installed along with a JRE and Python, using the Windows installer. Would it be possible to remove any of these installations but keep the data intact and easily switch to the other, I mean switching from the DSC package to the Apache one or vice versa?

On Mon, Sep 30, 2013 at 1:10 PM, Aaron Morton aa...@thelastpickle.com wrote:
I am not sure if I should use DataStax's DSC or the official Debian packages from Cassandra. How do I choose between them for a production server?
They are technically the same. The DSC update will come out a little after the Apache release, and I _think_ they release for every Apache release.
1. When I upgrade to a newer version, would that retain my previous configurations so that I don't need to configure everything again?
Yes, if you select that when doing the package install.
2. Would that smoothly replace the previous installation by itself?
Yes.
3. What's the way (kindly, if you can, tell the command) to upgrade?
http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#upgrade/upgradeC_c.html#concept_ds_yqj_5xr_ck
4. When should I prefer DataStax's DSC to that? (I need to install for a production env.)
Above.
Hope that helps.
-
Aaron Morton
New Zealand
@aaronmorton
Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 27/09/2013, at 11:01 PM, Ertio Lew ertio...@gmail.com wrote:
I am not sure if I should use DataStax's DSC or the official Debian packages from Cassandra. How do I choose between them for a production server?

On Fri, Sep 27, 2013 at 11:02 AM, Ertio Lew ertio...@gmail.com wrote:
Could you please clarify:
1. When I upgrade to a newer version, would that retain my previous configurations so that I don't need to configure everything again?
2. Would that smoothly replace the previous installation by itself?
3. What's the way (kindly, if you can, tell the command) to upgrade?
4. When should I prefer DataStax's DSC to that? (I need to install for a production env.)

On Fri, Sep 27, 2013 at 12:50 AM, Robert Coli rc...@eventbrite.com wrote:
On Thu, Sep 26, 2013 at 12:05 PM, Ertio Lew ertio...@gmail.com wrote:
How do you install Cassandra on Ubuntu, and later how do you upgrade the installation on a node when an update has arrived? Do you simply download and replace the latest tar.gz, untarring it to replace the older Cassandra files? How do you do it? How does the upgrade process differ for a major version upgrade, like switching from the 1.2 series to the 2.0 series?
Use the deb packages. To upgrade, install the new package. Only upgrade a single major version, and be sure to consult NEWS.txt for any upgrade caveats. Also be aware of this sub-optimal behavior of the Debian packages: https://issues.apache.org/jira/browse/CASSANDRA-2356
=Rob
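Rob's advice about in-place package upgrades can be sketched as the sequence below. The exact package name (cassandra vs. dsc20) depends on which repository you use, and upgradesstables is only needed when crossing a major version; treat this as a sketch, not a substitute for reading NEWS.txt first:

```shell
nodetool drain                  # flush memtables; the node stops accepting writes
sudo service cassandra stop
sudo apt-get update
sudo apt-get install cassandra  # installs the new package version over the old one
sudo service cassandra start
nodetool upgradesstables        # rewrite sstables after a major-version upgrade
```

On a multi-node cluster this is done one node at a time (a rolling upgrade), so the cluster stays available throughout.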
Re: What is the best way to install and upgrade Cassandra on Ubuntu?
Thanks Aaron! Does DSC include other things like Opscenter by default? I installed DSC on Linux, but Opscenter wasn't installed there; however, when I tried on Windows it was installed along with a JRE and Python, using the Windows installer. Would it be possible to remove any of these installations but keep the data intact and easily switch to the other, I mean switching from the DSC package to the Apache one or vice versa?

On Mon, Sep 30, 2013 at 1:10 PM, Aaron Morton aa...@thelastpickle.com wrote:
I am not sure if I should use DataStax's DSC or the official Debian packages from Cassandra. How do I choose between them for a production server?
They are technically the same. The DSC update will come out a little after the Apache release, and I _think_ they release for every Apache release.
1. When I upgrade to a newer version, would that retain my previous configurations so that I don't need to configure everything again?
Yes, if you select that when doing the package install.
2. Would that smoothly replace the previous installation by itself?
Yes.
3. What's the way (kindly, if you can, tell the command) to upgrade?
http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#upgrade/upgradeC_c.html#concept_ds_yqj_5xr_ck
4. When should I prefer DataStax's DSC to that? (I need to install for a production env.)
Above.
Hope that helps.
-
Aaron Morton
New Zealand
@aaronmorton
Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 27/09/2013, at 11:01 PM, Ertio Lew ertio...@gmail.com wrote:
I am not sure if I should use DataStax's DSC or the official Debian packages from Cassandra. How do I choose between them for a production server?

On Fri, Sep 27, 2013 at 11:02 AM, Ertio Lew ertio...@gmail.com wrote:
Could you please clarify:
1. When I upgrade to a newer version, would that retain my previous configurations so that I don't need to configure everything again?
2. Would that smoothly replace the previous installation by itself?
3. What's the way (kindly, if you can, tell the command) to upgrade?
4. When should I prefer DataStax's DSC to that? (I need to install for a production env.)

On Fri, Sep 27, 2013 at 12:50 AM, Robert Coli rc...@eventbrite.com wrote:
On Thu, Sep 26, 2013 at 12:05 PM, Ertio Lew ertio...@gmail.com wrote:
How do you install Cassandra on Ubuntu, and later how do you upgrade the installation on a node when an update has arrived? Do you simply download and replace the latest tar.gz, untarring it to replace the older Cassandra files? How do you do it? How does the upgrade process differ for a major version upgrade, like switching from the 1.2 series to the 2.0 series?
Use the deb packages. To upgrade, install the new package. Only upgrade a single major version, and be sure to consult NEWS.txt for any upgrade caveats. Also be aware of this sub-optimal behavior of the Debian packages: https://issues.apache.org/jira/browse/CASSANDRA-2356
=Rob
Why does Solandra store Solr data in Cassandra? Isn't Solr a complete solution?
Solr's data is stored on the file system as a set of index files [http://stackoverflow.com/a/7685579/530153]. Then why do we need anything like Solandra or DataStax Enterprise Search? Isn't Solr a complete solution in itself? Why do we need to integrate it with Cassandra?
Among DataStax Community and the Cassandra Debian package, which to choose for a production install?
I think both provide the same thing, except DataStax Community also provides some extras like Opscenter, etc. But I could not find Opscenter installed when I installed DSC on Ubuntu, although in the Windows installation I saw Opscenter and a JRE as well. So I think for DSC there is no prerequisite for the Oracle JRE such as is required for the Cassandra Debian package; is that so? By the way, which is usually preferred for production installs? I may need to use Opscenter, but just occasionally.
Re: What is the best way to install and upgrade Cassandra on Ubuntu?
I am not sure if I should use DataStax's DSC or the official Debian packages from Cassandra. How do I choose between them for a production server?

On Fri, Sep 27, 2013 at 11:02 AM, Ertio Lew ertio...@gmail.com wrote:
Could you please clarify:
1. When I upgrade to a newer version, would that retain my previous configurations so that I don't need to configure everything again?
2. Would that smoothly replace the previous installation by itself?
3. What's the way (kindly, if you can, tell the command) to upgrade?
4. When should I prefer DataStax's DSC to that? (I need to install for a production env.)

On Fri, Sep 27, 2013 at 12:50 AM, Robert Coli rc...@eventbrite.com wrote:
On Thu, Sep 26, 2013 at 12:05 PM, Ertio Lew ertio...@gmail.com wrote:
How do you install Cassandra on Ubuntu, and later how do you upgrade the installation on a node when an update has arrived? Do you simply download and replace the latest tar.gz, untarring it to replace the older Cassandra files? How do you do it? How does the upgrade process differ for a major version upgrade, like switching from the 1.2 series to the 2.0 series?
Use the deb packages. To upgrade, install the new package. Only upgrade a single major version, and be sure to consult NEWS.txt for any upgrade caveats. Also be aware of this sub-optimal behavior of the Debian packages: https://issues.apache.org/jira/browse/CASSANDRA-2356
=Rob
What is the best way to install and upgrade Cassandra on Ubuntu?
How do you install Cassandra on Ubuntu, and later how do you upgrade the installation on a node when an update has arrived? Do you simply download and replace the latest tar.gz, untarring it to replace the older Cassandra files? How do you do it? How does the upgrade process differ for a major version upgrade, like switching from the 1.2 series to the 2.0 series?
Re: What is the best way to install and upgrade Cassandra on Ubuntu?
Could you please clarify:
1. When I upgrade to a newer version, would that retain my previous configurations so that I don't need to configure everything again?
2. Would that smoothly replace the previous installation by itself?
3. What's the way (kindly, if you can, tell the command) to upgrade?
4. When should I prefer DataStax's DSC to that? (I need to install for a production env.)

On Fri, Sep 27, 2013 at 12:50 AM, Robert Coli rc...@eventbrite.com wrote:
On Thu, Sep 26, 2013 at 12:05 PM, Ertio Lew ertio...@gmail.com wrote:
How do you install Cassandra on Ubuntu, and later how do you upgrade the installation on a node when an update has arrived? Do you simply download and replace the latest tar.gz, untarring it to replace the older Cassandra files? How do you do it? How does the upgrade process differ for a major version upgrade, like switching from the 1.2 series to the 2.0 series?
Use the deb packages. To upgrade, install the new package. Only upgrade a single major version, and be sure to consult NEWS.txt for any upgrade caveats. Also be aware of this sub-optimal behavior of the Debian packages: https://issues.apache.org/jira/browse/CASSANDRA-2356
=Rob
Why don't you start off with a "single small" Cassandra server as you usually do with MySQL?
For any website just starting out, the load is minimal initially and grows at a slow pace. People usually start their MySQL-based sites with a single server (and that too a VPS, not a dedicated server) running as both the app server and the DB server, usually get quite far with this setup alone, and only as they feel the need do they separate the DB from the app server, giving it a separate VPS. This is how a startup expects things to be when planning resource procurement.

But so far, what I have seen with Cassandra is something very different. People usually recommend starting out with at least a 3-node cluster, on dedicated servers, with lots and lots of RAM: 4 GB or 8 GB is what they suggest to start with. So is it that Cassandra requires more hardware resources than MySQL for a website to deliver similar performance, serve similar load/traffic, and hold the same amount of data? I understand the higher storage requirements of Cassandra due to replication, but what about the other hardware resources?

Can't we start off with Cassandra-based apps just as with MySQL, starting with 1 or 2 VPSs and adding more whenever there's a need? I don't want to compare apples with oranges. I just want to know how much more dangerous a situation I may be in when I start out with a single-node VPS-based Cassandra installation vs. a single-node VPS-based MySQL installation, and the difference between these two situations. Are Cassandra servers more prone to being unavailable than MySQL servers? What is bad about putting Tomcat alongside Cassandra, as people run a LAMP stack on a single server?

This question is also posted at StackOverflow (http://stackoverflow.com/questions/18462530/why-dont-you-start-off-with-a-single-small-cassandra-server-as-you-usually) and has an open bounty worth +50 rep.
Maintain backups for a single-node cluster
I would like to have a single-node Cassandra cluster initially, but to maintain backups for that single node, how about occasionally and temporarily adding a second node to the cluster as a replica (one that would contain the backup; this could be my dev machine as well, far away from the first node in some remote datacenter), so that data would be synchronized on both? Would it be possible to do this? Maybe I could do this backup once every 2-3 days.
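For comparison, the more conventional way to back up a single node is nodetool snapshot plus copying the snapshot off-box, which avoids the repair and consistency questions a come-and-go replica raises. A sketch, where mykeyspace, backup-host, and the paths are placeholders (the data path shown is the packaged-install default):

```shell
# Hard-link the current sstables into a named snapshot directory (cheap, near-instant)
nodetool snapshot -t nightly mykeyspace

# Copy the keyspace's data (including the snapshot) somewhere off the machine
rsync -a /var/lib/cassandra/data/mykeyspace/ backup-host:/backups/cassandra/mykeyspace/

# Remove the hard links once the copy is safely elsewhere
nodetool clearsnapshot mykeyspace
```

Run every 2-3 days from cron, this gives you the same cadence you describe without having to bootstrap and decommission a node each time.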
CustomTThreadPoolServer.java: Error occurred during processing of message.
I suddenly started to encounter this weird issue after writing some data to Cassandra. I don't know exactly what was written before this or what caused it to start happening.

ERROR [pool-2-thread-30] 2013-08-29 19:55:24,778 CustomTThreadPoolServer.java (line 205) Error occurred during processing of message.
java.lang.StringIndexOutOfBoundsException: String index out of range: -2147418111
	at java.lang.String.checkBounds(String.java:397)
	at java.lang.String.<init>(String.java:442)
	at org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:339)
	at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18958)
	at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
	at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
ERROR [pool-2-thread-31] 2013-08-29 19:55:24,910 CustomTThreadPoolServer.java (line 205) Error occurred during processing of message.
java.lang.StringIndexOutOfBoundsException: String index out of range: -2147418111
	[identical stack trace repeated]

Any ideas??
Re: CustomTThreadPoolServer.java: Error occurred during processing of message.
I am running Cassandra (1.0.0 final) as a single node with default configurations on a Windows dev machine, using Hector.

On Thu, Aug 29, 2013 at 10:50 PM, Ertio Lew ertio...@gmail.com wrote:
I suddenly started to encounter this weird issue after writing some data to Cassandra. I don't know exactly what was written before this or what caused it to start happening.

ERROR [pool-2-thread-30] 2013-08-29 19:55:24,778 CustomTThreadPoolServer.java (line 205) Error occurred during processing of message.
java.lang.StringIndexOutOfBoundsException: String index out of range: -2147418111
	at java.lang.String.checkBounds(String.java:397)
	at java.lang.String.<init>(String.java:442)
	at org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:339)
	at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18958)
	at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
	at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
ERROR [pool-2-thread-31] 2013-08-29 19:55:24,910 CustomTThreadPoolServer.java (line 205) Error occurred during processing of message.
java.lang.StringIndexOutOfBoundsException: String index out of range: -2147418111
	[identical stack trace repeated]

Any ideas??
Re: Which of these VPS configurations would perform better for Cassandra ?
Amazon seems to significantly overprice its services. If you look for a similar-size deployment elsewhere, like Linode or Digital Ocean (very competitive pricing), you'll notice huge differences. OK, some services and features are extra, but we may not all need them; and if you can host on non-dedicated virtual servers on Amazon, you can also do it with similar-configuration nodes elsewhere. IMO these huge costs associated with a Cassandra deployment are too heavy for small startups just starting out. I believe that if you consider a deployment for a similar application using MySQL, it should be quite a bit cheaper and more affordable (though I'm not exactly sure); at least you don't usually create a cluster from the beginning. Probably we made a wrong decision in choosing Cassandra considering only its technological advantages.
Re: Which of these VPS configurations would perform better for Cassandra ?
@David: Like all other start-ups, we too cannot start with all dedicated servers for Cassandra. So right now we have no better choice except to use a VPS :), but we can definitely choose one from amongst a suitable set of VPS configurations. As of now, since we are starting out, could we initiate our cluster with 2 nodes (RF=2), each KVM, 2 GB RAM, 2 cores, 30 GB SSD? Right now we won't be having a very heavy load on Cassandra for the next few months, until we grow our user base. So this choice is mainly based on pricing vs. configuration, as well as Digital Ocean's good reputation in the community.

On Sun, Aug 4, 2013 at 12:53 AM, David Schairer dschai...@humbaba.net wrote:
I've run several lab configurations on Linodes; I wouldn't run Cassandra on any shared virtual platform for large-scale production, just because your IO performance is going to be really hard to predict. Lots of people do, though; it depends on your Cassandra loads and how consistent you need performance to be, as well as how much of your working set will fit into memory. Remember that Linode significantly oversells their CPU as well.

The release version of KVM, at least as of a few months ago, still doesn't support TRIM on SSD; that, plus the fact that you don't know how others will use SSDs or whether their file systems will keep the SSDs healthy, means that SSD performance on KVM is going to be highly unpredictable. I have not tested DigitalOcean, but I did aggressively test several other KVM+SSD shared-tenant hosting providers for Cassandra a couple of months ago; they all failed badly. Your mileage will vary considerably based on what you need out of Cassandra, what your data patterns look like, and how you configure your system. That said, I would use Xen before KVM for high-performance IO.

I have not run Cassandra in any volume on Amazon; lots of folks have, and may have recommendations (including SSD) for where it falls on the price/performance curve.

--DRS

On Aug 3, 2013, at 11:33 AM, Ertio Lew ertio...@gmail.com wrote:
I am building a cluster (initially starting with 2-3 nodes). I have come across two seemingly good options for hosting: Linode and Digital Ocean. The VPS configurations for both are listed below:

Linode:
- XEN virtualization
- 2 GB RAM
- 8-core CPU (2x priority) (8-processor Xen instances)
- 96 GB storage

Digital Ocean:
- KVM virtualization
- 2 GB memory
- 2 cores
- 40 GB SSD disk

Digital Ocean's VPS is at half the price of the Linode VPS listed above. Could you clarify which of these two VPSs would be better for Cassandra nodes?
Which of these VPS configurations would perform better for Cassandra ?
I am building a cluster (initially starting with 2-3 nodes). I have come across two seemingly good options for hosting: Linode and Digital Ocean. The VPS configurations for both are listed below:

Linode:
- XEN virtualization
- 2 GB RAM
- 8-core CPU (2x priority) (8-processor Xen instances)
- 96 GB storage

Digital Ocean:
- KVM virtualization
- 2 GB memory
- 2 cores
- 40 GB SSD disk

Digital Ocean's VPS is at half the price of the Linode VPS listed above. Could you clarify which of these two VPSs would be better for Cassandra nodes?
Re:
I use hector On Thu, Apr 18, 2013 at 1:35 PM, aaron morton aa...@thelastpickle.comwrote: ERROR 08:40:42,684 Error occurred during processing of message. java.lang.StringIndexOutOfBoundsException: String index out of range: -214741811 1 at java.lang.String.checkBounds(String.java:397) at java.lang.String.init(String.java:442) at org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol .java:339) at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandr This is an error when the server is trying to read what the client has sent. Is this caused due to my application putting any corrupted data? Looks that way. What client are you using ? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 18/04/2013, at 3:21 PM, Ertio Lew ertio...@gmail.com wrote: I run cassandra on single win 8 machine for development needs. Everything has been working fine for several months but just today I saw this error message in cassandra logs all host pools were marked down. ERROR 08:40:42,684 Error occurred during processing of message. 
java.lang.StringIndexOutOfBoundsException: String index out of range: -2147418111
	at java.lang.String.checkBounds(String.java:397)
	at java.lang.String.<init>(String.java:442)
	at org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:339)
	at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18958)
	at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
	at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
After restarting the server everything worked fine again. I am curious to know what this is related to. Is this caused by my application putting in any corrupted data?
[no subject]
I run cassandra on a single Win 8 machine for development needs. Everything has been working fine for several months, but just today I saw this error message in the cassandra logs and all host pools were marked down.
ERROR 08:40:42,684 Error occurred during processing of message.
java.lang.StringIndexOutOfBoundsException: String index out of range: -2147418111
	at java.lang.String.checkBounds(String.java:397)
	at java.lang.String.<init>(String.java:442)
	at org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:339)
	at org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18958)
	at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
	at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
After restarting the server everything worked fine again. I am curious to know what this is related to. Is this caused by my application putting in any corrupted data?
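The negative length in the trace is characteristic of the server decoding garbage bytes as Thrift's 4-byte string-length prefix (TBinaryProtocol.readString reads an i32 size before the string bytes). A minimal sketch in plain Java, with no Thrift dependency, showing how the exact value from the log can fall out of four corrupt bytes:

```java
import java.nio.ByteBuffer;

public class FrameLengthDemo {
    public static void main(String[] args) {
        // Thrift's binary protocol reads a 4-byte big-endian length before
        // each string. If the client sends corrupt bytes (or a non-Thrift
        // client connects to the RPC port), those 4 bytes can decode to a
        // negative number:
        byte[] garbage = {(byte) 0x80, 0x01, 0x00, 0x01};
        int length = ByteBuffer.wrap(garbage).getInt();
        System.out.println(length); // -2147418111, the value in the log
        // new String(bytes, offset, length) with a negative length then
        // throws StringIndexOutOfBoundsException, as seen server-side.
    }
}
```

The garbage bytes here are chosen only to reproduce the logged value; the actual corrupt bytes on the wire are unknown.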
Re: Is it bad putting columns with composite or integer name in CF with ByteType comparator validator ?
Thoughts, please? On Thu, Nov 1, 2012 at 7:12 PM, Ertio Lew ertio...@gmail.com wrote: Would there be any harm or downsides if I store columns with composite names or integer-type names in a column family with a BytesType comparator/validator? I have observed that the BytesType comparator sorts integer-named columns in a similar fashion to the IntegerType comparator, so why should I lock my CF into storing only integer or only composite named columns? It would be good if I could just mix different datatypes in the same column family, no!?
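One concrete downside worth noting: BytesType compares raw bytes as unsigned values, lexicographically, so fixed-width big-endian integers only sort numerically while they share a sign. A small self-contained sketch of that ordering (compareUnsigned here is a stand-in for what a bytewise comparator does, not Cassandra's actual code):

```java
import java.nio.ByteBuffer;

public class BytesOrderDemo {
    // Unsigned lexicographic comparison over raw bytes, the ordering a
    // bytes-type comparator applies to column names.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    // 4-byte big-endian encoding of an int, as a client would serialize it.
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    public static void main(String[] args) {
        // Non-negative ints sort numerically under bytewise comparison:
        System.out.println(compareUnsigned(encode(3), encode(200)) < 0);  // true
        // ...but a negative int sorts AFTER all positives, because its
        // leading byte is >= 0x80:
        System.out.println(compareUnsigned(encode(-1), encode(200)) > 0); // true
    }
}
```

So mixing is workable only if column names of one row never cross sign boundaries (or you do not rely on numeric order).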
Re: Option for ordering columns by timestamp in CF
@B. Todd Burruss: Regarding the use cases, I think they are pretty common. At least I see such usages very frequently in my project. Let's say the application needs to store a timeline of bookmark activity by a user on certain items; if I could store the activity data as columns (with the concerned item id as the column name) and get it ordered by timestamp, then I could also fetch from that row whether or not a particular item was bookmarked by the user. Ordering columns by time is a very common requirement in any application, therefore if such a mechanism were provided by Cassandra, it would be really useful and convenient to app developers. On Sat, Oct 13, 2012 at 8:50 PM, Martin Koch m...@issuu.com wrote: One example could be to identify when a row was last updated. For example, if I have a column family for storing users, the row key is a user ID and the columns are values for that user; natural column names would be firstName, lastName, address, etc. Column names don't naturally include a date here. Sorting the columns by timestamp and picking the last would allow me to know when the row was last modified. (I could manually maintain a 'last modified' column as well, I know, but just coming up with a use case :). /Martin Koch On Fri, Oct 12, 2012 at 11:39 PM, B. Todd Burruss bto...@gmail.com wrote: Trying to think of a use case where you would want to order by timestamp and also have unique column names for direct access. Not really trying to challenge the use case, but you can get ordering by timestamp and still maintain a name for the column using composites. If the first component of the composite is a timestamp, then you can order on it. When retrieved, you could have a name in the second component, and have dupes as long as the timestamp is unique (use TimeUUID). On Fri, Oct 12, 2012 at 7:20 AM, Derek Williams de...@fyrie.net wrote: You probably already know this, but I'm pretty sure it wouldn't be a trivial change, since efficiently looking up a column by name requires the columns to be ordered by name. A separate index would be needed to provide lookup by column name if the row were sorted by timestamp (which is the way Redis implements its sorted set). On Fri, Oct 12, 2012 at 12:13 AM, Ertio Lew ertio...@gmail.com wrote: Make column timestamps optional -- kidding me, right? :) I do understand that this won't be possible, as then Cassandra won't be able to distinguish the latest among several copies of the same column. I don't mean that. I just want that, while ordering the columns, Cassandra (in an optional mode per CF) should not look at column names (they would still exist, but for retrieval purposes, not for ordering); instead, Cassandra would order the columns by looking at the timestamp values (timestamps would exist!). So the change would be just to provide a mode in which Cassandra, while ordering, uses timestamps instead of column names. On Fri, Oct 12, 2012 at 2:26 AM, Tyler Hobbs ty...@datastax.com wrote: Without thinking too deeply about it, this is basically equivalent to disabling timestamps for a column family and using timestamps for column names, though in a very indirect (and potentially confusing) manner. So, if you want to open a ticket, I would suggest framing it as "make column timestamps optional". On Wed, Oct 10, 2012 at 4:44 AM, Ertio Lew ertio...@gmail.com wrote: I think Cassandra should provide a configurable option, on a per column family basis, to sort columns by timestamp rather than by column names. This would be really helpful to maintain time-sorted columns without using up the column name as a timestamp, which might otherwise be used to store the most relevant column names useful for retrievals. Very frequently we need to store data sorted in time order, therefore I think this may be a very general requirement, not specific to just my use-case alone. Does it make sense to create an issue for this? On Fri, Mar 25, 2011 at 2:38 AM, aaron morton aa...@thelastpickle.com wrote: If you mean order by the column timestamp (as passed by the client), that is not possible. Can you use your own timestamps as the column name and store them as long values? Aaron On 25 Mar 2011, at 09:30, Narendra Sharma wrote: Cassandra 0.7.4. Column names in my CF are of type byte[] but I want to order columns by timestamp. What is the best way to achieve this? Does it make sense for Cassandra to support ordering of columns by timestamp as an option for a column family, irrespective of the column name type? Thanks, Naren -- Tyler Hobbs DataStax -- Derek Williams
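The composite idea discussed above (timestamp as the first component, real name as the second) can be sketched in plain Java; a TreeMap keyed by timestamp stands in for a (timestamp, name) composite comparator, and the values hold the second component:

```java
import java.util.TreeMap;

public class TimeOrderedRow {
    // In-memory model of a row whose column names lead with a timestamp.
    // The map key is the timestamp component; the value is the "real"
    // column name carried as the second composite component.
    static TreeMap<Long, String> row = new TreeMap<>();

    static void put(long timestamp, String name) {
        row.put(timestamp, name);
    }

    public static void main(String[] args) {
        put(1350000200L, "lastName");
        put(1350000100L, "firstName");
        put(1350000300L, "address");
        // Columns come back oldest-first...
        System.out.println(row.firstEntry().getValue()); // firstName
        // ...and the newest column answers Martin's "when was this row
        // last modified?" question without a manual marker column:
        System.out.println(row.lastEntry().getValue());  // address
    }
}
```

As Derek notes, the trade-off is that direct lookup by name now needs a separate index, since the row is no longer sorted by name.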
Re: Option for ordering columns by timestamp in CF
Make column timestamps optional -- kidding me, right? :) I do understand that this won't be possible, as then Cassandra won't be able to distinguish the latest among several copies of the same column. I don't mean that. I just want that, while ordering the columns, Cassandra (in an optional mode per CF) should not look at column names (they would still exist, but for retrieval purposes, not for ordering); instead, Cassandra would order the columns by looking at the timestamp values (timestamps would exist!). So the change would be just to provide a mode in which Cassandra, while ordering, uses timestamps instead of column names. On Fri, Oct 12, 2012 at 2:26 AM, Tyler Hobbs ty...@datastax.com wrote: Without thinking too deeply about it, this is basically equivalent to disabling timestamps for a column family and using timestamps for column names, though in a very indirect (and potentially confusing) manner. So, if you want to open a ticket, I would suggest framing it as "make column timestamps optional". On Wed, Oct 10, 2012 at 4:44 AM, Ertio Lew ertio...@gmail.com wrote: I think Cassandra should provide a configurable option, on a per column family basis, to sort columns by timestamp rather than by column names. This would be really helpful to maintain time-sorted columns without using up the column name as a timestamp, which might otherwise be used to store the most relevant column names useful for retrievals. Very frequently we need to store data sorted in time order, therefore I think this may be a very general requirement, not specific to just my use-case alone. Does it make sense to create an issue for this? On Fri, Mar 25, 2011 at 2:38 AM, aaron morton aa...@thelastpickle.com wrote: If you mean order by the column timestamp (as passed by the client), that is not possible. Can you use your own timestamps as the column name and store them as long values?
Aaron On 25 Mar 2011, at 09:30, Narendra Sharma wrote: Cassandra 0.7.4. Column names in my CF are of type byte[] but I want to order columns by timestamp. What is the best way to achieve this? Does it make sense for Cassandra to support ordering of columns by timestamp as an option for a column family, irrespective of the column name type? Thanks, Naren -- Tyler Hobbs DataStax http://datastax.com/
Re: Option for ordering columns by timestamp in CF
I think Cassandra should provide a configurable option, on a per column family basis, to sort columns by timestamp rather than by column names. This would be really helpful to maintain time-sorted columns without using up the column name as a timestamp, which might otherwise be used to store the most relevant column names useful for retrievals. Very frequently we need to store data sorted in time order, therefore I think this may be a very general requirement, not specific to just my use-case alone. Does it make sense to create an issue for this? On Fri, Mar 25, 2011 at 2:38 AM, aaron morton aa...@thelastpickle.com wrote: If you mean order by the column timestamp (as passed by the client), that is not possible. Can you use your own timestamps as the column name and store them as long values? Aaron On 25 Mar 2011, at 09:30, Narendra Sharma wrote: Cassandra 0.7.4. Column names in my CF are of type byte[] but I want to order columns by timestamp. What is the best way to achieve this? Does it make sense for Cassandra to support ordering of columns by timestamp as an option for a column family, irrespective of the column name type? Thanks, Naren
Re: RF on per column family basis ?
I heard that it is *not highly recommended* to create more than a single keyspace per application or per cluster!? Moreover, I fail to understand why Cassandra imposes the limitation of setting the RF per keyspace when, I guess, it makes more sense to set it on a per-CF basis!?
Re: Schema advice: (Single row or multiple row!?) How do I store millions of columns when I need to read a set of around 500 columns at a single read query using column names ?
Actually these columns are one per entity in my application, and I need to query, at any one time, the columns for a list of 300-500 entities in one go.
Re: Schema advice: (Single row or multiple row!?) How do I store millions of columns when I need to read a set of around 500 columns at a single read query using column names ?
For each user in my application, I want to store a *value* that is queried using the userId. So there is going to be one column for each user (userId as column name, *value* as column value). Now I want to store these columns such that I can efficiently read the columns for at least 300-500 users in a single read query.
Re: Schema advice: (Single row or multiple row!?) How do I store millions of columns when I need to read a set of around 500 columns at a single read query using column names ?
I want to read columns for a randomly selected list of userIds (completely random). I fetch the data using userIds, which would be used as column names in the case of a single row, or as row keys in the case of one row per user. Assume that the application knows the list of userIds it has to request from the DB.
Schema advice: (Single row or multiple row!?) How do I store millions of columns when I need to read a set of around 500 columns at a single read query using column names ?
I want to store hundreds of millions of columns (containing id1-to-id2 mappings) in the DB at any single time, and retrieve a set of about 200-500 columns: based on the column names (id1) if they are in a single row, or using row keys if each column is stored in its own row.
If I put them in a single row:
- the disadvantage is that the number of columns is quite big, which would lead to uneven load distribution, etc.
- the plus is that I can easily read all the columns I want, using column names, in a single row read.
But if I store each in its own row:
- I will have to read hundreds of rows (300-500, or in rare cases up to 1000) at a single time, which may lead to bad read performance (!?)
- it is a bit less space efficient.
What schema should I go with?
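A third option sometimes used for this shape of data (my suggestion, not something from the thread): hash each id1 into a fixed number of bucket rows, bounding both the width of any row and the number of rows a batch read touches. A sketch, with an assumed bucket count:

```java
public class IdBucketing {
    // Middle ground between one huge row and one row per id: a read for
    // 500 ids touches at most BUCKETS rows via multiget, with id1 still
    // used as the column name inside each bucket row.
    // BUCKETS = 64 is an assumed value; tune it to cluster size and
    // expected column counts.
    static final int BUCKETS = 64;

    static String rowKeyFor(long id1) {
        // floorMod keeps the bucket non-negative for negative hash values.
        return "idmap:" + Math.floorMod(Long.hashCode(id1), BUCKETS);
    }

    public static void main(String[] args) {
        // The same id always lands in the same bucket row:
        System.out.println(rowKeyFor(123456789L));
        System.out.println(rowKeyFor(123456789L).equals(rowKeyFor(123456789L))); // true
    }
}
```

The row key prefix "idmap:" is hypothetical; the point is only that the bucket is a pure function of id1, so both writes and batch reads can compute it client-side.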
How to make the search by columns in range case insensitive ?
I need to build a search-by-name index using entity names as column names in a row. This data is split across several rows, using the first 3 characters of the entity name as the row key and the remaining part as the column name; the column value contains the entity id. But there is a problem: I'm storing this data in a CF using the bytes type comparator, and I need to make case-insensitive queries to retrieve 'n' columns starting from a given point. Any ideas about how I should do that?
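One workaround that avoids a custom comparator (an assumption on my part, not from the thread): normalize names to lower case at write time and keep the original casing in the column value, so bytewise order becomes case-insensitive order. A sketch using the 3-character row-key split described above (indexEntry is a hypothetical helper):

```java
import java.util.Locale;

public class CaseInsensitiveIndex {
    // Normalize once at write time: row key = first 3 chars of the
    // lowercased name, column name = the rest, column value = entity id
    // plus the original-cased name for display. Queries lowercase the
    // user's input the same way and do an ordinary slice.
    static String[] indexEntry(String entityName, long entityId) {
        String norm = entityName.toLowerCase(Locale.ROOT);
        String rowKey = norm.substring(0, Math.min(3, norm.length()));
        String columnName = norm.length() > 3 ? norm.substring(3) : "";
        String columnValue = entityId + ":" + entityName; // original case kept
        return new String[] { rowKey, columnName, columnValue };
    }

    public static void main(String[] args) {
        String[] e = indexEntry("CaseSensitive Inc", 42L);
        System.out.println(e[0]); // cas
        System.out.println(e[1]); // esensitive inc
    }
}
```

Locale.ROOT avoids surprises like the Turkish dotless-i when lowercasing; full case-insensitivity for arbitrary Unicode needs proper collation, which this simple fold does not provide.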
How do I add a custom comparator class to a cassandra cluster ?
I need to add a custom comparator to a cluster, to sort columns in a certain customized fashion. How do I add the class to the cluster ?
Re: How do I add a custom comparator class to a cassandra cluster ?
Can I put this comparator class in a separate new jar (with just this single file), or does it have to be appended to the original jar along with the other comparator classes? On Tue, May 15, 2012 at 12:22 AM, Tom Duffield (Mailing Lists) tom.duffield.li...@gmail.com wrote: Kirk is correct. -- Tom Duffield (Mailing Lists) Sent with Sparrow http://www.sparrowmailapp.com/?sig On Monday, May 14, 2012 at 1:41 PM, Kirk True wrote: Disclaimer: I've never tried, but I'd imagine you can drop a JAR containing the class(es) into the lib directory and perform a rolling restart of the nodes. On 5/14/12 11:11 AM, Ertio Lew wrote: I need to add a custom comparator to a cluster, to sort columns in a certain customized fashion. How do I add the class to the cluster?
Re: How do I add a custom comparator class to a cassandra cluster ?
@Brandon: I just created a JIRA issue requesting that this type of comparator ship with Cassandra. It is a UTF8 comparator that provides case-insensitive ordering of columns. See the issue here: https://issues.apache.org/jira/browse/CASSANDRA-4245 On Tue, May 15, 2012 at 11:14 AM, Brandon Williams dri...@gmail.com wrote: On Mon, May 14, 2012 at 1:11 PM, Ertio Lew ertio...@gmail.com wrote: I need to add a custom comparator to a cluster, to sort columns in a certain customized fashion. How do I add the class to the cluster? I highly recommend against doing this, because you'll be locked in to your comparator and not have an easy way out. I dare say if none of the currently available comparators meet your needs, you're doing something wrong. -Brandon
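For reference, the ordering logic such a comparator needs can be expressed as a plain java.util.Comparator over the raw column-name bytes. A real Cassandra comparator would extend AbstractType and be loaded from the classpath as discussed above; this is only a sketch of the comparison itself:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

public class CaseInsensitiveCompare {
    // Decode both names as UTF-8 and compare case-insensitively.
    // duplicate() keeps the caller's buffer positions untouched.
    static final Comparator<ByteBuffer> UTF8_CI = (a, b) -> {
        String sa = StandardCharsets.UTF_8.decode(a.duplicate()).toString();
        String sb = StandardCharsets.UTF_8.decode(b.duplicate()).toString();
        return String.CASE_INSENSITIVE_ORDER.compare(sa, sb);
    };

    public static void main(String[] args) {
        ByteBuffer apple = ByteBuffer.wrap("Apple".getBytes(StandardCharsets.UTF_8));
        ByteBuffer banana = ByteBuffer.wrap("banana".getBytes(StandardCharsets.UTF_8));
        // Under plain bytewise order "Apple" (0x41...) sorts before
        // "banana" (0x62...) too, but "apple" vs "Banana" would not;
        // the case-insensitive comparator makes both cases consistent:
        System.out.println(UTF8_CI.compare(apple, banana) < 0); // true
    }
}
```

This also illustrates Brandon's lock-in warning: once rows are written under an ordering like this, reverting to a stock comparator would leave existing rows mis-sorted.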
Re: Schema advice/help
@R. Verlangen: You are suggesting keeping a single row for all activities and reading all the columns from that row, then filtering, right!? If done that way (instead of keeping them in 5 rows), I would need to retrieve 100-200 columns from a single row rather than just 50 columns across 5 rows. Which of these two would be better: more columns from a single row, or fewer columns from multiple rows? On Tue, Mar 27, 2012 at 2:27 PM, R. Verlangen ro...@us2.nl wrote: You can just get a slice range with userId: as start and no end. 2012/3/27 Maciej Miklas mac.mik...@googlemail.com: multiget would require the Order Preserving Partitioner, and this can lead to an unbalanced ring and hot spots. Maybe you can use a secondary index on itemtype - it must have small cardinality: http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/ On Tue, Mar 27, 2012 at 10:10 AM, Guy Incognito dnd1...@gmail.com wrote: Without the ability to do disjoint column slices, I would probably use 5 different rows. userId:itemType - activityId. Then it's a multiget slice of 10 items from each of your 5 rows. On 26/03/2012 22:16, Ertio Lew wrote: I need to store activities by each user, on 5 item types. I always want to read the last 10 activities on each item type by a user (i.e., total activities to read at a time = 50). I want to store these activities in a single row for each user so that they can be retrieved with a single row query, since I want to read the last 10 activities on each item type. I am thinking of creating composite names appending itemtype : activityId (activityId is just a timestamp value), but then I don't see how to read the last 10 activities from all item types. Any ideas about a better schema for this? -- With kind regards, Robin Verlangen www.robinverlangen.nl
Re: Adding Long type rows to a CF containing Integer(32) type row keys, without overlapping ?
I need to use the range beyond the int32 range, so I am using Long to write those keys. I am afraid this might lead to collisions with the previously stored integer keys in the same CF, even though I leave out the int32 range. On Mon, Mar 26, 2012 at 10:51 PM, aaron morton aa...@thelastpickle.com wrote: without them overlapping/disturbing each other (assuming that keys lie in the above ranges)? Not sure what you mean by overlapping. 42 as an int and 42 as a long are the same key. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 25/03/2012, at 9:47 PM, Ertio Lew wrote: I have been writing rows to a CF, all with integer (4 byte) keys. So my CF contains rows with keys in the entire range from Integer.MIN_VALUE to Integer.MAX_VALUE. Now I want to store Long type keys in this CF as well, **without disturbing the integer keys. The range of the Long keys would exclude the integer range, i.e. (-2^63 to -2^31) and (2^31 to 2^63). Would it be safe to mix the integer and long keys in a single CF without them overlapping/disturbing each other (assuming that keys lie in the above ranges)?
Schema advice/help
I need to store activities by each user, on 5 item types. I always want to read the last 10 activities on each item type by a user (i.e., total activities to read at a time = 50). I want to store these activities in a single row for each user so that they can be retrieved with a single row query, since I want to read the last 10 activities on each item type. I am thinking of creating composite names appending itemtype : activityId (activityId is just a timestamp value), but then I don't see how to read the last 10 activities from all item types. Any ideas about a better schema for this?
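The composite layout being considered can be modeled in plain Java to see why per-type slices work: zero-padding the timestamp makes string order match numeric order, so each item type's activities form one contiguous, reversible range of the row (col and lastOfType are hypothetical helpers, and a TreeMap stands in for the sorted row):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class ActivityRow {
    // Column name "itemType:activityId", activityId zero-padded to a
    // fixed 13 digits so lexicographic order equals numeric order.
    static String col(String itemType, long ts) {
        return String.format("%s:%013d", itemType, ts);
    }

    // Newest-first view of one item type's contiguous range of the row.
    static NavigableMap<String, String> lastOfType(TreeMap<String, String> row, String type) {
        return row.subMap(col(type, 0), true, col(type, 9999999999999L), true)
                  .descendingMap();
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        for (long t = 1; t <= 20; t++) row.put(col("photo", t), "photo-act" + t);
        for (long t = 1; t <= 20; t++) row.put(col("video", t), "video-act" + t);
        // The newest photo activity heads the reversed photo slice:
        System.out.println(lastOfType(row, "photo").firstEntry().getValue()); // photo-act20
    }
}
```

With 5 item types this is 5 reversed slices on the same row, which matches Guy's observation that without disjoint column slices you need one slice (or one row) per type.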
Adding Long type rows to a CF containing Integer(32) type row keys, without overlapping ?
I have been writing rows to a CF, all with integer (4 byte) keys. So my CF contains rows with keys in the entire range from Integer.MIN_VALUE to Integer.MAX_VALUE. Now I want to store Long type keys in this CF as well, **without disturbing the integer keys. The range of the Long keys would exclude the integer range, i.e. (-2^63 to -2^31) and (2^31 to 2^63). Would it be safe to mix the integer and long keys in a single CF without them overlapping/disturbing each other (assuming that keys lie in the above ranges)?
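For what it's worth, the byte-level view: a 4-byte int key and an 8-byte long key serialize to different byte strings, so whether 42 and 42L are "the same key" depends entirely on how the client serializes them, not on the numeric values. A quick check:

```java
import java.nio.ByteBuffer;

public class KeyWidthDemo {
    public static void main(String[] args) {
        // Row keys are compared as raw bytes, so a 4-byte encoding and an
        // 8-byte encoding of the same number are distinct keys:
        byte[] asInt  = ByteBuffer.allocate(4).putInt(42).array();
        byte[] asLong = ByteBuffer.allocate(8).putLong(42L).array();
        System.out.println(asInt.length);  // 4
        System.out.println(asLong.length); // 8
        // If, instead, the client starts serializing ALL keys as 8-byte
        // longs, then 42 written earlier as an int and 42L written now
        // really are different keys -- excluding the int32 numeric range
        // from the new long keys, as proposed, sidesteps the ambiguity.
    }
}
```

This is a sketch of the serialization question only; it says nothing about what any particular client library does by default.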
Re: Fwd: information on cassandra
I guess a 2 node cluster with RF=2 might also be a starting point. Isn't it? Are there any issues with this? On Sun, Mar 25, 2012 at 12:20 AM, samal samalgo...@gmail.com wrote: Cassandra has a distributed architecture, so 1 node does not fit it. It can still be used that way if you are just playing around; use VMs to learn how a cluster communicates and handles requests. To get full tolerance, redundancy and consistency, a minimum of 3 nodes is required. Important reading here: http://wiki.apache.org/cassandra/ http://www.datastax.com/docs/1.0/index http://thelastpickle.com/ http://www.acunu.com/blogs/all/ On Sat, Mar 24, 2012 at 11:37 PM, Garvita Mehta garvita.me...@tcs.com wrote: It's not advisable to use cassandra on a single node; by its basic definition, if a node fails the data still remains in the system, so at least 3 nodes should be there when setting up a cassandra cluster. Garvita Mehta CEG - Open Source Technology Group Tata Consultancy Services Ph:- +91 22 67324756 Mailto: garvita.me...@tcs.com Website: http://www.tcs.com Experience certainty. IT Services Business Solutions Outsourcing -puneet loya wrote: - To: user@cassandra.apache.org From: puneet loya puneetl...@gmail.com Date: 03/24/2012 06:36PM Subject: Fwd: information on cassandra Hi, I'm Puneet, an engineering student. I would like to know whether cassandra is useful considering we just have a single node (rather, a single system) holding all the information. I'm looking for decent response time from the database. Can you please respond? Thank you, Regards, Puneet Loya
Re: Using cassandra at minimal expenditures
expensive :-) I was expecting to start with 2GB nodes, if not 1GB initially. On Thu, Mar 1, 2012 at 3:43 PM, aaron morton aa...@thelastpickle.com wrote: As others said, it depends on load and traffic and all sorts of things. If you want a number, 4GB would be a reasonable minimum IMHO (you may get by with less); 8GB is about the top. Any memory not allocated to Cassandra will be used to map files into memory. If you can get machines with 8GB RAM, that's a reasonable start. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 1/03/2012, at 1:16 AM, Maki Watanabe wrote: Depends on your traffic :-) cassandra-env.sh will try to allocate the heap with the following formula if you don't specify MAX_HEAP_SIZE: 1. calculate 1/2 of the RAM on your system and cap it at 1024MB; 2. calculate 1/4 of the RAM on your system and cap it at 8192MB; 3. pick the larger value. So how about starting with the default? You will need to monitor the heap usage at first. 2012/2/29 Ertio Lew ertio...@gmail.com: Thanks, I think I don't need high consistency (as per my app requirements) so I might be fine with CL.ONE instead of quorum, so I think I'm probably going to be OK with a 2 node cluster initially. Could you guys also recommend some minimum memory to start with? Of course that would depend on my workload as well, but that's why I am asking for the minimum. On Wed, Feb 29, 2012 at 7:40 AM, Maki Watanabe watanabe.m...@gmail.com wrote: If you run your service with 2 nodes and RF=2, your data will be replicated but your service will not be redundant (you can't stop either of the nodes). If your service doesn't need strong consistency (i.e. it is acceptable that cassandra returns old data after a write, and writes may possibly be lost), you can use CL=ONE for reads and writes to keep availability. maki -- w3m
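Maki's three-step default-heap rule can be written out directly; the sample values below are computed from that formula, not measured on a real node:

```java
public class DefaultHeapCalc {
    // The default MAX_HEAP_SIZE rule quoted from cassandra-env.sh:
    // max( min(ram/2, 1024MB), min(ram/4, 8192MB) ).
    static long defaultHeapMb(long systemRamMb) {
        long half    = Math.min(systemRamMb / 2, 1024);
        long quarter = Math.min(systemRamMb / 4, 8192);
        return Math.max(half, quarter);
    }

    public static void main(String[] args) {
        System.out.println(defaultHeapMb(2048));  // 1024 -> a 2GB node gives half its RAM to the heap
        System.out.println(defaultHeapMb(8192));  // 2048
        System.out.println(defaultHeapMb(65536)); // 8192 -> the cap
    }
}
```

So on the 2GB nodes discussed here, the default leaves only about 1GB for the OS, page cache, and memory-mapped files, which is part of why 4GB is suggested as a practical minimum.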
Re: Using cassandra at minimal expenditures
@Aaron: Are you suggesting 3 nodes (rather than 2) to allow quorum operations even with the temporary loss of 1 node from the cluster's reach? I understand this, but another question just popped up in my mind. Probably since I'm not very experienced managing cassandra, I'm unaware: is it a usual case that some of the n nodes of my cluster may be down/unresponsive or out of the cluster's reach? (I actually considered this situation an exceptional circumstance, not a normal one!?) On Tue, Feb 28, 2012 at 2:34 AM, aaron morton aa...@thelastpickle.com wrote: *1.* I am wondering *what is the minimum recommended cluster size to start with*? IMHO 3 http://thelastpickle.com/2011/06/13/Down-For-Me/ A - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 28/02/2012, at 8:17 AM, Ertio Lew wrote: Hi, I'm creating a networking site using cassandra. I want to host this application, but initially with the lowest possible resources, then slowly increase the resources as the service's demand grows. *1.* I am wondering *what is the minimum recommended cluster size to start with*? Are there any issues if I start with as few as 2 nodes in the cluster? In that case I guess I would have a replication factor of 2. (This way I would require at minimum 3 VPS: 1 as the web server and 2 for the cassandra cluster, right?) *2.* Is anyone using cassandra with such minimal resources in production environments? Any experiences or difficulties encountered? *3.* In case you would like to, recommend some hosting service suitable for me, or suggest some other ways to minimize the resources (actually the hosting expenses).
Re: Using cassandra at minimal expenditures
Thanks, I think I don't need high consistency (as per my app requirements) so I might be fine with CL.ONE instead of quorum, so I think I'm probably going to be OK with a 2 node cluster initially. Could you guys also recommend some minimum memory to start with? Of course that would depend on my workload as well, but that's why I am asking for the minimum. On Wed, Feb 29, 2012 at 7:40 AM, Maki Watanabe watanabe.m...@gmail.com wrote: If you run your service with 2 nodes and RF=2, your data will be replicated but your service will not be redundant (you can't stop either of the nodes). If your service doesn't need strong consistency (i.e. it is acceptable that cassandra returns old data after a write, and writes may possibly be lost), you can use CL=ONE for reads and writes to keep availability. maki
Using cassandra at minimal expenditures
Hi, I'm creating a networking site using cassandra. I want to host this application, but initially with the lowest possible resources, then slowly increase the resources as the service's demand grows. *1.* I am wondering *what is the minimum recommended cluster size to start with*? Are there any issues if I start with as few as 2 nodes in the cluster? In that case I guess I would have a replication factor of 2. (This way I would require at minimum 3 VPS: 1 as the web server and 2 for the cassandra cluster, right?) *2.* Is anyone using cassandra with such minimal resources in production environments? Any experiences or difficulties encountered? *3.* In case you would like to, recommend some hosting service suitable for me, or suggest some other ways to minimize the resources (actually the hosting expenses).
Any tools like phpMyAdmin to see data stored in Cassandra ?
I have tried Sebastien's phpMyAdmin for Cassandra (https://github.com/sebgiroux/Cassandra-Cluster-Admin) to see the data stored in Cassandra in the same manner as phpMyAdmin allows. But since it makes assumptions about the datatypes of the column names/values and doesn't allow configuring, on a per-CF basis, the datatype the data should be read as, I couldn't make the best use of it. Are there any similar other tools out there that can do the job better?
Re: Any tools like phpMyAdmin to see data stored in Cassandra ?
On Mon, Jan 30, 2012 at 7:16 AM, Frisch, Michael michael.fri...@nuance.com wrote: OpsCenter? http://www.datastax.com/products/opscenter - Mike I have tried Sebastien's phpMyAdmin for Cassandra (https://github.com/sebgiroux/Cassandra-Cluster-Admin) to see the data stored in Cassandra in the same manner as phpMyAdmin allows. But since it makes assumptions about the datatypes of the column names/values and doesn't allow configuring, on a per-CF basis, the datatype the data should be read as, I couldn't make the best use of it. Are there any similar other tools out there that can do the job better? Thanks, that's a great product, but unfortunately it doesn't work with Windows. Any tools for Windows?
Re: Using 5-6 bytes for cassandra timestamps vs 8…
It obviously won't matter if your columns are fat, but there are several cases (at least I can think of several) where you need, for example, to store just an integer column name and an empty column value. Thus 12 bytes for the column, where 8 bytes is just the overhead to store the timestamp, doesn't look very nice. And skinny columns are a very common use-case, I believe. On Thu, Jan 19, 2012 at 1:26 PM, Maxim Potekhin potek...@bnl.gov wrote: I must have accidentally deleted all messages in this thread save this one. On the face value, we are talking about saving 2 bytes per column. I know it can add up with many columns, but relative to the size of the column -- is it THAT significant? I made an effort to minimize my CF footprint by replacing the natural column keys with integers (and translating back and forth when writing and reading). It's easy to see that in my case I achieve almost 50% storage savings, and at least 30%. But if the column in question contains more than 20 bytes -- what's up with trying to save 2? Cheers Maxim On 1/18/2012 11:49 PM, Ertio Lew wrote: I believe the timestamps *on a per column basis* are only required until compaction time; after that, it may also work if the timestamp range could be specified globally on a per SSTable basis. Thus, the timestamps until compaction are only required to measure the time from the initialization of the new memtable to the point the column is written to that memtable. You could easily fit that time in 4 bytes. This, I believe, would save at least 4 bytes of overhead for each column. Is anything related to these overheads under consideration or planned in the roadmap? On Tue, Sep 6, 2011 at 11:44 AM, Oleg Anastastasyev olega...@gmail.com wrote: I have a patch for trunk which I just have to get time to test a bit before I submit. It is for super columns and will use the super column's timestamp as the base and only store variant-encoded offsets in the underlying columns.
Could you please measure how much real benefit it brings (in real RAM consumption by the JVM)? It is hard to tell whether it will give noticeable results or not. AFAIK the memory structures used for the memtable consume much more memory, and a 64-bit JVM allocates memory aligned to 64-bit word boundaries. So a 37% reduction in memory consumption looks doubtful.
Re: Using 5-6 bytes for cassandra timestamps vs 8…
I believe the timestamps *on a per column basis* are only required until compaction time; after that, it may also work if the timestamp range could be specified globally on a per SSTable basis. Thus, the timestamps until compaction are only required to measure the time from the initialization of the new memtable to the point the column is written to that memtable. You could easily fit that time in 4 bytes. This, I believe, would save at least 4 bytes of overhead for each column. Is anything related to these overheads under consideration or planned in the roadmap? On Tue, Sep 6, 2011 at 11:44 AM, Oleg Anastastasyev olega...@gmail.com wrote: I have a patch for trunk which I just have to get time to test a bit before I submit. It is for super columns and will use the super column's timestamp as the base and only store variant-encoded offsets in the underlying columns. Could you please measure how much real benefit it brings (in real RAM consumption by the JVM)? It is hard to tell whether it will give noticeable results or not. AFAIK the memory structures used for the memtable consume much more memory, and a 64-bit JVM allocates memory aligned to 64-bit word boundaries. So a 37% reduction in memory consumption looks doubtful.
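A quick sanity check on the "fits in 4 bytes" claim, assuming millisecond resolution for the per-memtable offset (the offset clock is part of the proposal above, not current Cassandra behavior, and conventional client timestamps are microseconds):

```java
public class TimestampWindow {
    // How long an offset can an unsigned 4-byte millisecond counter cover?
    static long daysCoveredBy32BitMillis() {
        long ms = 1L << 32;                  // 2^32 distinct offsets
        return ms / (1000L * 60 * 60 * 24);  // milliseconds -> whole days
    }

    public static void main(String[] args) {
        System.out.println(daysCoveredBy32BitMillis()); // 49
    }
}
```

About 49 days at millisecond resolution, which is comfortably longer than a memtable's lifetime; at microsecond resolution the window shrinks to roughly 71 minutes, so the workable resolution depends on what precision the offsets must preserve.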
Re: Composite column names: How much space do they occupy?
Sorry, I forgot to mention that I'm using Hector to communicate with Cassandra. CS.toByteBuffer converts the composite type name to a ByteBuffer. Can anyone familiar with the Hector API enlighten me as to why I am seeing this size for the composite type names?

On Mon, Jan 2, 2012 at 2:52 PM, aaron morton aa...@thelastpickle.com wrote: What is the definition of the composite type and what is CS.toByteBuffer? CompositeTypes have a small overhead, see https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/marshal/CompositeType.java Hope that helps. Aaron - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 2/01/2012, at 6:25 PM, Ertio Lew wrote: I am storing composite column names which are made up of two integer components. However, I am shocked after seeing the storage overhead of these. I just tried out a composite name (with a single integer component):

Composite composite = new Composite();
composite.addComponent(-165376575, is);
System.out.println(CS.toByteBuffer(composite).array().length); // the result is 256

After writing and then reading back this composite column from Cassandra:

System.out.println(CS.toByteBuffer(readColumn.getName()).array().length); // the result is 91

How much is the storage overhead really? I am quite sure that I'm making a mistake in reading the actual values.
Re: Composite column names: How much space do they occupy?
Yes, that makes a lot of sense! Using the remaining() method I see the proper expected sizes.

On Mon, Jan 2, 2012 at 5:26 PM, Sylvain Lebresne sylv...@datastax.com wrote: I am not familiar enough with Hector to tell you if it is doing something special here, but note that: 1) you may have better luck getting that kind of question answered quickly on the Hector mailing list; 2) this may or may not change what you're seeing (since again I don't know what Hector is actually doing), but bb.array().length is not a reliable way to get the effective length of a ByteBuffer, as it is perfectly legit for a byte buffer to use only part of its underlying array. You should use the remaining() method instead. -- Sylvain

On Mon, Jan 2, 2012 at 12:29 PM, Ertio Lew ertio...@gmail.com wrote: Sorry, I forgot to mention that I'm using Hector to communicate with Cassandra. CS.toByteBuffer converts the composite type name to a ByteBuffer. Can anyone familiar with the Hector API enlighten me as to why I am seeing this size for the composite type names? On Mon, Jan 2, 2012 at 2:52 PM, aaron morton aa...@thelastpickle.com wrote: What is the definition of the composite type and what is CS.toByteBuffer? CompositeTypes have a small overhead, see https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/marshal/CompositeType.java Hope that helps. Aaron - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 2/01/2012, at 6:25 PM, Ertio Lew wrote: I am storing composite column names which are made up of two integer components. However, I am shocked after seeing the storage overhead of these.
I just tried out a composite name (with a single integer component):

Composite composite = new Composite();
composite.addComponent(-165376575, is);
System.out.println(CS.toByteBuffer(composite).array().length); // the result is 256

After writing and then reading back this composite column from Cassandra:

System.out.println(CS.toByteBuffer(readColumn.getName()).array().length); // the result is 91

How much is the storage overhead really? I am quite sure that I'm making a mistake in reading the actual values.
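Sylvain's point can be demonstrated with the plain JDK, no Hector required. A minimal sketch; the 256-byte backing array is just a stand-in for whatever a serializer happens to allocate:

```java
import java.nio.ByteBuffer;

public class BufferLength {
    public static void main(String[] args) {
        // A buffer whose backing array is larger than its live content,
        // as serializers commonly produce.
        ByteBuffer bb = ByteBuffer.wrap(new byte[256], 0, 7); // 7 live bytes
        System.out.println(bb.array().length); // 256 -- size of the backing array
        System.out.println(bb.remaining());    // 7   -- the effective length
    }
}
```

So array().length reports the allocation, while remaining() reports the bytes actually holding the serialized value.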
Composite column names: How much space do they occupy?
I am storing composite column names which are made up of two integer components. However, I am shocked after seeing the storage overhead of these. I just tried out a composite name (with a single integer component):

Composite composite = new Composite();
composite.addComponent(-165376575, is);
System.out.println(CS.toByteBuffer(composite).array().length); // the result is 256

After writing and then reading back this composite column from Cassandra:

System.out.println(CS.toByteBuffer(readColumn.getName()).array().length); // the result is 91

How much is the storage overhead really? I am quite sure that I'm making a mistake in reading the actual values.
Doubts related to composite type column names/values
With regard to the composite columns stuff in Cassandra, I have the following doubts: 1. What is the storage overhead of composite type column names/values? 2. What exactly is the difference between a DynamicComposite and a static Composite?
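On point 1, the CompositeType source linked earlier in this archive encodes each component as a 2-byte length prefix, the component's bytes, and a 1-byte end-of-component marker. A tiny helper (mine, not Cassandra code) makes that overhead concrete:

```java
public class CompositeSize {
    // Serialized size of a composite name: per component,
    // 2-byte length prefix + value bytes + 1 end-of-component byte.
    public static int serializedSize(int... componentLengths) {
        int total = 0;
        for (int len : componentLengths) {
            total += 2 + len + 1;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two 4-byte integer components cost 14 bytes, not 8.
        System.out.println(CompositeSize.serializedSize(4, 4));
    }
}
```

So for two-integer composite names the fixed overhead is 3 bytes per component on top of the raw values.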
Re: Second Cassandra users survey
Provide an option to sort columns by timestamp, i.e., in the order they have been added to the row, while retaining the ability to use any column names. On Wed, Nov 2, 2011 at 4:29 AM, Jonathan Ellis jbel...@gmail.com wrote: Hi all, Two years ago I asked for Cassandra use cases and feature requests. [1] The results [2] have been extremely useful in setting and prioritizing goals for Cassandra development. But with the release of 1.0 we've accomplished basically everything from our original wish list. [3] I'd love to hear from modern Cassandra users again, especially if you're usually a quiet lurker. What does Cassandra do well? What are your pain points? What's your feature wish list? As before, if you're in stealth mode or don't want to say anything in public, feel free to reply to me privately and I will keep it off the record. [1] http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01148.html [2] http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg01446.html [3] http://www.mail-archive.com/dev@cassandra.apache.org/msg01524.html -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Retrieving columns by names vs. by range: which is more performant?
Retrieving columns by names vs. by range: which is more performant, when you have the option to do both?
Re: Newbie question - fetching multiple columns of different datatypes and conversion from byte[]
Should column values or names of different datatypes first be read as a ByteBuffer and then converted to the appropriate type using Hector's serializer API, the way shown below?

ByteBuffer bb;
...
String s = StringSerializer.get().fromByteBuffer(bb);

Or are there better ways?
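For what it's worth, the equivalent of that String conversion can be done with the plain JDK as well (a sketch assuming the value bytes are UTF-8, which is what Hector's StringSerializer uses):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class StringDecode {
    // Decode the readable portion of a ByteBuffer as UTF-8 without
    // disturbing the buffer's position (hence the duplicate()).
    public static String fromByteBuffer(ByteBuffer bb) {
        return StandardCharsets.UTF_8.decode(bb.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(StringDecode.fromByteBuffer(bb)); // hello
    }
}
```

Using the serializer API is still the idiomatic Hector route; this just shows there is no magic behind it.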
Re: Cassandra Cluster Admin - phpMyAdmin for Cassandra
Thanks so much, SebWajam, for this great piece of work! Is there a way to set a datatype for displaying the column names/values of a CF? It seems that your project always uses StringSerializer for every piece of data; however, in most real-world cases this is not true. Could we somehow configure which serializer to use while reading the data, so that the data is properly identified by your project and delivered in a readable format?

On Mon, Aug 22, 2011 at 7:17 AM, SebWajam sebast...@wajam.com wrote: Hi, I've been working on this project for a few months now and I think it's mature enough to post it here: Cassandra Cluster Admin on GitHub https://github.com/sebgiroux/Cassandra-Cluster-Admin Basically, it's a GUI for Cassandra. If you're like me and used MySQL for a while (and are still using it!), you got used to phpMyAdmin and its simple and easy to use interface. I thought it would be nice to have a similar tool for Cassandra and I couldn't find any, so I built my own! Supported actions: - Keyspace manipulation (add/edit/drop) - Column family manipulation (add/edit/truncate/drop) - Row manipulation on column families and super column families (insert/edit/remove) - Basic data browser to navigate the data of a column family (seems to be the favorite feature so far) - Support for Cassandra 0.8+ atomic counters - Support for management of multiple Cassandra clusters Bug reports and/or pull requests are always welcome!
ByteBuffer as an initial serializer to read columns with mixed datatypes?
I have a mix of byte[] and Integer column names/values within a CF's rows. Should ByteBuffer be my initial choice of serializer when making read queries for the mixed datatypes, retrieving the byte[] or Integer from the ByteBuffer afterwards using its getInt() method? Is this a preferable way to read columns with integer/byte[] names: initially as ByteBuffers, later converting them to Integer or byte[]?
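A sketch of that approach with plain java.nio; note the length check is a simplifying assumption for illustration (in a real schema you need some out-of-band way to know each name's type, since any 4-byte value looks like an int):

```java
import java.nio.ByteBuffer;

public class MixedNames {
    // Interpret a column name as an Integer when it is exactly 4 bytes,
    // otherwise hand back the raw bytes. Purely illustrative.
    public static Object decode(ByteBuffer bb) {
        ByteBuffer copy = bb.duplicate(); // don't disturb the caller's buffer
        if (copy.remaining() == 4) {
            return copy.getInt();
        }
        byte[] raw = new byte[copy.remaining()];
        copy.get(raw);
        return raw;
    }

    public static void main(String[] args) {
        ByteBuffer intName = (ByteBuffer) ByteBuffer.allocate(4).putInt(42).flip();
        System.out.println(MixedNames.decode(intName)); // 42
    }
}
```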
Re: Authentication setup
Hey, I'm looking for a similar thing too. I guess this is a very common requirement and may soon be provided as built-in functionality packed with the Cassandra setup. Btw, it would be nice if someone has ideas about how to implement this for now.

On Fri, Oct 21, 2011 at 6:53 PM, Alexander Konotop alexander.kono...@gmail.com wrote: Hello :-) Does anyone have a working config with normal secure authentication? I've just installed Cassandra 1.0.0 and see that SimpleAuthenticator is meant to be non-secure and was moved to examples. I need a production config, so I tried writing this in the config:

authenticator: org.apache.cassandra.auth.AuthenticatedUser
authority: org.apache.cassandra.auth.AuthenticatedUser

But during Cassandra startup the log says: org.apache.cassandra.config.ConfigurationException: No default constructor for authenticator class 'org.apache.cassandra.auth.AuthenticatedUser'. As I understand it, either AuthenticatedUser is the wrong class or I simply don't know how to set it up -- does it need additional configs similar to access.properties or passwd.properties? Maybe there's a way to store users in the Cassandra DB itself, like, for example, MySQL does? I've searched and tried lots of things the whole day, but the only info I found were two phrases -- the first said that SimpleAuthenticator is just a toy, and the second said to look into the source for more auth methods. But, for example, this:

package org.apache.cassandra.auth;

import java.util.Collections;
import java.util.Set;

/**
 * An authenticated user and her groups.
 */
public class AuthenticatedUser
{
    public final String username;
    public final Set<String> groups;

    public AuthenticatedUser(String username)
    {
        this.username = username;
        this.groups = Collections.emptySet();
    }

    public AuthenticatedUser(String username, Set<String> groups)
    {
        this.username = username;
        this.groups = Collections.unmodifiableSet(groups);
    }

    @Override
    public String toString()
    {
        return String.format("#<User %s groups=%s>", username, groups);
    }
}

tells me just about nothing :-( Best regards, Alexander
Using counters in 0.8
I am using Hector for a project and wanted to try out counters with the latest 0.8 version of Cassandra. How do we work with counters in the 0.8 version? Any web links to such examples are appreciated. Has Hector started to provide an API for that?
Column values (integer) need frequent updates/increments
Hi, I am working on a question/answers web app using Cassandra (consider it very similar to the StackOverflow sites). I need to build the reputation system for users of the application: a user's reputation increases when s/he answers somebody's question correctly. So if I keep users' reputation scores as column values, those columns are updated very frequently, and I end up with several versions of a single column, which I guess is very bad. Similarly for the questions, the number of up-votes will increase very frequently, and hence again I'll get several versions of the same column. How should I try to minimize this ill effect?

What I have thought of: use a separate CF for the reputation system, so that the memtable holds most of the columns (containing users' reputation scores). Frequent updates will then update the column in the memtable, which means easier reads as well as updates. These reputation columns are anyway small and do not explode in number (an update only replaces another column).
Is it possible to get just a count of the number of columns in a row, in an efficient manner?
Can I get just a count of the number of columns in a row without deserializing all the columns in the row? Or, for situations where the total count is read much more frequently than the actual columns, should I prefer a counter column that maintains the number of columns currently present in the row?
Re: Using a synchronized counter that keeps track of the number of users on the application to allot user IDs/keys to new users after sign-up
Hi Ryan, I am considering snowflake as an option for my usage with Cassandra in a distributed application. As I understand it, snowflake uses 64-bit IDs. I am looking for a solution that could help me generate 64-bit IDs, but within those 64 bits I would like at least 4 free bits that I could manipulate to distinguish two rows for the same entity (split by kind of data) in the same column family. If I could keep snowflake's ID size to around 60 bits, that would be great for my use case. Is it possible to safely shrink the IDs to around 60 bits? Perhaps the millisecond precision is not required to that much depth for my use case. Any kind of suggestions would be appreciated. Best Regards, Ertio Lew

On Fri, Feb 4, 2011 at 1:09 AM, Ryan King r...@twitter.com wrote: You could also consider snowflake: http://github.com/twitter/snowflake which gives you ids that roughly sort by time (but aren't sequential). -ryan

On Thu, Feb 3, 2011 at 11:13 AM, Matthew E. Kennedy matt.kenn...@spadac.com wrote: Unless you need your user identifiers to be sequential for some reason, I would save yourself the headache of this kind of complexity and just use UUIDs if you have to generate an identifier.

On Feb 3, 2011, at 2:03 PM, Aklin_81 wrote: Hi all, To generate new keys/user IDs for new users on my application, I am thinking of using a simple synchronized counter that keeps track of the number of users registered on my application; when a new user signs up, he can be allotted the next available id. Since Cassandra is eventually consistent, is this advisable to implement with Cassandra? I could also use a stronger consistency level like quorum or all for this purpose. Please let me know your thoughts and suggestions. Regards, Asil -- @rk
Re: Using a synchronized counter that keeps track of the number of users on the application to allot user IDs/keys to new users after sign-up
On Tue, Mar 1, 2011 at 1:26 AM, Aaron Morton aa...@thelastpickle.com wrote: This is mostly from memory, but the last 12 bits (4096 decimal) are a counter for the number of IDs generated in a particular millisecond on that server. You could use the high 4 bits in that range for your data type flags and the low 8 for the counter.

So then I would be able to generate a maximum of up to 256 IDs per millisecond (256,000 per second) on one machine!? That seems like a very good limit for my use case. I don't think I would ever need more than that, since my write volumes are quite low compared to that limit. Should I go for it, or are there still other things to consider?

Aaron On 1/03/2011, at 4:41 AM, Ertio Lew ertio...@gmail.com wrote: Hi Ryan, I am considering snowflake as an option for my usage with Cassandra in a distributed application. As I understand it, snowflake uses 64-bit IDs. I am looking for a solution that could help me generate 64-bit IDs, but within those 64 bits I would like at least 4 free bits that I could manipulate to distinguish two rows for the same entity (split by kind of data) in the same column family. If I could keep snowflake's ID size to around 60 bits, that would be great for my use case. Is it possible to safely shrink the IDs to around 60 bits? Perhaps the millisecond precision is not required to that much depth for my use case. Any kind of suggestions would be appreciated. Best Regards, Ertio Lew

On Fri, Feb 4, 2011 at 1:09 AM, Ryan King r...@twitter.com wrote: You could also consider snowflake: http://github.com/twitter/snowflake which gives you ids that roughly sort by time (but aren't sequential). -ryan

On Thu, Feb 3, 2011 at 11:13 AM, Matthew E. Kennedy matt.kenn...@spadac.com wrote: Unless you need your user identifiers to be sequential for some reason, I would save yourself the headache of this kind of complexity and just use UUIDs if you have to generate an identifier.

On Feb 3, 2011, at 2:03 PM, Aklin_81 wrote: Hi all, To generate new keys/user IDs for new users on my application, I am thinking of using a simple synchronized counter that keeps track of the number of users registered on my application; when a new user signs up, he can be allotted the next available id. Since Cassandra is eventually consistent, is this advisable to implement with Cassandra? I could also use a stronger consistency level like quorum or all for this purpose. Please let me know your thoughts and suggestions. Regards, Asil -- @rk
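Aaron's suggestion above (carving the 12-bit sequence field into 4 type bits plus an 8-bit counter) can be sketched as follows. The field widths mirror snowflake's timestamp/worker/sequence layout, but the exact shifts, epoch handling, and names here are illustrative, not Twitter's implementation:

```java
public class TypedSnowflakeId {
    // Layout (high to low): ~41 bits timestamp | 10 bits worker id
    //                       | 4 bits data-type flag | 8 bits sequence
    public static long makeId(long timestampMs, long workerId, long type, long seq) {
        return (timestampMs << 22) | (workerId << 12) | (type << 8) | seq;
    }

    public static long typeOf(long id) { return (id >>> 8) & 0xF; }
    public static long seqOf(long id)  { return id & 0xFF; }

    public static void main(String[] args) {
        long id = makeId(1_300_000_000_000L, 7, 5, 200);
        System.out.println(typeOf(id)); // 5
        System.out.println(seqOf(id));  // 200
    }
}
```

With only 8 sequence bits this caps generation at 256 IDs per millisecond per worker, which is the limit discussed in the thread.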
Specifying row caching on a per-query basis?
Is there any way to specify on a per-query basis (like we specify the consistency level) which rows should be cached while you're reading them, from a row-cache-enabled CF? I believe this could lead to much more efficient use of the cache space (if you use the same data for different features/parts of your application which have different caching needs).
Re: Specifying row caching on a per-query basis?
Is this under consideration for future releases, or being thought about? On Thu, Feb 10, 2011 at 12:56 AM, Jonathan Ellis jbel...@gmail.com wrote: Currently there is not. On Wed, Feb 9, 2011 at 12:04 PM, Ertio Lew ertio...@gmail.com wrote: Is there any way to specify on a per-query basis (like we specify the consistency level) which rows should be cached while you're reading them, from a row-cache-enabled CF? I believe this could lead to much more efficient use of the cache space (if you use the same data for different features/parts of your application which have different caching needs). -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Merging the rows of two column families (with similar attributes) into one?
Thanks for adding that, Benjamin!

On Wed, Feb 9, 2011 at 1:40 AM, Benjamin Coverston ben.covers...@datastax.com wrote: On 2/4/11 11:58 PM, Ertio Lew wrote: Yes, a disadvantage of a larger number of CFs in terms of memory utilization, as I see it, is: if some CF is written less often than other CFs, its memtable consumes space in memory until it is flushed; that memory could have been put to much better use by a CF that is heavily written and read. And if you try to make the flush thresholds smaller, more compactions are needed.

One more disadvantage here is that with CFs that vary widely in write rate you can also end up with fragmented commit logs, which in some cases we have seen actually fill up the commit log partition. As a consequence, one thing to consider would be to lower the commit log flush threshold (in minutes) for the column families that do not see heavy use.

On Sat, Feb 5, 2011 at 11:58 AM, Ertio Lew ertio...@gmail.com wrote: Thanks Tyler! I could not fully understand why more column families would mean more memory. If you have parameters like memtable_throughput and memtable_operations under control, which are set on a per-column-family basis, then you can directly adjust them by splitting the memory space between two CFs in proportion to what you would allot to a single CF. Hence there should be no extra memory consumption for multiple CFs that have been split from a single one? Regarding the compactions, I think even if there are more of them, the SSTable files to be compacted are smaller, as the data has been split in two. More compactions, but smaller ones! Then, given the same amount of data, how can a greater number of column families be a bad option (if you split the memory-consumption parameter values proportionately)?

-- Regards, Ertio

On Sat, Feb 5, 2011 at 10:43 AM, Tyler Hobbs ty...@datastax.com wrote: "I read somewhere that more column families are not a good idea, as they consume more memory and cause more compactions." This is primarily true, but not in every case. "But the caching requirements may be different, as they cater to two different features." This is a great reason to *not* merge them. Besides the key and row caches, don't forget about the OS buffer cache. "Is it recommended to merge these two column families into one? Thoughts?" No, this sounds like an anti-pattern to me. The overhead from having two separate CFs is not that high. -- Tyler Hobbs Software Engineer, DataStax Maintainer of the pycassa Cassandra Python client library
Re: Merging the rows of two column families (with similar attributes) into one?
Thanks Tyler! I think I'll have to very carefully take all these factors into consideration before deciding how to split my data into CFs, as this has no objective answer. I am expecting at least around 8 column families for my entire application if I split the data strictly according to the various features and requirements of the application. I think there should be a provision for specifying, on a per-query basis, which rows should be cached while you're reading them from a row-cache-enabled CF. That way you could easily merge similar data for different features of your application into a single CF. I believe this would also lead to much more efficient use of the cache space (if you use the same data for different parts of your app which have different caching needs). Regards, Ertio

On Sun, Feb 6, 2011 at 1:22 AM, Tyler Hobbs ty...@datastax.com wrote: "If you have parameters like memtable_throughput and memtable_operations under control, which are set on a per-column-family basis, then you can directly adjust them by splitting the memory space between two CFs in proportion to what you would allot to a single CF. Hence there should be no extra memory consumption for multiple CFs that have been split from a single one?" Yes, I think you have the right idea here. There is a small amount of overhead for the extra memtable and for keeping track of a second set of indexes, bloom filters, sstables, etc. "Regarding the compactions, I think even if there are more of them, the SSTable files to be compacted are smaller, as the data has been split in two. More compactions, but smaller ones!" Yes. "If some CF is written less often than other CFs, its memtable consumes space in memory until it is flushed; that memory could have been put to much better use by a CF that is heavily written and read. And if you try to make the flush thresholds smaller, more compactions are needed."

If you merge the two CFs together, then updates to the 'less frequent' rows will still consume memory, only it will all be within one memtable. (Memtables grow in size until they are flushed; they don't reserve some set amount of memory.) Furthermore, because your memtables will be filled up by the 'more frequent' rows, the 'less frequent' rows will get fewer updates/overwrites in memory, so they will tend to be spread across a greater number of SSTables. -- Tyler Hobbs Software Engineer, DataStax Maintainer of the pycassa Cassandra Python client library
Merging the rows of two column families (with similar attributes) into one?
I read somewhere that a larger number of column families is not a good idea, as it consumes more memory and causes more compactions, so I am trying to reduce the number of column families by adding the rows of other column families (with similar attributes) as separate rows into one. I have two kinds of data for two separate features of my application. If I store them in two different column families, both will have similar attributes, like the same comparator type and sorting needs. Thus I could also merge both of them into one column family, just by adding the rows of one to the other (increasing the number of rows). However, some rows of the 1st kind of data are used very frequently and rows of the 2nd kind are used less frequently. But I don't think this will be a problem, as I am not merging two rows into one, just adding them as separate rows in the column family. The 1st kind of data has wider rows and the 2nd kind has much narrower rows. But the caching requirements may be different, as they cater to two different features (though I think this is even advantageous, since resources are free to be utilized by whichever data is more frequently used). Is it recommended to merge these two column families into one? Thoughts? -- Ertio
Re: Merging the rows of two column families (with similar attributes) into one?
Thanks Tyler! I could not fully understand why more column families would mean more memory. If you have parameters like memtable_throughput and memtable_operations under control, which are set on a per-column-family basis, then you can directly adjust them by splitting the memory space between two CFs in proportion to what you would allot to a single CF. Hence there should be no extra memory consumption for multiple CFs that have been split from a single one? Regarding the compactions, I think even if there are more of them, the SSTable files to be compacted are smaller, as the data has been split in two. More compactions, but smaller ones! Then, given the same amount of data, how can a greater number of column families be a bad option (if you split the memory-consumption parameter values proportionately)? -- Regards, Ertio

On Sat, Feb 5, 2011 at 10:43 AM, Tyler Hobbs ty...@datastax.com wrote: "I read somewhere that more column families are not a good idea, as they consume more memory and cause more compactions." This is primarily true, but not in every case. "But the caching requirements may be different, as they cater to two different features." This is a great reason to *not* merge them. Besides the key and row caches, don't forget about the OS buffer cache. "Is it recommended to merge these two column families into one? Thoughts?" No, this sounds like an anti-pattern to me. The overhead from having two separate CFs is not that high. -- Tyler Hobbs Software Engineer, DataStax Maintainer of the pycassa Cassandra Python client library
Re: Merging the rows of two column families (with similar attributes) into one?
Yes, a disadvantage of a larger number of CFs in terms of memory utilization, as I see it, is: if some CF is written less often than other CFs, its memtable consumes space in memory until it is flushed; that memory could have been put to much better use by a CF that is heavily written and read. And if you try to make the flush thresholds smaller, more compactions are needed.

On Sat, Feb 5, 2011 at 11:58 AM, Ertio Lew ertio...@gmail.com wrote: Thanks Tyler! I could not fully understand why more column families would mean more memory. If you have parameters like memtable_throughput and memtable_operations under control, which are set on a per-column-family basis, then you can directly adjust them by splitting the memory space between two CFs in proportion to what you would allot to a single CF. Hence there should be no extra memory consumption for multiple CFs that have been split from a single one? Regarding the compactions, I think even if there are more of them, the SSTable files to be compacted are smaller, as the data has been split in two. More compactions, but smaller ones! Then, given the same amount of data, how can a greater number of column families be a bad option (if you split the memory-consumption parameter values proportionately)? -- Regards, Ertio

On Sat, Feb 5, 2011 at 10:43 AM, Tyler Hobbs ty...@datastax.com wrote: "I read somewhere that more column families are not a good idea, as they consume more memory and cause more compactions." This is primarily true, but not in every case. "But the caching requirements may be different, as they cater to two different features." This is a great reason to *not* merge them. Besides the key and row caches, don't forget about the OS buffer cache. "Is it recommended to merge these two column families into one? Thoughts?" No, this sounds like an anti-pattern to me. The overhead from having two separate CFs is not that high.

-- Tyler Hobbs Software Engineer, DataStax Maintainer of the pycassa Cassandra Python client library
Re: Can the same key exist for two rows in two different column families without clashing?
Thanks Stephen for the great explanation!

On Wed, Feb 2, 2011 at 4:31 PM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: On 2 February 2011 10:03, Ertio Lew ertio...@gmail.com wrote: Can the same key exist for two rows in two different column families without clashing? In other words, does the same algorithm need to be enforced for generating keys across different column families, or can different key-generation algorithms be used on a per-column-family basis? I have found that they can, but I wanted to know if there may be any problems associated with this. Thanks. Ertio Lew

It is a bad analogy for many reasons, but if you replace "row key" with "primary key" and "column family" with "table" then you might get an answer. A better analogy is to think of the following:

public class Keyspace {
    public final Map<String, Map<String, byte[]>> columnFamily1;
    public final Map<String, Map<String, byte[]>> columnFamily2;
    public final Map<String, Map<String, Map<String, byte[]>>> superColumnFamily3;
}

(still not quite correct, but mostly so for our purposes); you are asking, given:

Keyspace keyspace;
String key1 = makeKeyAlg1();
keyspace.columnFamily1.put(key1, ...);
String key2 = makeKeyAlg2();
keyspace.columnFamily2.put(key2, ...);

when key1.equals(key2), is there a problem? They are two separate maps... why would there be? -Stephen
Re: Is it recommended to store two types of data (not related to each other but needing to be retrieved together) in one super column family?
Could someone please point me in the right direction by commenting on the above ideas?

On Fri, Jan 28, 2011 at 11:50 PM, Ertio Lew ertio...@gmail.com wrote: Hi, I have two kinds of data that I would like to fit into one super column family; I am trying this for the sake of fast database retrievals, by combining the data of two rows into just one row. The first kind of data in the supercolumn family uses timeUUIDs as supercolumn names; think of these as the postIds of posts in a group. These posts need to be sorted by time (so that a list of the latest posts can be retrieved). Thus each post has one supercolumn with a name of the form (timeUUID+userID), sorted by TimeUUIDType. The second kind of data would be just a single supercolumn containing columns with the userIds of all members in a group (very small; the number of members in a group will be around 40-50 max). The name of this single supercolumn may be chosen suitably (perhaps a max. time in the future) so as to keep this supercolumn at the beginning. (The supercolumns are required because we need to store some additional data in the columns of the 1st kind of data.) So is it recommended to store these two types of data (not related to each other, but needing to be retrieved together) in one super column family?
Is it recommended to store two types of data (not related to each other but needing to be retrieved together) in one super column family?
Hi, I have two kinds of data that I would like to fit into one super column family; I am trying this for the sake of fast database retrievals, by combining the data of two rows into just one row. The first kind of data in the supercolumn family uses timeUUIDs as supercolumn names; think of these as the postIds of posts in a group. These posts need to be sorted by time (so that a list of the latest posts can be retrieved). Thus each post has one supercolumn with a name of the form (timeUUID+userID), sorted by TimeUUIDType. The second kind of data would be just a single supercolumn containing columns with the userIds of all members in a group (very small; the number of members in a group will be around 40-50 max). The name of this single supercolumn may be chosen suitably (perhaps a max. time in the future) so as to keep this supercolumn at the beginning. (The supercolumns are required because we need to store some additional data in the columns of the 1st kind of data.) So is it recommended to store these two types of data (not related to each other, but needing to be retrieved together) in one super column family?
Re: What is the best possible client option available to a PHP developer for implementing an application ready for production environments ?
I think we might need to go with a full Java implementation in that case, and live with Hector, as we have not found any better option. @Dave: Thanks for the links, but we would prefer not to go with a raw Thrift implementation because of the frequently changing API and other complexities there. Also, we would not like to lock ourselves into an implementation in a language whose client options have limitations that we can bear now but not necessarily in the future. If anybody else has a better solution to this, please let me know. Thank you all. Ertio Lew

On Tue, Jan 18, 2011 at 2:49 PM, Dave Gardner dave.gard...@imagini.net wrote: I can't comment on phpcassa directly, but we use Cassandra plus PHP in production without any difficulties, and we are happy with the performance. Most of the information we needed to get started we found here: https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP This includes details on how to compile the native PHP C extension for Thrift. We use a bespoke client which wraps the Thrift interface. You may be better off with a higher-level client, although when we were starting out there was less of a push away from using Thrift directly. I found using Thrift directly useful because you gain an appreciation for what calls Cassandra actually supports. One potential advantage of using a higher-level client is that it may protect you from the frequent Thrift interface changes which currently seem to accompany every major release. Dave

On Tuesday, 18 January 2011, Tyler Hobbs ty...@riptano.com wrote: 1.) Is it developed to the level needed to support all the necessary features to take full advantage of Cassandra? Yes. It doesn't have some of the niceties of pycassa yet, but you can do everything that Cassandra offers with it. 2.) Is it used in production by anyone? Yes, I've talked to at least a few people who are using it in production. It tends to play a limited role instead of a central one, though. 3.) What are its limitations? Being written in PHP.
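The "bespoke client which wraps the Thrift interface" approach Dave describes can be sketched in a few lines. This is a hypothetical facade, not any real library's API: the application calls a stable method, and the current Thrift call signature is confined to one place, so interface changes between Cassandra releases touch only the wrapper. `ThriftStub` stands in for the generated Thrift client.

```python
# Hypothetical sketch of wrapping a raw Thrift client behind a stable facade.
class ThriftStub:
    """Stand-in for the generated Thrift client; its signature may change
    between releases, which is exactly what the facade absorbs."""
    def get_slice(self, key, parent, predicate, consistency):
        return [("c1", "v1"), ("c2", "v2")]

class CassandraFacade:
    def __init__(self, client):
        self._client = client

    def get_columns(self, key):
        # Translate the stable app-level call into the current Thrift
        # signature; only this method needs editing when Thrift changes.
        pairs = self._client.get_slice(key, parent=None,
                                       predicate=None, consistency=1)
        return dict(pairs)

facade = CassandraFacade(ThriftStub())
assert facade.get_columns("row1") == {"c1": "v1", "c2": "v2"}
```

The trade-off is the one raised in the thread: the wrapper shields application code from API churn, at the cost of maintaining the translation layer yourself rather than relying on a community client.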
Seriously. The lack of universal 64-bit integer support can be problematic if you don't have a fully 64-bit system. PHP is fairly slow, and it makes a few other things less easy to do. If you're doing some pretty lightweight interaction with Cassandra through PHP, these might not be a problem for you. - Tyler
Do you have a site in a production environment with Cassandra? What client do you use?
Hey, if you have a site in a production environment, or are considering one, what client do you use to interact with Cassandra? I know that there are several clients available out there depending on the language you use, but I would love to know which clients are being used widely in production environments and are best to work with (supporting the most required features, with good performance). Also, preferably tell us about the technology stack for your applications. Any suggestions or comments appreciated. Thanks, Ertio
Re: Do you have a site in a production environment with Cassandra? What client do you use?
What is the technology stack that you use?

On 1/14/11, Ran Tavory ran...@gmail.com wrote: I use Hector, if that counts. .. On Jan 14, 2011 7:25 PM, Ertio Lew ertio...@gmail.com wrote: [quoted original message trimmed; it appears in full above]