Re: Cluster per Application vs. Multi-Application Clusters
If you are starting out small, one logical/physical cluster is probably the best and only approach. Long term this is very case-by-case dependent, but I generally believe Cluster per Application is the best approach, although I think of it as Cluster per QOS. For our use cases I find that two applications can have very different data sizes and quality-of-service requirements. For example, one application may have a small dataset and a high repeated-read / cache-hit-rate scenario, while another application may have a large sparse dataset and a random read pattern. Also, one application may demand fast 3 ms reads while the other may find 10 or 20 ms reads acceptable. When those two applications are placed on the same set of hardware you end up scaling them both even though at a given time only one or the other needs to be scaled. In extreme cases application 1 and 2 cause contention and make each other unhappy. What is best to do is architect your systems in such a way that moving an individual column family to a new set of hardware is not difficult. This might involve something like a map/reduce program that can bulk load existing data between two clusters, while your front-end application sends the writes/updates/deletes to both the old and the new cluster. Also make sure your application does not have too many hard-coded touch points that assume a single cluster. As you mentioned, one thing gained from keeping everything in the same keyspace is connection pooling. However, unlike the RDBMS world where coordinated transactions have to happen in order, etc., that is not the case with C*, so getting all data into the same physical system is not as important. On Wed, Aug 22, 2012 at 8:25 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Just an opinion here as we are having to do this ourselves, loading tons of researchers' datasets into one cluster.
We are going the path of one keyspace as it makes it easier if you ever want to mine the data, so you don't have to keep building different clients for another keyspace. We ended up adding our own security layer as well so researchers can expose their datasets to other researchers and, once exposed, other researchers can join that data with their existing data. This of course is just one use case, but if 10 applications use Cassandra, you may still find a benefit in having an 11th data-mining app look at the data from all 10 apps. Later, Dean playOrm Developer From: Ersin Er ersin...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, August 22, 2012 12:44 AM To: user@cassandra.apache.org Subject: Cluster per Application vs. Multi-Application Clusters Hi all, What are the advantages of allocating a cluster for a single application vs running multiple applications on the same Cassandra cluster? Is either of the models suggested over the other? Thanks. -- Ersin Er
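The migration approach described above (bulk-loading historical data while the front end writes to both the old and the new cluster) can be sketched roughly as follows. This is a minimal sketch; `DualWriteClient` and the cluster client objects are hypothetical stand-ins for whatever driver is actually in use:

```python
class DualWriteClient:
    """Sketch of the dual-write migration pattern: every mutation goes to
    both the old and the new cluster, while reads stay on the old cluster
    until the bulk load of historical data has caught up."""

    def __init__(self, old_cluster, new_cluster):
        self.old = old_cluster
        self.new = new_cluster

    def write(self, key, columns):
        # Apply the mutation to both clusters so the new one stays in
        # sync while a map/reduce job copies existing data across.
        self.old.write(key, columns)
        self.new.write(key, columns)

    def read(self, key):
        # Reads are served from the old cluster until cut-over.
        return self.old.read(key)
```

Once the bulk load finishes and the clusters agree, reads flip to the new cluster and the old one can be retired.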
Re: Automating nodetool repair
You can consider adding -pr when iterating through all your hosts like this. -pr means primary range, and will do less duplicated work. On Mon, Aug 27, 2012 at 8:05 PM, Aaron Turner synfina...@gmail.com wrote: I use cron. On one box I just do:

for n in node1 node2 node3 node4 ; do
    nodetool -h $n repair
    sleep 120
done

A lot easier than managing a bunch of individual crontabs IMHO, although I suppose I could have done it with puppet, but then you always have to keep an eye out that your repairs don't overlap over time. On Mon, Aug 27, 2012 at 4:52 PM, Edward Sargisson edward.sargis...@globalrelay.net wrote: Hi all, So nodetool repair has to be run regularly on all nodes. Does anybody have any interesting strategies or tools for doing this or is everybody just setting up cron to do it? For example, one could write some Puppet code to splay the cron times around so that only one should be running at once. Or, perhaps, a central orchestrator that is given some known quiet time and works its way through the list, running nodetool repair one at a time (using RPC?) until it runs out of time. Cheers, Edward -- Edward Sargisson senior java developer Global Relay edward.sargis...@globalrelay.net
-- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin carpe diem quam minimum credula postero
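The rolling-repair loop above, with the -pr suggestion applied, can also be expressed as a small script that builds the per-node nodetool invocations. This is a sketch; the node names are placeholders:

```python
def repair_commands(nodes, primary_range_only=True):
    """Build the nodetool invocation for each node in a rolling repair.
    With -pr each node repairs only its primary range, so iterating over
    every host in the ring does not redo work for every replica."""
    flag = ["-pr"] if primary_range_only else []
    return [["nodetool", "-h", node, "repair"] + flag for node in nodes]

# A cron job would then run each command in turn, sleeping between
# nodes (e.g. subprocess.call(cmd) followed by time.sleep(120)) so
# repairs on different hosts do not overlap.
```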
Re: Advantage of pre-defining column metadata
Setting the metadata will set the validation. If you insert into a column that is supposed to hold only INT values, Cassandra will reject non-INT data at insert time. Also, the comparator can not be changed; you only get one chance to set the column sorting. On Tue, Aug 28, 2012 at 3:34 PM, A J s5a...@gmail.com wrote: For a static column family, what is the advantage in pre-defining column metadata? I can see ease of understanding the type of values that the CF contains and that clients will have incompatible insertions rejected. But are there any major advantages in terms of performance or something else that makes it beneficial to define the metadata upfront? Thanks.
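As a toy model (not Cassandra's actual code path), the effect of pre-defined column metadata is roughly this: the validator attached to a column rejects ill-typed values at insert time. The validator names and the in-memory layout below are illustrative only:

```python
# Toy validators keyed by names resembling Cassandra's marshal types.
VALIDATORS = {
    "Int32Type": lambda v: isinstance(v, int) and -2**31 <= v < 2**31,
    "UTF8Type": lambda v: isinstance(v, str),
}

class ColumnFamily:
    def __init__(self, metadata):
        # metadata: column name -> validator name, fixed at CF creation.
        self.metadata = metadata
        self.rows = {}

    def insert(self, key, column, value):
        # Columns with declared metadata are validated on insert;
        # columns without metadata are accepted as-is.
        validator = self.metadata.get(column)
        if validator is not None and not VALIDATORS[validator](value):
            raise ValueError("invalid value for %s (%s)" % (column, validator))
        self.rows.setdefault(key, {})[column] = value
```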
Re: performance is drastically degraded after 0.7.8 -- 1.0.11 upgrade
If you move from 0.7.X to 0.8.X or 1.0.X you have to rebuild sstables as soon as possible. If you have large bloom filters you can hit a bug where the bloom filters will not work properly. On Thu, Aug 30, 2012 at 9:44 AM, Илья Шипицин chipits...@gmail.com wrote: we are running a somewhat queue-like workload with aggressive write-read patterns. I was looking for a way of scripting queries from a live Cassandra installation, but I didn't find any. Is there something like a thrift-proxy or other query logging/scripting engine? 2012/8/30 aaron morton aa...@thelastpickle.com: "in terms of our high-rate write load cassandra-1.0.11 is about 3 (three!!) times slower than cassandra-0.7.8" We've not had any reports of a performance drop-off. All tests so far have shown improvements in both read and write performance. "I agree, such digests save some network IO, but they seem to be very bad in terms of CPU and disk IO." The sha1 is created so we can diagnose corruptions in the -Data component of the SSTables. It is not used to save network IO. It is calculated while streaming the Memtable to disk, so it has no impact on disk IO. While not the fastest algorithm, I would assume its CPU overhead in this case is minimal. "there's already a relatively small Bloom filter file, which can be used for saving network traffic instead of the sha1 digest." Bloom filters are used to test if a row key may exist in an SSTable. "any explanation ?" If you can provide some more information on your use case we may be able to help. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 30/08/2012, at 5:18 AM, Илья Шипицин chipits...@gmail.com wrote: in terms of our high-rate write load cassandra-1.0.11 is about 3 (three!!) times slower than cassandra-0.7.8. After some investigation I noticed files with a sha1 extension (which are missing for cassandra-0.7.8). In the maybeWriteDigest() function I see no option for switching sha1 digests off.
I agree, such digests save some network IO, but they seem to be very bad in terms of CPU and disk IO. Why use one more digest (which has to be calculated)? There's already a relatively small Bloom filter file, which could be used for saving network traffic instead of the sha1 digest. Any explanation? Ilya Shipitsin
Re: Helenos - web based gui tool
You might want to change the name. There is a node.js driver for Cassandra with the same name. I am not sure which one of you got to the name first. On Thu, Sep 6, 2012 at 8:00 PM, aaron morton aa...@thelastpickle.com wrote: Thanks Tomek, Feel free to add it to http://wiki.apache.org/cassandra/Administration%20Tools Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 5/09/2012, at 9:54 AM, Tomek Kuprowski tomekkuprow...@gmail.com wrote: Dear all, I'm happy to announce the first release of Helenos. This is a web based GUI tool to manage your data stored in Cassandra. Project site: https://github.com/tomekkup/helenos Some screens: https://picasaweb.google.com/tomekkuprowski/Helenos Hope you'll find it useful. I'll be grateful for your comments and opinions. -- Regards! Tomek Kuprowski
Re: cassandra performance looking great...
Try to get Cassandra running the TPC-H benchmarks and beat Oracle :) On Fri, Sep 7, 2012 at 10:01 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So we wrote 1,000,000 rows into Cassandra and ran a simple S-SQL (Scalable SQL) query of PARTITIONS n(:partition) SELECT n FROM TABLE as n WHERE n.numShares = :low and n.pricePerShare = :price It ran in 60ms. So basically playOrm is going to support millions of rows per partition. This is great news. We expect the join performance to be very similar since the trees of pricePerShare and numShares are really no different than the join trees. So, millions of rows per partition and as many partitions as you want, it scales wonderfully…..CASSANDRA ROCKS. Behind the scenes, there is a wide row per partition per index, so the above query behind the scenes has two rows, each with 1,000,000 columns. Later, Dean
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Generally tuning the garbage collector is a waste of time. Just follow someone else's recommendation and use that. The problem with tuning is that workloads change, and then you have to tune again and again. New garbage collectors come out and you have to tune again and again. Someone at your company reads a blog about some new JVM and its awesomeness and you tune again and again; Cassandra adds off-heap caching and you tune again and again. All this work takes a lot of time and usually results in negligible returns. Garbage collectors and tuning are not magic bullets. On Wednesday, September 12, 2012, Peter Schuller peter.schul...@infidyne.com wrote: "Our full gc:s are typically not very frequent. Few days or even weeks in between, depending on cluster." *PER NODE* that is. On a cluster of hundreds of nodes, that's pretty often (and all it takes is a single node). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Haha Ok. It is not a total waste, but practically your time is better spent in other places. The problem is just about everything is a moving target: schema, request rate, hardware. Generally tuning nudges a couple of variables in one direction or the other and you see some decent returns. But each nudge takes a restart and a warm-up period, and with how Cassandra distributes requests you likely have to flip several nodes or all of them before you can see the change! By the time you do that it's probably a different day or week. Essentially finding out if one setting is better than the other is like a 3-day test in production. Before C* I used to deal with this in Tomcat. Once in a while we would get a dev that read some article about tuning, something about a new JVM, or a collector. With bright-eyed enthusiasm they would want to try tuning our current cluster. They would spend a couple of days and measure something and say it was good: lower memory usage. Meanwhile someone else would come to me and say: higher 95th-percentile response time. More short pauses, fewer long pauses, great taste, less filling. Most people just want to roflscale their Heroku cloud. Tuning stuff is sysadmin work, and the cloud has taught us that the cost of sysadmins is a needless waste of money. Just kidding! But I do believe the default Cassandra settings are reasonable, and typically I find that most who look at tuning GC usually need more hardware and actually need to be tuning something somewhere else. G1 is the perfect example of a time suck. It claims low pause latency for big heaps, and delivers something regarded by the Cassandra community (and HBase as well) as working worse than CMS. If you spent 3 hours switching tuning knobs and analysing, that is 3 hours of your life you will never get back.
Better to let Sun and other people worry about tuning (at least from where I sit). On Saturday, September 15, 2012, Peter Schuller peter.schul...@infidyne.com wrote: "Generally tuning the garbage collector is a waste of time." Sorry, that's BS. It can be absolutely critical, when done right, and only useless when done wrong. There's a spectrum in between. "Just follow someone else's recommendation and use that." No, don't. Most recommendations out there are completely useless in the general case because someone did some very specific benchmark under very specific circumstances and then recommends some particular combination of options. In order to understand whether a particular recommendation applies to you, you need to know enough about your use-case that I suspect you're better off just reading up on the available options and figuring things out. Of course, randomly trying various different settings to see which seems to work well may be realistic - but you lose predictability (in the face of changing patterns of traffic, for example) if you don't know why it's behaving like it is. If you care about GC-related behavior you want to understand how the application behaves, how the garbage collector behaves, what your requirements are, and select settings based on those requirements and how the application and GC behavior combine to produce emergent behavior. The best GC options may vary *wildly* depending on the nature of your cluster and your goals. There are also non-GC settings (in the specific case of Cassandra) that affect the interaction with the garbage collector, like whether you're using row/key caching, or things like phi conviction threshold and/or timeouts. It's very hard for anyone to give generalized recommendations. If it weren't, Cassandra would ship with The One True set of settings that are always the best and there would be no discussion.
It's very unfortunate that the state of GC in the freely available JVMs is at this point, given that there exist known and working algorithms (and at least one practical implementation) that avoid it, mostly. But it's the situation we're in. The only way around it that I know of, if you're on Hotspot, is to have the application behave in such a way that it avoids the causes of unpredictable behavior w.r.t. GC by being careful about its memory allocation and *retention* profile. For the specific case of avoiding *ever* seeing a full gc, it gets even more complex. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: any ways to have compaction use less disk space?
If you are using ext3 there is a hard limit of 32K on the number of files in a directory. ext4 has a much higher limit (can't remember exactly). So it is true that having many files is not a problem for the file system, though your VFS cache could be less efficient since you would have a higher inode-to-data ratio. Edward On Mon, Sep 24, 2012 at 7:03 PM, Aaron Turner synfina...@gmail.com wrote: On Mon, Sep 24, 2012 at 10:02 AM, Віталій Тимчишин tiv...@gmail.com wrote: Why so? What are the pluses and minuses? As for me, I am looking at the number of files in a directory. 700GB/512MB*5 (files per SST) = ~7,000 files, that is OK from my view. 700GB/5MB*5 = ~700,000 files, that is too much for a single directory, too much memory used for SST data, too huge a compaction queue (that leads to strange pauses, I suppose because of the compactor thinking about what to compact next), ... Not sure why a lot of files is a problem... modern filesystems deal with that pretty well. Really large sstables mean that compactions now are taking a lot more disk IO and time to complete. Remember, Leveled Compaction is more disk-IO intensive, so using large sstables makes that even worse. This is a big reason why the default is 5MB. Also, each level is 10x the size of the previous level. Also, for level compaction, you need 10x the sstable size worth of free space to do compactions. So now you need 5GB of free disk, vs 50MB of free disk. Also, if you're doing deletes in those CFs, that old, deleted data is going to stick around a LOT longer with 512MB files, because it can't get deleted until you have 10x512MB files to compact to level 2. Heaven forbid it doesn't get deleted then, because each level is 10x bigger, so you end up waiting a LOT longer to actually delete that data from disk. Now, if you're using SSDs then larger sstables are probably doable, but even then I'd guesstimate 50MB is far more reasonable than 512MB.
-Aaron 2012/9/23 Aaron Turner synfina...@gmail.com On Sun, Sep 23, 2012 at 8:18 PM, Віталій Тимчишин tiv...@gmail.com wrote: If you think about space, use Leveled compaction! This won't only allow you to fill more space, but will also shrink your data much faster in case of updates. Size-tiered compaction can give you 3x-4x more space used than there is live data. Consider the following (our simplified) scenario: 1) The data is updated weekly. 2) Each week a large SSTable is written (say, 300GB) after full update processing. 3) In 3 weeks you will have 1.2TB of data in large SSTables. 4) Only after the 4th week will they all be compacted into one 300GB SSTable. Leveled compaction has tamed space for us. Note that you should set sstable_size_in_mb to a reasonably high value (it is 512 for us with ~700GB per node) to prevent creating a lot of small files. 512MB per sstable? Wow, that's freaking huge. From my conversations with various developers 5-10MB seems far more reasonable. I guess it really depends on your usage patterns, but that seems excessive to me - especially as sstables are promoted. -- Best regards, Vitalii Tymchyshyn -- Aaron Turner http://synfin.net/ Twitter: @synfinatic
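The space math in this thread (10x level fan-out, roughly 10x the sstable size kept free for a promotion) can be sketched as:

```python
def level_capacity_bytes(sstable_size_mb, level, fanout=10):
    """Approximate capacity of a level under Leveled Compaction:
    L1 holds ~fanout sstables and each subsequent level is fanout times
    bigger, so level N holds about sstable_size * fanout**N."""
    return sstable_size_mb * 1024 * 1024 * fanout ** level

def free_space_needed_mb(sstable_size_mb, fanout=10):
    """A promotion into the next level can rewrite up to ~fanout sstables
    at once, so keep roughly fanout times the sstable size free."""
    return sstable_size_mb * fanout
```

With the 5MB default this is ~50MB of required headroom; with 512MB sstables it grows to ~5GB, matching the numbers in the thread.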
Re: 1000's of column families
Hector also offers support for 'Virtual Keyspaces' which you might want to look at. On Thu, Sep 27, 2012 at 1:10 PM, Aaron Turner synfina...@gmail.com wrote: On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote: We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data in with a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria AND which returns a CF name, and then they query that CF name, so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where x. Which we can do today. How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device.
Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance But yeah, there isn't a hard limit on the number of CFs, but there is overhead associated with each one, and so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it. -- Aaron Turner http://synfin.net/ Twitter: @synfinatic
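The single-CF alternative above hinges on packing several identifiers into one row key. A minimal sketch of a delimiter-based composite key, assuming the parts never contain the delimiter (Cassandra's real CompositeType uses length-prefixed encoding instead, precisely to avoid that assumption):

```python
SEP = "|"

def composite_row_key(device, stat_type, instance):
    # Pack the three identifiers into a single row key, e.g.
    # "device42|temperature|sensor0".
    return SEP.join((device, stat_type, instance))

def split_row_key(key):
    # Recover the original identifiers from a packed key.
    return tuple(key.split(SEP))
```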
Re: Ball is rolling on High Performance Cassandra Cookbook second edition
Hello all, Work has begun on the second edition! Keep hitting me up with ideas. In particular I am looking for someone who has done work with Flume+Cassandra and Pig+Cassandra. Both of these topics will be covered to some extent in the second edition, but these are two instances in which I could use some help, as I do not have extensive experience with these two combinations. Contact me if you have any other ideas as well. Edward On Tue, Jun 26, 2012 at 5:25 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Hello all, It has not been very long since the first book was published, but several things have been added to Cassandra and a few things have changed. I am putting together a list of changed content, for example features like the old per-column-family memtable flush settings versus the new system with the global variable. My editors have given me the green light to grow the second edition from ~200 pages currently up to 300 pages! This gives us the ability to add more items/sections to the text. Some things were missing from the first edition, such as Hector support. Nate has offered to help me in this area. Please feel free to contact me with any ideas and suggestions of recipes you would like to see in the book. Also get in touch if you want to write a recipe. Several people added content to the first edition and it would be great to see that type of participation again. Thank you, Edward
Re: MBean cassandra.db.CompactionManager TotalBytesCompacted counts backwards
I have not looked at this JMX object in a while; however, the compaction manager can support multiple threads. Also it moves from 0 to filesize each time it has to compact a set of files. That is more useful for showing current progress rather than lifetime history. On Fri, Oct 5, 2012 at 7:27 PM, Bryan Talbot btal...@aeriagames.com wrote: I've recently added compaction rate (in bytes / second) to my monitors for Cassandra and am seeing some odd values. I wasn't expecting the values for TotalBytesCompacted to sometimes decrease from one reading to the next. It seems that the value should be monotonically increasing while a server is running -- obviously it would start again at 0 when the server is restarted or if the counter rolls over (unlikely for a 64 bit long). Below are two samples taken 60 seconds apart: the value decreased by 2,954,369,012 between the two readings. reported_metric=[timestamp:1349476449, status:200, request:[mbean:org.apache.cassandra.db:type=CompactionManager, attribute:TotalBytesCompacted, type:read], value:7548675470069] previous_metric=[timestamp:1349476389, status:200, request:[mbean:org.apache.cassandra.db:type=CompactionManager, attribute:TotalBytesCompacted, type:read], value:7551629839081] I briefly looked at the code for CompactionManager and a few related classes and don't see anyplace that is performing subtraction explicitly; however, there are many additions of signed long values that are not validated and could conceivably contain a negative value, thus causing the totalBytesCompacted to decrease. It's interesting to note that all of the differences I've seen so far are more than the overflow value of a signed 32 bit value. The OS (CentOS 5.7) and Sun Java VM (1.6.0_29) are both 64 bit. JNA is enabled. Is this expected and normal? If so, what is the correct interpretation of this metric? I'm seeing the negative values a few times per hour when reading it once every 60 seconds. -Bryan
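A monitor that derives a rate from TotalBytesCompacted can guard against the backwards jumps observed above by treating a negative delta as unknown rather than reporting a negative rate. A sketch:

```python
def bytes_compacted_delta(previous, current):
    """TotalBytesCompacted can move backwards (multiple compaction
    threads, or a restarted server resetting to 0), so treat a negative
    delta as 'unknown' instead of producing a negative rate."""
    delta = current - previous
    return delta if delta >= 0 else None

def compaction_rate(previous, current, interval_seconds):
    # Bytes/second over the sampling interval, or None when the counter
    # went backwards and the sample should be discarded.
    delta = bytes_compacted_delta(previous, current)
    return None if delta is None else delta / interval_seconds
```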
Re: how to avoid range ghosts?
Read this: http://wiki.apache.org/cassandra/FAQ#range_ghosts Then say this to yourself: http://cn1.kaboodle.com/img/b/0/0/196/4/C1xHoQAAAZZL9w/ghostbusters-logo-i-aint-afraid-of-no-ghost-pinback-button-1.25-pin-badge.jpg?v=1320511953000 On Sun, Oct 7, 2012 at 4:15 AM, Satoshi Yamada bigtvioletb...@yahoo.co.jp wrote: Hi, What is the recommended way to avoid range ghosts when using get_range()? In my case, the order of the keys is not a problem. It seems valid to use a random :start_key in every query, but I'm new to Cassandra and do not know if it's recommended or not. I use Cassandra 1.1.4 and the Ruby client. Range ghosts happen when one process keeps on inserting data while another process does get_range and deletes them. thanks in advance, satoshi
Re: can I have a mix of 32 and 64 bit machines in a cluster?
Java abstracts you from all these problems. One thing to look out for is JVM options do not work across all JVMs. For example if you try to enable https://wikis.oracle.com/display/HotSpotInternals/CompressedOops on a 32bit machine the JVM fails to start. On Tue, Oct 9, 2012 at 1:45 PM, Brian Tarbox tar...@cabotresearch.com wrote: I can't imagine why this would be a problem but I wonder if anyone has experience with running a mix of 32 and 64 bit nodes in a cluster. (I'm not going to do this in production, just trying to make use of the gear I have for my local system). Thanks.
Re: unexpected behaviour on seed nodes when using -Dcassandra.replace_token
Yes. That would be a good jira if it is not already listed. If a node is a seed node, the auto_bootstrap and replace_token settings should trigger a fatal non-start because you're giving C* conflicting directions. Edward On Fri, Oct 19, 2012 at 8:49 AM, Thomas van Neerijnen t...@bossastudios.com wrote: Hi all I recently tried to replace a dead node using -Dcassandra.replace_token=token, which so far has been good to me. However on one of my nodes this option was ignored and the node simply picked a different token to live at and started up there. It was a foolish mistake on my part because it was set as a seed node, which results in this error in the log file: INFO [main] 2012-10-19 12:41:00,886 StorageService.java (line 518) This node will not auto bootstrap because it is configured to be a seed node. but it seems a little scary that this would mean it'll just ignore the fact that you want to replace a token and put itself somewhere else in the cluster. Surely it should behave similarly to trying to replace a live node by throwing some kind of exception?
Java 7 support?
We have been using cassandra and java7 for months. No problems. A key concept of java is portable binaries. There are sometimes wrinkles with upgrades. If you hit one undo the upgrade and restart. On Tuesday, October 23, 2012, Eric Evans eev...@acunu.com wrote: On Tue, Oct 16, 2012 at 7:54 PM, Rob Coli rc...@palominodb.com wrote: On Tue, Oct 16, 2012 at 4:45 PM, Edward Sargisson edward.sargis...@globalrelay.net wrote: The Datastax documentation says that Java 7 is not recommended[1]. However, Java 6 is due to EOL in Feb 2013 so what is the reasoning behind that comment? I've asked this approximate question here a few times, with no official response. The reason I ask is that in addition to Java 7 not being recommended, in Java 7 OpenJDK becomes the reference JVM, and OpenJDK is also not recommended. From other channels, I have conjectured that the current advice on Java 7 is it 'works' but is not as extensively tested (and definitely not as commonly deployed) as Java 6. That sounds about right. The best way to change the status quo would be to use Java 7, report any bugs you find, and share your experiences. -- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: Keeping the record straight for Cassandra Benchmarks...
Yet another benchmark with 100,000,000 rows on EC2 machines probably less powerful than my laptop. The benchmark might as well have run 4 VMware instances on the same desktop. On Thu, Oct 25, 2012 at 7:40 AM, Brian O'Neill b...@alumni.brown.edu wrote: People probably saw... http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html To clarify things take a look at... http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Java 7 support?
I am using the Sun JDK. There are only two issues I have found, both unrelated to Cassandra. 1) DateFormat is more liberal about patterns like yyyyMMDD vs yyyyMMdd. If you write an application with Java 7 the format is forgiving with DD vs dd, yet if you deploy that application to some JDK 1.6 JVMs it fails. 2) Ran into some issues with TimSort(): http://stackoverflow.com/questions/6626437/why-does-my-compare-method-throw-exception-comparison-method-violates-its-gen Again, neither of these manifested in Cassandra, but they did manifest with other applications. On Wed, Oct 24, 2012 at 9:14 PM, Andrey V. Panov panov.a...@gmail.com wrote: Are you using OpenJDK or Oracle JDK? I know Java 7 should be based on OpenJDK since 7, but still not sure. On 25 October 2012 05:42, Edward Capriolo edlinuxg...@gmail.com wrote: We have been using cassandra and java7 for months. No problems. A key concept of java is portable binaries. There are sometimes wrinkles with upgrades. If you hit one, undo the upgrade and restart.
Large results and network round trips
Hello all, Currently we implement wide rows for most of our entities. For example: user { event1=x event2=y event3=z ... } Normally the entries are bounded to fewer than 256 columns and most columns are small in size, say 30 bytes. Because of the blind-write nature of Cassandra it is possible the column family can get much larger. We have very low latency requirements, for example less than 5ms. Considering the network round trip and all other factors, I am wondering what is the largest response that is possible in a 5ms window on a gigabit network. First we have our Thrift limit of 15MB; is it possible even in the best-case scenario to deliver a 15MB response in under 5ms on gigabit Ethernet, for example? Does anyone have any real-world numbers with reference to payload sizes and standard performance? Thanks all, Edward
Re: Large results and network round trips
For this scenario, remove disk speed from the equation. Assume the row is completely in the row cache. Also let's assume Read.ONE. With this information I would be looking to determine response size / maximum requests per second / max latency. I would use this to say: You want to do 5,000 reads/sec on gigabit Ethernet, each row is 10K, in under 5ms latency? Sorry, that is impossible. On Thu, Oct 25, 2012 at 2:58 PM, sankalp kohli kohlisank...@gmail.com wrote: I don't have any sample data on this, but read latency will depend on these: 1) Consistency level of the read 2) Disk speed. Also you can look at the Netflix client, as it makes the coordinator node the same as the node which holds the data. This will reduce one hop. On Thu, Oct 25, 2012 at 9:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Hello all, Currently we implement wide rows for most of our entities. We have very low latency requirements, for example less than 5ms, and considering the network round trip I am wondering what is the largest response possible in a 5ms window on a gigabit network. Thanks all, Edward
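A quick back-of-the-envelope check of the 15MB question: serialization time on the wire alone, ignoring disk, CPU, Thrift framing and TCP round trips, already rules out a 5ms budget:

```python
def wire_transfer_ms(payload_bytes, link_bits_per_second=1_000_000_000):
    """Lower bound on delivery time: how long the payload alone takes to
    serialize onto the link, with no protocol or processing overhead."""
    return payload_bytes * 8 / link_bits_per_second * 1000

# A 15 MB response needs at least ~126 ms on the wire at 1 Gbit/s, so it
# can never fit a 5 ms window. A typical row in the question (256 columns
# of ~30 bytes, ~7.5 KB) serializes in well under 0.1 ms, leaving the
# budget to disk, CPU, and round trips instead.
```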
Re: disable compaction node-wide
If you are using size-tiered compaction, set minCompactionThreshold to 0 and maxCompactionThreshold to 0. You can probably also use this: https://issues.apache.org/jira/browse/CASSANDRA-2130 But if you do not compact, the number of sstables gets high and then read performance can suffer.

On Sat, Oct 27, 2012 at 4:21 PM, Radim Kolar h...@filez.com wrote: Is it possible to disable all sstable compaction node-wide? I can't find anything suitable in the JMX console.
Getting all schema in 1.2.0-beta-1
Using 1.2.0-beta1. I am noticing that there is no longer a single way to get all the schema. It seems like non-compact storage can be seen with show schema, but other tables are not visible. Is this by design, bug, or operator error? http://pastebin.com/PdSDsdTz
How does Cassandra optimize this query?
If we create a column family:

CREATE TABLE videos (
  videoid uuid,
  videoname varchar,
  username varchar,
  description varchar,
  tags varchar,
  upload_date timestamp,
  PRIMARY KEY (videoid, videoname)
);

The CLI views this column family like so:

create column family videos
  with column_type = 'Standard'
  and comparator = 'CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UUIDType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};

[default@videos] list videos;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: b3a76c6b-7c7f-4af6-964f-803a9283c401
=> (column=Now my dog plays piano!:description, value=My dog learned to play the piano because of the cat., timestamp=135205828907)
=> (column=Now my dog plays piano!:tags, value=dogs,piano,lol, timestamp=1352058289070001)
invalid UTF8 bytes 0139794c30c0

SELECT * FROM videos WHERE videoname = 'My funny cat';

 videoid                              | videoname    | description                               | tags           | upload_date          | username
--------------------------------------+--------------+-------------------------------------------+----------------+----------------------+----------
 99051fe9-6a9c-46c2-b949-38ef78858dd0 | My funny cat | My cat likes to play the piano! So funny. | cats,piano,lol | 2012-06-01 08:00:00+ | ctodd

CQL3 allows me to search on the second component of a primary key, which really just seems to be component 1 of a composite column. So what Thrift operation does this correspond to? This looks like a column slice without specifying a key? How does this work internally?
Re: How does Cassandra optimize this query?
I see. It is fairly misleading because it is a query that does not work at scale. This syntax is only helpful if you have less than a few thousand rows in Cassandra.

On Mon, Nov 5, 2012 at 12:24 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Mon, Nov 5, 2012 at 4:12 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Is this query the equivalent of a full table scan? Without a starting point get_range_slice is just starting at token 0? It is, but that's what you asked for after all. If you want to start at a given token you can do: SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(videoid) > 'whatevertokenyouwant' You can even do: SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(videoid) > token(99051fe9-6a9c-46c2-b949-38ef78858dd0) if that's simpler for you than computing the token manually. Though that is mostly for random partitioners. For ordered ones, you can do without the token() part. -- Sylvain
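The token()-based paging Sylvain describes can be sketched outside Cassandra. Assuming RandomPartitioner (the 1.1-era default), where a key's token is derived from its MD5 hash, paging walks the ring in token order rather than key order:

```python
import hashlib

def token(key: bytes) -> int:
    # RandomPartitioner derives the token from the MD5 hash of the key
    # (a sketch of the idea, not the exact production token math).
    return int.from_bytes(hashlib.md5(key).digest(), "big")

keys = [b"alpha", b"bravo", b"charlie", b"delta"]
ring = sorted(keys, key=token)  # keys laid out in token order, like the ring

# Page through two at a time; the next page starts after the last seen token,
# exactly what "AND token(videoid) > token(...)" expresses in CQL.
page1 = ring[:2]
page2 = [k for k in ring if token(k) > token(page1[-1])]
assert page1 + page2 == ring
```

The point is that "next page" is defined by token position, not by a numeric offset, so it stays correct even as rows are inserted elsewhere in the ring.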
Re: triggers(newbie)
There are no built-in triggers. Someone has written an aspect-oriented piece to do triggers outside of the project: http://brianoneill.blogspot.com/2012/03/cassandra-triggers-for-indexing-and.html

On Mon, Nov 5, 2012 at 12:30 PM, davuk...@veleri.hr wrote: Hello! I was wondering if someone could help me a bit with triggers in Cassandra. I am doing a school project with this DBMS, and I would be very happy if you could send me a simple example/explanation of a trigger. Thank you!! :)
Re: How does Cassandra optimize this query?
"A remark like 'maybe we just shouldn't allow that and leave that to the map-reduce side' would make sense, but I don't see how this is misleading."

Yes. Bingo. It is misleading because it is not useful in any other context besides someone playing around with a ten-row table in cqlsh. CQL stops me from executing some queries that are not efficient, yet it allows this one. If I am new to Cassandra and developing, this query works and produces a result, then once my database gets real data it produces a different result (likely an empty one). When I first saw this query two things came to my mind: 1) CQL (and Cassandra) must somehow be indexing all the fields of a primary key to make this search optimal. 2) This is impossible; CQL must be gathering the first hundred random rows and finding this thing. What is happening is case #2. In a nutshell CQL is just sampling some data and running the query on it. We could support all types of query constructs if we just take the first 100 rows and apply this logic to them, but these things are not helpful for anything but light ad-hoc data exploration. My suggestions: 1) force people to supply a LIMIT clause on any query that is going to page over get_range_slice 2) have some type of EXPLAIN support so I can establish if this query will work at scale. I say this because as an end user I do not understand if a given query is actually going to return the same results with different data.

On Mon, Nov 5, 2012 at 1:40 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Mon, Nov 5, 2012 at 6:55 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I see. It is fairly misleading because it is a query that does not work at scale. This syntax is only helpful if you have less than a few thousand rows in Cassandra. Just for the sake of argument, how is that misleading? If you have billions of rows and do the select statement from your initial mail, what did the syntax lead you to believe it would return?
A remark like "maybe we just shouldn't allow that and leave that to the map-reduce side" would make sense, but I don't see how this is misleading. But again, this translates directly to a get_range_slice (which doesn't scale if you have billions of rows and don't limit the output either), so there is nothing new here.
Re: Multiple keyspaces vs Multiple CFs
It is better to have one keyspace unless you need to replicate the keyspaces differently. The main reason for this is that changing keyspaces requires an RPC operation. Having 10 keyspaces would mean having 10 connection pools.

On Thu, Nov 8, 2012 at 4:59 PM, sankalp kohli kohlisank...@gmail.com wrote: Is it better to have 10 keyspaces with 10 CFs in each keyspace, or 100 keyspaces with 1 CF each? I am talking in terms of memory footprint. Also I would be interested to know how much better one is over the other. Thanks, Sankalp
Re: Multiple keyspaces vs Multiple CFs
Any connection pool. Imagine if you have 10 column families in 10 keyspaces. You pull a connection off the pool and the odds are 1 in 10 of it being connected to the keyspace you want. So 9 out of 10 times you have to have a network round trip just to change the keyspace, or you have to build a keyspace aware connection pool. Edward On Thu, Nov 8, 2012 at 5:36 PM, sankalp kohli kohlisank...@gmail.com wrote: Which connection pool are you talking about? On Thu, Nov 8, 2012 at 2:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote: it is better to have one keyspace unless you need to replicate the keyspaces differently. The main reason for this is that changing keyspaces requires an RPC operation. Having 10 keyspaces would mean having 10 connection pools. On Thu, Nov 8, 2012 at 4:59 PM, sankalp kohli kohlisank...@gmail.com wrote: Is it better to have 10 Keyspaces with 10 CF in each keyspace. or 100 keyspaces with 1 CF each. I am talking in terms of memory footprint. Also I would be interested to know how much better one is over other. Thanks, Sankalp
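The 1-in-10 odds described above can be illustrated with a toy simulation (an illustration only, not any real client's pooling code): each pooled connection is pinned to whatever keyspace it last used, so most checkouts need an extra set_keyspace round trip.

```python
import random

random.seed(0)
keyspaces = [f"ks{i}" for i in range(10)]

# A pool of 1000 connections, each "stuck" on the keyspace it last served.
pool = [random.choice(keyspaces) for _ in range(1000)]

want = "ks0"
switches = sum(1 for ks in pool if ks != want)
switch_rate = switches / len(pool)

# With 10 keyspaces, roughly 9 out of 10 checkouts would need a
# set_keyspace round trip before the real request can be sent.
assert switch_rate > 0.8
```

A keyspace-aware pool avoids this by bucketing connections per keyspace, at the cost of holding more open sockets, which is the trade-off discussed later in this thread.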
Re: Multiple keyspaces vs Multiple CFs
In the old days the API looked like this:

client.insert("Keyspace1", "key_user_id", new ColumnPath("Standard1", null, "name".getBytes("UTF-8")), "Chris Goffinet".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE);

but now it works like this:

/* pay attention to this below */
client.set_keyspace("keyspace1");
/* pay attention to this above */
client.insert("key_user_id", new ColumnPath("Standard1", null, "name".getBytes("UTF-8")), "Chris Goffinet".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE);

So each time you switch keyspaces you make a network round trip.

On Thu, Nov 8, 2012 at 6:17 PM, sankalp kohli kohlisank...@gmail.com wrote: I am a bit confused. One connection pool I know is the one which MessageService has to other nodes. Then there will be incoming connections via thrift from clients. How are they affected by multiple keyspaces? On Thu, Nov 8, 2012 at 3:14 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Any connection pool. Imagine if you have 10 column families in 10 keyspaces. You pull a connection off the pool and the odds are 1 in 10 of it being connected to the keyspace you want. So 9 out of 10 times you have to have a network round trip just to change the keyspace, or you have to build a keyspace aware connection pool. Edward On Thu, Nov 8, 2012 at 5:36 PM, sankalp kohli kohlisank...@gmail.com wrote: Which connection pool are you talking about? On Thu, Nov 8, 2012 at 2:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote: it is better to have one keyspace unless you need to replicate the keyspaces differently. The main reason for this is that changing keyspaces requires an RPC operation. Having 10 keyspaces would mean having 10 connection pools. On Thu, Nov 8, 2012 at 4:59 PM, sankalp kohli kohlisank...@gmail.com wrote: Is it better to have 10 Keyspaces with 10 CF in each keyspace. or 100 keyspaces with 1 CF each. I am talking in terms of memory footprint. Also I would be interested to know how much better one is over other. Thanks, Sankalp
Re: Multiple keyspaces vs Multiple CFs
It is not as bad with Hector, but still each Keyspace object is another socket open to Cassandra. If you have 500 webservers and 10 keyspaces, instead of having 500 connections you now have 5,000.

On Thu, Nov 8, 2012 at 6:35 PM, sankalp kohli kohlisank...@gmail.com wrote: I think this code is from the thrift part. I use hector. In hector, I can create multiple keyspace objects for each keyspace and use them when I want to talk to that keyspace. Why will it need to do a round trip to the server for each switch. On Thu, Nov 8, 2012 at 3:28 PM, Edward Capriolo edlinuxg...@gmail.com wrote: In the old days the API looked like this: client.insert("Keyspace1", "key_user_id", new ColumnPath("Standard1", null, "name".getBytes("UTF-8")), "Chris Goffinet".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE); but now it works like this: /* pay attention to this below */ client.set_keyspace("keyspace1"); /* pay attention to this above */ client.insert("key_user_id", new ColumnPath("Standard1", null, "name".getBytes("UTF-8")), "Chris Goffinet".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE); So each time you switch keyspaces you make a network round trip. On Thu, Nov 8, 2012 at 6:17 PM, sankalp kohli kohlisank...@gmail.com wrote: I am a bit confused. One connection pool I know is the one which MessageService has to other nodes. Then there will be incoming connections via thrift from clients. How are they affected by multiple keyspaces? On Thu, Nov 8, 2012 at 3:14 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Any connection pool. Imagine if you have 10 column families in 10 keyspaces. You pull a connection off the pool and the odds are 1 in 10 of it being connected to the keyspace you want. So 9 out of 10 times you have to have a network round trip just to change the keyspace, or you have to build a keyspace aware connection pool. Edward On Thu, Nov 8, 2012 at 5:36 PM, sankalp kohli kohlisank...@gmail.com wrote: Which connection pool are you talking about?
On Thu, Nov 8, 2012 at 2:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote: it is better to have one keyspace unless you need to replicate the keyspaces differently. The main reason for this is that changing keyspaces requires an RPC operation. Having 10 keyspaces would mean having 10 connection pools. On Thu, Nov 8, 2012 at 4:59 PM, sankalp kohli kohlisank...@gmail.com wrote: Is it better to have 10 Keyspaces with 10 CF in each keyspace. or 100 keyspaces with 1 CF each. I am talking in terms of memory footprint. Also I would be interested to know how much better one is over other. Thanks, Sankalp
Re: Retrieve Multiple CFs from Range Slice
HBase is different in this regard. A table is comprised of multiple column families, and they can be scanned at once. However, last time I checked, scanning a table with two column families still means separate seeks across the different column families' store files. A similar thing can be accomplished in Cassandra by issuing two range scans (possibly executing them asynchronously in two threads). I am sure someone will correct me if I am mistaken.

On Fri, Nov 9, 2012 at 11:46 PM, Chris Larsen clar...@euphoriaaudio.com wrote: Hi! Is there a way to retrieve the columns for all column families on a given row while fetching range slices? My keyspace has two column families and when I'm scanning over the rows, I'd like to be able to fetch the columns in both CFs while iterating over the keys so as to avoid having to run two scan operations. When I set the CF to an empty string, ala ColumnParent.setColumn_family(), it throws an error "non-empty columnfamily is required". (Using the Thrift API directly from JAVA on Cass 1.1.6) My HBase scans can return both CFs per row so it works nicely. Thanks!
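The "two range scans in two threads" suggestion can be sketched like this. The `scan_cf` function here is a hypothetical stand-in for a per-column-family range scan (in practice that would be a Thrift get_range_slices call against each CF); the point is only the concurrency pattern:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a per-CF range scan; a real client would issue
# get_range_slices against each column family here.
FAKE_STORE = {"cf1": ["row1", "row2"], "cf2": ["row1", "row3"]}

def scan_cf(cf_name):
    return FAKE_STORE[cf_name]

# Issue both scans concurrently so the slower one does not serialize the faster.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {cf: pool.submit(scan_cf, cf) for cf in ("cf1", "cf2")}
    results = {cf: f.result() for cf, f in futures.items()}

assert results == {"cf1": ["row1", "row2"], "cf2": ["row1", "row3"]}
```

The caller can then merge the two result streams by row key while iterating, which approximates HBase's multi-CF scan from the client side.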
Re: leveled compaction and tombstoned data
No, it does not exist. Rob and I might start a donation page and give the money to whoever is willing to code it. If someone would write a tool that would split an sstable into 4 smaller sstables (even an offline command line tool) I would paypal them a hundo. On Sat, Nov 10, 2012 at 1:10 PM, Aaron Turner synfina...@gmail.com wrote: Nope. I think at least once a week I hear someone suggest one way to solve their problem is to write an sstablesplit tool. I'm pretty sure that: Step 1. Write sstablesplit Step 2. ??? Step 3. Profit! On Sat, Nov 10, 2012 at 9:40 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: @Rob Coli Does the sstablesplit function exist somewhere? 2012/11/10 Jim Cistaro jcist...@netflix.com For some of our clusters, we have taken the periodic major compaction route. There are a few things to consider: 1) Once you start major compacting, depending on data size, you may be committed to doing it periodically because you create one big file that will take forever to naturally compact against 3 like-sized files. 2) If you rely heavily on file cache (rather than large row caches), each major compaction effectively invalidates the entire file cache because everything is written to one new large file. -- Jim Cistaro On 11/9/12 11:27 AM, Rob Coli rc...@palominodb.com wrote: On Thu, Nov 8, 2012 at 10:12 AM, B. Todd Burruss bto...@gmail.com wrote: my question is would leveled compaction help to get rid of the tombstoned data faster than size tiered, and therefore reduce the disk space usage? You could also... 1) run a major compaction 2) code up sstablesplit 3) profit! This method incurs a management penalty if not automated, but is otherwise the most efficient way to deal with tombstones and obsolete data.
:D =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin carpe diem quam minimum credula postero
Re: [BETA RELEASE] Apache Cassandra 1.2.0-beta2 released
Just a note for all: the default partitioner is no longer RandomPartitioner. It is now Murmur3Partitioner, and the token range starts in negative numbers. So you don't choose tokens like your father taught you anymore.

On Friday, November 9, 2012, Sylvain Lebresne sylv...@datastax.com wrote: The Cassandra team is pleased to announce the release of the second beta for the future Apache Cassandra 1.2.0. Let me first stress that this is beta software and as such is *not* ready for production use. This release is still beta so is likely not bug free. However, lots have been fixed since beta1 and if everything goes right, we are hopeful that a first release candidate may follow shortly. Please do help testing this beta to help make that happen. If you encounter any problem during your testing, please report[3,4] them. And be sure to take a look at the change log[1] and the release notes[2] to see where Cassandra 1.2 differs from the previous series. Apache Cassandra 1.2.0-beta2[5] is available as usual from the cassandra website (http://cassandra.apache.org/download/) and a debian package is available using the 12x branch (see http://wiki.apache.org/cassandra/DebianPackaging). Thank you for your help in testing and have fun with it. [1]: http://goo.gl/wnDAV (CHANGES.txt) [2]: http://goo.gl/CBsqs (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA [4]: user@cassandra.apache.org [5]: http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-1.2.0-beta2
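For context on "the token range starts in negative numbers": Murmur3Partitioner tokens span the signed 64-bit range, -2^63 to 2^63-1, so evenly spaced initial tokens are computed differently than the old 0..2^127 RandomPartitioner math. A sketch of the even-spacing calculation for a small cluster:

```python
# Evenly spaced initial tokens over Murmur3Partitioner's signed 64-bit range,
# e.g. for a hypothetical 4-node cluster.
num_nodes = 4
RANGE_MIN, RANGE_SIZE = -(2**63), 2**64

tokens = [RANGE_MIN + i * (RANGE_SIZE // num_nodes) for i in range(num_nodes)]

assert tokens[0] == -9223372036854775808       # the first token is negative
assert all(-(2**63) <= t < 2**63 for t in tokens)
assert len(tokens) == num_nodes
```

The same formula with `0` and `2**127` reproduces the old RandomPartitioner token assignment, which is the "like your father taught you" method this note is retiring.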
Re: CREATE COLUMNFAMILY
If you supply metadata, Cassandra can use it for several things:
1) It validates data on insertion.
2) It helps display the information in human-readable formats in tools like the CLI and sstable2json.
3) If you add a built-in secondary index, the type information is needed; strings sort differently than integers.
4) Columns in rows are sorted by the column name; strings sort differently than integers.

On Sat, Nov 10, 2012 at 11:55 PM, Kevin Burton rkevinbur...@charter.net wrote: I am sure this has been asked before, but what is the purpose of entering key/value (or more correctly key name/data type) values on the CREATE COLUMNFAMILY command?
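The "strings sort differently than integers" point is easy to demonstrate, and it is exactly why the comparator type matters for column ordering:

```python
# The same values ordered as strings vs. as integers give different results:
# lexicographically "10" < "2" < "9", numerically 2 < 9 < 10.
as_strings = sorted(["9", "10", "2"])
as_ints = sorted([9, 10, 2])

assert as_strings == ["10", "2", "9"]
assert as_ints == [2, 9, 10]
```

If a column family's comparator is UTF8Type, slice queries over numeric-looking column names return them in the first (lexicographic) order, which is rarely what you want for numbers.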
Re: removing SSTABLEs
If you shut down C* and remove an sstable (and its associated data, index, bloom filter, etc. files) it is safe. I would delete any saved caches as well. It is safe in the sense that Cassandra will start up with no issues, but you could be missing some data.

On Sun, Nov 11, 2012 at 11:09 PM, B. Todd Burruss bto...@gmail.com wrote: if i stop a node and remove an SSTABLE, let's call it X, is that safe? ok, more info. i know that the data in SSTABLE X has been tombstoned but the tomstones are in SSTABLE Y. i want to simply delete X and get rid of the data. how do i know this .. i did a major compaction a while back and the SSTABLE is so large it has not yet been compacted. we delete data daily and only keep 7 days of data. the SSTABLE is almost 30 days old. whattayathink?
Re: removing SSTABLEs
Because you did a major compaction, that table is larger than all the rest. So it will never go away until you have 3 other tables of about that size or you run a major compaction again. You should vote on the ticket: https://issues.apache.org/jira/browse/CASSANDRA-4766

On Mon, Nov 12, 2012 at 11:51 AM, Jason Wee peich...@gmail.com wrote: The existence of sstable X will give an impact to the system or cluster? when the compaction threshold is reach, the sstable x and sstable y will be compacted. it's more like the system responsibility than human intervention. On Mon, Nov 12, 2012 at 12:09 PM, B. Todd Burruss bto...@gmail.com wrote: if i stop a node and remove an SSTABLE, let's call it X, is that safe? ok, more info. i know that the data in SSTABLE X has been tombstoned but the tomstones are in SSTABLE Y. i want to simply delete X and get rid of the data. how do i know this .. i did a major compaction a while back and the SSTABLE is so large it has not yet been compacted. we delete data daily and only keep 7 days of data. the SSTABLE is almost 30 days old. whattayathink?
Re: unable to read saved rowcache from disk
Yes, the saved row cache could be incorrect, so on startup Cassandra verifies it by re-reading the rows. It takes a long time, so do not save a big row cache.

On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: I have a rowcache provided by SerializingCacheProvider. The data that has been read into it is about 500MB, as claimed by jconsole. After saving cache, it is around 15MB on disk. Hence, I suppose the size from jconsole is before serializing. Now while restarting Cassandra, it's unable to read saved rowcache back. By unable, I mean around 4 hours and I have to abort it and remove cache so as not to suspend other tasks. Since the data aren't huge, why Cassandra can't read it back? My Cassandra is 1.2.0-beta2.
Re: Read during digest mismatch
I think the code base does not benefit from having too many different read code paths. Logically what you're suggesting is reasonable, but you have to consider the case of one replica being slow to respond. Then what?

On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: If consistency is two, don't we just send data request to one and digest request to another? On Mon, Nov 12, 2012 at 2:49 AM, Jonathan Ellis jbel...@gmail.com wrote: Correct. Which is one reason there is a separate setting for cross-datacenter read repair, by the way. On Thu, Nov 8, 2012 at 4:43 PM, sankalp kohli kohlisank...@gmail.com wrote: Hi, Lets say I am reading with consistency TWO and my replication is 3. The read is eligible for global read repair. It will send a request to get data from one node and a digest request to two. If there is a digest mismatch, what I am reading from the code looks like it will get the data from all three nodes and do a resolve of the data before returning to the client. Is it correct or am I reading the code wrong? Also if this is correct, looks like if the third node is in another DC, the read will slow down even when the consistency was TWO? Thanks, Sankalp -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
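The digest-mismatch flow being discussed can be sketched in a few lines. This is an illustration of the logic (not Cassandra's actual read path): one replica returns data, another returns only a digest, and on mismatch all replicas are read and resolved by timestamp:

```python
import hashlib

# (value, timestamp) held by each of three replicas; one has a newer write.
replicas = [("old", 10), ("new", 20), ("old", 10)]

def digest(value):
    return hashlib.md5(value.encode()).hexdigest()

data = replicas[0][0]              # full data from the first replica
peer_digest = digest(replicas[1][0])  # only a digest from the second

if digest(data) != peer_digest:
    # Digest mismatch: fetch from all replicas and resolve last-write-wins.
    data = max(replicas, key=lambda r: r[1])[0]

assert data == "new"
```

Manu's question amounts to: on mismatch, why contact all three replicas instead of just the two already consulted? Edward's reply is that the extra code path isn't worth it, especially once you must handle a replica that never answers.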
Re: unable to read saved rowcache from disk
http://wiki.apache.org/cassandra/LargeDataSetConsiderations "A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks)." Assuming a row cache of 15MB and an average row of 300 bytes, that could be 50,000 entries. 4 hours seems like a long time to read back 50K entries, unless the source table was very large and you can only do a small number of reads/sec.

On Tue, Nov 13, 2012 at 9:47 PM, Manu Zhang owenzhang1...@gmail.com wrote: incorrect... what do you mean? I think it's only 15MB, which is not big. On Wed, Nov 14, 2012 at 10:38 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Yes the row cache could be incorrect so on startup cassandra verifies the saved row cache by re-reading. It takes a long time so do not save a big row cache. On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: I have a rowcache provided by SerializingCacheProvider. The data that has been read into it is about 500MB, as claimed by jconsole. After saving cache, it is around 15MB on disk. Hence, I suppose the size from jconsole is before serializing. Now while restarting Cassandra, it's unable to read saved rowcache back. By unable, I mean around 4 hours and I have to abort it and remove cache so as not to suspend other tasks. Since the data aren't huge, why Cassandra can't read it back? My Cassandra is 1.2.0-beta2.
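Edward's 50,000-entry estimate is rough arithmetic on the numbers from the thread (it treats the 15MB saved file as if it held whole 300-byte rows, which is a deliberate simplification since the file actually holds only keys):

```python
saved_cache_bytes = 15 * 1024 * 1024  # the 15MB saved cache file from the thread
avg_row_bytes = 300                   # assumed average row size

entries = saved_cache_bytes // avg_row_bytes

# On the order of 50K rows to pre-fetch at startup; at even 100 seek-bound
# reads/sec that is under 10 minutes, making the reported 4 hours suspicious.
assert 50_000 < entries < 55_000
```

The point of the estimate is the mismatch: if warming 50K rows takes 4 hours, either the cache holds far more entries than 15MB suggests, or each pre-fetch read is pathologically slow.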
Re: Offsets and Range Queries
There are several reasons. First, there is no absolute offset: the rows are sorted by the data, so if someone inserts new data between your previous query and this query, the rows have changed. Unless you are doing select queries inside a transaction with repeatable read, and your database supports this, the query you mention does not really have absolute offsets either; the results can change between reads. In Cassandra we do not execute large queries (that might result in temp tables or whatever) and then allow you to page through them. Slices have a fixed size; this ensures that the query does not execute for arbitrary lengths of time.

On Thu, Nov 15, 2012 at 6:39 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Usually we do a SELECT * FROM ORDER BY LIMIT 26,25 for pagination purposes, but specifying an offset is not available for range queries in Cassandra. I always have to specify a start-key to achieve this. Are there reasons for choosing such an approach rather than providing an absolute offset? -- Ravi
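The start-key paging model Ravi describes can be sketched as follows; the data and helper are purely illustrative. Instead of a numeric OFFSET, each page begins just after the last key of the previous page, so a page boundary stays meaningful even as rows are inserted or deleted around it:

```python
# A toy sorted "table": ten rows keyed key000..key009.
rows = {f"key{i:03d}": i for i in range(10)}

def page(start_key, limit):
    """Return up to `limit` keys strictly after start_key (None = from start)."""
    keys = sorted(rows)
    if start_key is not None:
        keys = [k for k in keys if k > start_key]
    return keys[:limit]

p1 = page(None, 4)
p2 = page(p1[-1], 4)   # resume from the last key of the previous page

assert p1 == ["key000", "key001", "key002", "key003"]
assert p2 == ["key004", "key005", "key006", "key007"]
```

With OFFSET-based paging, a row inserted before the boundary would shift every later page; with start-key paging, only the pages containing the new row change.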
Re: unable to read saved rowcache from disk
If the startup is taking a long time or not working and you believe it to be corrupt in some way, it is safe to delete the saved cache files. If you think the process is taking longer than it should, you could try attaching a debugger to the process. I try to avoid the row cache these days; even with cache auto-tuning (which I am not using), one really wide row can cause issues. I like letting the OS disk cache do its thing.

On Thu, Nov 15, 2012 at 2:20 AM, Wz1975 wz1...@yahoo.com wrote: Before shut down, you saw rowcache has 500m, 1.6m rows, each row average 300B, so 700k row should be a little over 200m, unless it is reading more, maybe tombstone? Or the rows on disk have grown for some reason, but row cache was not updated? Could be something else eats up the memory. You may profile memory and see who consumes the memory. Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: 3G, other jvm parameters are unchanged. On Thu, Nov 15, 2012 at 2:40 PM, Wz1975 wz1...@yahoo.com wrote: How big is your heap? Did you change the jvm parameter? Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: add a counter and print out myself On Thu, Nov 15, 2012 at 1:51 PM, Wz1975 wz1...@yahoo.com wrote: Curious where did you see this? Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: OOM at deserializing 747321th row On Thu, Nov 15, 2012 at 9:08 AM, Manu Zhang owenzhang1...@gmail.com wrote: oh, as for the number of rows, it's 165. How long would you expect it to be read back?
On Thu, Nov 15, 2012 at 3:57 AM, Wei Zhu wz1...@yahoo.com wrote: Good information Edward. For my case, we have good size of RAM (76G) and the heap is 8G. So I set the row cache to be 800M as recommended. Our column is kind of big, so the hit ratio for row cache is around 20%, so according to datastax, might just turn the row cache altogether. Anyway, for restart, it took about 2 minutes to load the row cache INFO [main] 2012-11-14 11:43:29,810 AutoSavingCache.java (line 108) reading saved cache /var/lib/cassandra/saved_caches/XXX-f2-RowCache INFO [main] 2012-11-14 11:45:12,612 ColumnFamilyStore.java (line 451) completed loading (102801 ms; 21125 keys) row cache for XXX.f2 Just for comparison, our key is long, the disk usage for row cache is 253K. (it only stores key when row cache is saved to disk, so 253KB/ 8bytes = 31625 number of keys). It's about right... So for 15MB, there could be a lot of narrow rows. (if the key is Long, could be more than 1M rows) Thanks. -Wei From: Edward Capriolo edlinuxg...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, November 13, 2012 11:13 PM Subject: Re: unable to read saved rowcache from disk http://wiki.apache.org/cassandra/LargeDataSetConsiderations A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks) Assuming a row cache 15MB and the average row is 300 bytes, that could be 50,000 entries. 4 hours seems like a long time to read back 50K entries. Unless the source table was very large and you can only do a small number / reads/sec. On Tue, Nov 13, 2012 at 9:47 PM, Manu Zhang owenzhang1...@gmail.com wrote: incorrect... 
what do you mean? I think it's only 15MB, which is not big. On Wed, Nov 14, 2012 at 10:38 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Yes the row cache could be incorrect so on startup cassandra verify they saved row cache by re reading. It takes a long time so do not save a big row cache. On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: I have a rowcache provieded by SerializingCacheProvider. The data that has been read into it is about 500MB, as claimed by jconsole. After saving cache, it is around 15MB on disk. Hence, I suppose the size from jconsole is before serializing. Now while restarting Cassandra, it's unable to read saved rowcache back. By unable, I mean around 4 hours and I have to abort it and remove cache so as not to suspend other tasks. Since
Re: Question regarding the need to run nodetool repair
On Thursday, November 15, 2012, Dwight Smith dwight.sm...@genesyslab.com wrote: I have a 4 node cluster, version 1.1.2, replication factor of 4, read/write consistency of 3, level compaction. Several questions.

1) Should nodetool repair be run regularly to assure it has completed before gc_grace? If it is not run, what are the exposures?

Yes. Lost tombstones could cause deleted data to reappear.

2) If a node goes down, and is brought back up prior to the 1 hour hinted handoff expiration, should repair be run immediately?

If the node is brought up prior to 1 hour, you should let the hints replay. Repair is always safe to run.

3) If the hinted handoff has expired, the plan is to remove the node and start a fresh node in its place. Does this approach cause problems?

You only need to join a fresh node if the node was down longer than gc_grace. Default is 10 days.

Thanks

If you read and write at quorum and run repair regularly you can worry less about the things above because they are essentially non-factors.
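The relationship between the repair schedule and gc_grace in answer 1) reduces to a simple inequality: every node must be repaired at least once inside each gc_grace window, or tombstones can be purged on some replicas before others learn of the delete.

```python
gc_grace_seconds = 864000      # Cassandra's default gc_grace: 10 days
repair_interval_days = 7       # a common operational choice, not a mandated value

# The repair cycle must complete comfortably inside the gc_grace window;
# otherwise purged tombstones let deleted data "resurrect" via un-repaired replicas.
assert repair_interval_days * 86400 < gc_grace_seconds
```

A weekly repair against the default 10-day gc_grace leaves a 3-day margin for repairs that run long or need to be rescheduled.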
Re: Admin for cassandra?
We should build an eclipse plugin named Eclipsandra or something. On Thu, Nov 15, 2012 at 9:45 PM, Wz1975 wz1...@yahoo.com wrote: Cqlsh is probably the closest you will get. Or pay big bucks to hire someone to develop one for you:) Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Admin for cassandra? From: Kevin Burton rkevinbur...@charter.net To: user@cassandra.apache.org CC: Is there an IDE for a Cassandra database? Similar to the SQL Server Management Studio for SQL server. I mainly want to execute queries and see the results. Preferably that runs under a Windows OS. Thank you.
Re: Collections, query for contains?
This was my first question after I got the inserts working. Hive has UDFs like array_contains. It also has LATERAL VIEW syntax that is similar to a transpose.

On Monday, November 19, 2012, Timmy Turner timm.t...@gmail.com wrote: Is there no option to query for the contents of a collection? Something like select * from cf where c_list contains('some_value') or select * from cf where c_map contains('some_key') or select * from cf where c_map['some_key'] contains('some_value')
Re: SchemaDisagreementException
even if you made the calls through cql you would have the same issue since cql uses thrift. 1.2.0 is supposed to be nicer with concurrent modifications. On Monday, November 19, 2012, Everton Lima peitin.inu...@gmail.com wrote: I was using cassandra directly because it has more performance than using CQL. Therefore, I am using cassandra because of replication factor and consistency of data. I am using it as a lib of my app. I only make simple queries, just using a key to point to the data. 2012/11/16 Everton Lima peitin.inu...@gmail.com I do that because I need to create dynamic column families. I create 2 keyspaces at the start of the application, using an embedded cassandra instance too, but it never throws an exception. And then, I insert dynamic column families in these 2 keyspaces. I put a Thread.sleep(3000); in the middle of the column family creation code. int waitTime = 3000; logger.info("Waiting " + (waitTime/1000) + "s for synchronizing ..."); Thread.sleep(waitTime); CassandraHelper.createColumnFamily(CassandraHelper.KEYSPACE, layer); logger.info("Waiting " + (waitTime/1000) + "s for synchronizing ..."); Thread.sleep(waitTime); I do that because in the code of CassandraStress, after creating a column family, it does that too. Is it a wrong or a good solution? Any other idea? 2012/11/14 aaron morton aa...@thelastpickle.com Out of interest, why are you creating column families by making direct calls on an embedded cassandra instance? I would guess your life would be easier if you defined a schema in CQL or CLI. I already read in the documentation that this error occurs when more than one thread/processor accesses the same place in Cassandra, but I think this is not occurring. How many nodes do you have? I am using 3 nodes. What version are you running? The version is 1.1.6. It sounds like you have run simultaneous schema updates and the global schema has diverged. If you can create your schema in CLI or CQL I would recommend doing that.
If you are trying to do something more complicated you'll need to provide more information. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/11/2012, at 3:13 AM, Everton Lima peitin.inu...@gmail.com wrote: Some times, when I try to insert a data in Cassandra with Method: static void createColumnFamily(String keySpace, String columnFamily){ synchronized (mutex){ Iface cs = new CassandraServer(); CfDef cfDef = new CfDef(keySpace, colu
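Rather than the fixed Thread.sleep(3000) discussed above, a more robust pattern is to poll until the cluster reports a single schema version (Thrift exposes this as describe_schema_versions, which groups unreachable nodes under the pseudo-version "UNREACHABLE"). A sketch only; the Supplier is a stand-in for the real Thrift call, which needs a live cluster:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: poll a schema-version map (shaped like the result of Thrift's
// describe_schema_versions) until all reachable nodes report one schema
// version. The Supplier is a stand-in for the real call, not Cassandra API.
class SchemaAgreement {
    static boolean waitForAgreement(Supplier<Map<String, List<String>>> fetchVersions,
                                    long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            Map<String, List<String>> versions = fetchVersions.get();
            // Unreachable nodes are reported under a pseudo-version; ignore them.
            long liveVersions = versions.keySet().stream()
                    .filter(v -> !"UNREACHABLE".equals(v))
                    .count();
            if (liveVersions == 1) {
                return true; // every reachable node agrees on one schema
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The caller would invoke this after each createColumnFamily instead of sleeping a fixed interval.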
Re: SchemaDisagreementException
http://www.acunu.com/2/post/2011/12/cql-benchmarking.html Last I checked, thrift still had an edge over cql due to string serialization and deserialization. Might be even more dramatic for larger columns. Not that client speed matters much overall in cassandra's speed, but the CQL client does more. On Mon, Nov 19, 2012 at 9:27 PM, Michael Kjellman mkjell...@barracuda.com wrote: While this might not be helpful (I don't have all the thread history here), have you checked that all your servers are properly synced with NTP? From: Everton Lima peitin.inu...@gmail.com Reply-To: user@cassandra.apache.org user@cassandra.apache.org Date: Monday, November 19, 2012 6:24 PM To: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: SchemaDisagreementException Yes, I have already tested it. I use the object CassandraServer to do the operations instead of opening a connection with CassandraClient. Both of these objects implement Iface. I think the performance of using CassandraServer improves because it does not open a connection, while CassandraClient (which uses thrift) and CQL do open a connection. 2012/11/19 Tyler Hobbs ty...@datastax.com Have you actually tested to see that the Thrift API is more performant than CQL for your application? As far as I know, CQL almost always has a performance advantage over the Thrift API. On Mon, Nov 19, 2012 at 1:05 PM, Everton Lima peitin.inu...@gmail.com wrote: For some reason I can not reply to my old thread on this list, so I am creating a new one. The problem is that I do not use thrift, in order to gain performance. Why is it nicer with concurrent modifications? I do not know why I have run into the problem of concurrent modification if I was creating 2 different keyspaces in only one process with just one thread. Does someone know why?
-- Everton Lima Aleixo Bacharel em Ciencia da Computação Universidade Federal de Goiás -- Tyler Hobbs DataStax -- Everton Lima Aleixo Bacharel em Ciencia da Computação Universidade Federal de Goiás -- 'Like' us on Facebook for exclusive content and other resources on all Barracuda Networks solutions. Visit http://barracudanetworks.com/facebook
Re: Query regarding SSTable timestamps and counts
On Tue, Nov 20, 2012 at 5:23 PM, aaron morton aa...@thelastpickle.com wrote: My understanding of the compaction process was that since data files keep continuously merging we should not have data files with very old last modified timestamps It is perfectly OK to have very old SSTables. But performing an upgradesstables did decrease the number of data files and removed all the data files with the old timestamps. upgradesstables rewrites every sstable to have the same contents in the newest format. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/11/2012, at 4:57 PM, Ananth Gundabattula agundabatt...@gmail.com wrote: Hello Aaron, Thanks a lot for the reply. Looks like the documentation is confusing. Here is the link I am referring to: http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction It does not disable compaction. As per the above url, "After running a major compaction, automatic minor compactions are no longer triggered, frequently requiring you to manually run major compactions on a routine basis." (Just before the heading Tuning Column Family compression in the above link) With respect to the replies below: it creates one big file, which will not be compacted until there are (by default) 3 other very big files. This is for minor compaction; a major compaction should theoretically result in one large file irrespective of the number of data files initially? This is not something you have to worry about. Unless you are seeing 1,000's of files using the default compaction. Well, my worry has been because of the large amount of node movement we have done in the ring. We started off with 6 nodes and increased the capacity to 12 with disproportionate increases every time, which resulted in a lot of cleaning of data folders (except system), running repair and then a cleanup, with an aborted attempt in between.
There were some data.db files older than 2 weeks that were not modified since then. My understanding of the compaction process was that since data files keep continuously merging we should not have data files with very old last modified timestamps (assuming there is a good amount of writes to the table continuously). I did not have a sure way of telling if everything is alright with the compaction by looking at the last modified timestamps of all the data.db files. What are the compaction issues you are having? Your replies confirm that the timestamps should not be an issue to worry about. So I guess I should not be calling them issues any more. But performing an upgradesstables did decrease the number of data files and removed all the data files with the old timestamps. Regards, Ananth On Mon, Nov 19, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.com wrote: As per datastax documentation, a manual compaction forces the admin to start compaction manually and disables the automated compaction (at least for major compactions but not minor compactions) It does not disable compaction. it creates one big file, which will not be compacted until there are (by default) 3 other very big files. 1. Does a nodetool stop compaction also force the admin to manually run major compaction (i.e. disable automated major compactions?) No. Stop just stops the current compaction. Nothing is disabled. 2. Can a node restart reset the automated major compaction if a node gets into a manual mode compaction for whatever reason? Major compaction is not automatic. It is the manual nodetool compact command. Automatic (minor) compaction is controlled by min_compaction_threshold and max_compaction_threshold (for the default compaction strategy). 3. What is the ideal number of SSTables for a table in a keyspace (I mean, are there any indicators as to whether my compaction is alright or not?) This is not something you have to worry about.
Unless you are seeing 1,000's of files using the default compaction. For example, I have seen SSTables on the disk more than 10 days old wherein there were other SSTables belonging to the same table but much younger than the older SSTables. No problems. 4. Does an upgradesstables fix any compaction issues? What are the compaction issues you are having? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/11/2012, at 1:18 AM, Ananth Gundabattula agundabatt...@gmail.com wrote: We have a cluster running cassandra 1.1.4. On this cluster, 1. We had to move the nodes around a bit when we were adding new nodes (there was quite a good amount of node movement) 2. We had to stop compactions during some of the days to save some disk space on some of the nodes when they were running very low on disk space. (via nodetool stop COMPACTION) As per datastax documentation, a manual
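The thresholds Aaron mentions can be inspected and tuned per column family with nodetool; the keyspace and column family names below are placeholders:

```
# inspect and adjust the size-tiered min/max thresholds
nodetool -h localhost getcompactionthreshold MyKeyspace MyCF
nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 4 32
# rewrite every sstable into the newest format
# (the operation that removed the old-timestamped files in this thread)
nodetool -h localhost upgradesstables MyKeyspace MyCF
```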
Re: Other problem in update
I am just taking a stab at this one. UUIDs interact with system time and maybe your real-time OS is doing something funky there. The other option, which seems more likely, is that your unit tests are not cleaning up their data directory and there is some corrupt data in there. On Tue, Nov 27, 2012 at 7:40 AM, Everton Lima peitin.inu...@gmail.com wrote: People, when I try to execute my program that uses EmbeddedCassandraService, with version 1.1.2 of cassandra on the OpenSuse Real Time operating system, it throws the following exception: [27/11/12 10:27:28,314 BRST] ERROR service.CassandraDaemon: Exception in thread Thread[MutationStage:20,5,main] java.lang.NullPointerException at org.apache.cassandra.utils.UUIDGen.decompose(UUIDGen.java:96) at org.apache.cassandra.cql.jdbc.JdbcUUID.decompose(JdbcUUID.java:55) at org.apache.cassandra.db.marshal.UUIDType.decompose(UUIDType.java:187) at org.apache.cassandra.db.RowMutation.hintFor(RowMutation.java:107) at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:582) at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:557) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) When I try to execute the same program in Ubuntu 12.04 the program starts without ERRORS. Could someone help me? -- Everton Lima Aleixo Bacharel em Ciencia da Computação Universidade Federal de Goiás
Re: Java high-level client
Hector does not require an outdated version of thrift; you are likely using an outdated version of hector. Here is the long and short of it: If the Thrift API changes then hector can have compatibility issues. This happens from time to time. The main methods like get() and insert() have remained the same, but the CFMetaData objects have changed. (This causes the incompatible class stuff you are seeing.) CQL has a different version of the same problem: the CQL syntax is versioned. For example, if you try to execute a CQL3 query as a CQL2 query it will likely fail. In the end your code still has to be version aware. With hector you get a compile time problem, with pure CQL you get a runtime problem. I have always had the opinion the project should have shipped hector with Cassandra; this would have made it obvious what version is likely to work. The new CQL transport client is not being shipped with Cassandra either, so you will still have to match up the versions. Although they should be largely compatible, some time in the near or far future one of the clients probably won't work with one of the servers. Edward On Tue, Nov 27, 2012 at 11:10 AM, Michael Kjellman mkjell...@barracuda.com wrote: Netflix has a great client https://github.com/Netflix/astyanax On 11/27/12 7:40 AM, Peter Lin wool...@gmail.com wrote: I use hector-client master, which is pretty stable right now. It uses the latest thrift, so you can use hector with thrift 0.9.0. That's assuming you don't mind using the active development branch. On Tue, Nov 27, 2012 at 10:36 AM, Carsten Schnober schno...@ids-mannheim.de wrote: Hi, I'm aware that this has been a frequent question, but answers are still hard to find: what's an appropriate Java high-level client? I actually believe that the lack of a single maintained Java API that is packaged with Cassandra is quite an issue.
The way the situation is right now, new users have to pick more or less randomly one of the available options from the Cassandra Wiki and find a suitable solution for their individual requirements through trial implementations. This can cause a lot of wasted time (and frustration). Personally, I've played with Hector before figuring out that it seems to require an outdated Thrift version. Downgrading to Thrift 0.6 is not an option for me though because I use Thrift 0.9.0 in other classes of the same project. So I've had a look at Kundera and at Easy-Cassandra. Both seem to lack real documentation beyond the examples available in their Github repositories, right? Can more experienced users recommend either one of the two or some of the other options listed at the Cassandra Wiki? I know that this strongly depends on individual requirements, but all I need are simple requests for very basic queries. So I would like to emphasize the importance of clear documentation and a stable and well-maintained API. Any hints? Thanks! Carsten -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-mannheim.de Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de Korpusanalyseplattform der nächsten Generation Next Generation Corpus Analysis Platform
Re: counters + replication = awful performance?
The difference between replication factor = 1 and replication factor > 1 is significant. Also it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes. You may want to experiment with the very dangerous column family attribute: - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Edward On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman mkjell...@barracuda.com wrote: Are you writing with QUORUM consistency or ONE? On 11/27/12 9:52 AM, Sergey Olefir solf.li...@gmail.com wrote: Hi Juan, thanks for your input! In my case, however, I doubt this is the case -- clients are able to push many more updates than I need to saturate the replication_factor=2 case (e.g. I'm doing as many as 6x more increments when testing a 2-node cluster with replication_factor=1), so bandwidth between clients and server should be sufficient. Bandwidth between nodes in the cluster should also be quite sufficient since they are both in the same DC. But it is something to check, thanks! Best regards, Sergey Juan Valencia wrote Hi Sergey, I know I've had similar issues with counters which were bottle-necked by network throughput. You might be seeing a problem with throughput between the clients and Cass or between the two Cass nodes. It might not be your case, but that was what happened to me :-) Juan On Tue, Nov 27, 2012 at 8:48 AM, Sergey Olefir <solf.lists@...> wrote: Hi, I have a serious problem with counters performance and I can't seem to figure it out. Basically I'm building a system for accumulating some statistics on the fly via Cassandra distributed counters. For this I need counter updates to work really fast and herein lies my problem -- as soon as I enable replication_factor = 2, the performance goes down the drain. This happens in my tests using both 1.0.x and 1.1.6.
Let me elaborate: I have two boxes (virtual servers on top of physical servers rented specifically for this purpose, i.e. it's not a cloud, nor it is shared; virtual servers are managed by our admins as a way to limit damage as I suppose :)). Cassandra partitioner is set to ByteOrderedPartitioner because I want to be able to do some range queries. First, I set up Cassandra individually on each box (not in a cluster) and test counter increments performance (exclusively increments, no reads). For tests I use code that is intended to somewhat resemble the expected load pattern -- particularly the majority of increments create new counters with some updating (adding) to already existing counters. In this test each single node exhibits respectable performance - something on the order of 70k (seventy thousand) increments per second. I then join both of these nodes into single cluster (using SimpleSnitch and SimpleStrategy, nothing fancy yet). I then run the same test using replication_factor=1. The performance is on the order of 120k increments per second -- which seems to be a reasonable increase over the single node performance. HOWEVER I then rerun the same test on the two-node cluster using replication_factor=2 -- which is the least I'll need for actual production for redundancy purposes. And the performance I get is absolutely horrible -- much, MUCH worse than even single-node performance -- something on the order of less than 25k increments per second. In addition to clients not being able to push updates fast enough, I also see a lot of 'messages dropped' messages in the Cassandra log under this load. Could anyone advise what could be causing such drastic performance drop under replication_factor=2? I was expecting something on the order of single-node performance, not approximately 3x less. When testing replication_factor=2 on 1.1.6 I can see that CPU usage goes through the roof. 
On 1.0.x I think it looked more like disk overload, but I'm not sure (being on a virtual server I apparently can't see true iostats). I do have Cassandra data on a separate disk; commit log and cache are currently on the same disk as the system. I experimented with commit log flush modes and even with disabling the commit log altogether -- but it doesn't seem to have noticeable impact on the performance when under replication_factor=2. Any suggestions and hints will be much appreciated :) And please let me know if I need to share additional information about the configuration I'm running on. Best regards, Sergey -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993.html Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
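A crude way to sanity-check the RF=1 vs RF=2 numbers in this thread: each client op fans out to RF replica writes spread over the cluster, so on two nodes at RF=2 every node takes every write and cluster throughput should roughly halve relative to RF=1, not drop more than 4x. The toy model below is an assumption of mine, not from the thread, and ignores the replica-side read that counter increments add:

```java
// Toy fanout model: per-node replica writes for a given client op rate.
// Assumption: ignores counter read-before-write and coordination overhead.
class FanoutModel {
    static double replicaWritesPerNode(double clientOpsPerSec, int rf, int nodes) {
        return clientOpsPerSec * rf / nodes;
    }
}
```

Under this model, the observed 120k ops/s at RF=1 costs each node ~60k replica writes/s, so RF=2 should still sustain roughly 60-70k client ops/s rather than the observed ~25k, which is why the thread suspects counters specifically.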
Re: selective replication of keyspaces
You can do something like this: Divide your nodes up into 4 datacenters art1,art2,art3,core [default@unknown] create keyspace art1 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art1:2,core:2}]; [default@unknown] create keyspace art2 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art2:2,core:2}]; [default@unknown] create keyspace art3 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art3:2,core:2}]; [default@unknown] create keyspace core placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{core:2}]; On Tue, Nov 27, 2012 at 5:02 PM, Artist jer...@simpleartmarketing.com wrote: I have 3 art-servers, each with a cassandra cluster. Each of the art-servers has config/state information stored in keyspaces respectively called art-server-1-current-state, art-server-2-current-state, art-server-3-current-state. In my core server I have a separate Cassandra cluster. I would like to use Cassandra to replicate the current-state of each art-server on the core cassandra server without sharing that information with any of the art-servers. Is there a way to replicate the keyspaces to a single Cassandra cluster (my core) without having any peer sharing between the 3 art-servers. - Artist -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/selective-replication-of-keyspaces-tp7584007.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
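For the datacenter names (art1..core) in those strategy_options to mean anything, the snitch has to place each node into the matching datacenter; with PropertyFileSnitch that is cassandra-topology.properties on every node. The addresses below are illustrative only:

```
# cassandra-topology.properties (illustrative addresses) -- ip=DC:rack
10.0.1.1=art1:RAC1
10.0.2.1=art2:RAC1
10.0.3.1=art3:RAC1
10.0.4.1=core:RAC1
10.0.4.2=core:RAC1
default=core:RAC1
```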
Re: counters + replication = awful performance?
I misspoke, really. It is not dangerous; you just have to understand what it means. This jira discusses it. https://issues.apache.org/jira/browse/CASSANDRA-3868 On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay sco...@mailchannels.com wrote: We're having a similar performance problem. Setting 'replicate_on_write: false' fixes the performance issue in our tests. How dangerous is it? What exactly could go wrong? On 12-11-27 01:44 PM, Edward Capriolo wrote: The difference between replication factor = 1 and replication factor > 1 is significant. Also it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes. You may want to experiment with the very dangerous column family attribute: - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Edward On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman mkjell...@barracuda.com wrote: Are you writing with QUORUM consistency or ONE? On 11/27/12 9:52 AM, Sergey Olefir solf.li...@gmail.com wrote: Hi Juan, thanks for your input! In my case, however, I doubt this is the case -- clients are able to push many more updates than I need to saturate the replication_factor=2 case (e.g. I'm doing as many as 6x more increments when testing a 2-node cluster with replication_factor=1), so bandwidth between clients and server should be sufficient. Bandwidth between nodes in the cluster should also be quite sufficient since they are both in the same DC. But it is something to check, thanks! Best regards, Sergey Juan Valencia wrote Hi Sergey, I know I've had similar issues with counters which were bottle-necked by network throughput. You might be seeing a problem with throughput between the clients and Cass or between the two Cass nodes.
It might not be your case, but that was what happened to me :-) Juan On Tue, Nov 27, 2012 at 8:48 AM, Sergey Olefir <solf.lists@...> wrote: Hi, I have a serious problem with counters performance and I can't seem to figure it out. Basically I'm building a system for accumulating some statistics on the fly via Cassandra distributed counters. For this I need counter updates to work really fast and herein lies my problem -- as soon as I enable replication_factor = 2, the performance goes down the drain. This happens in my tests using both 1.0.x and 1.1.6. Let me elaborate: I have two boxes (virtual servers on top of physical servers rented specifically for this purpose, i.e. it's not a cloud, nor is it shared; virtual servers are managed by our admins as a way to limit damage as I suppose :)). Cassandra partitioner is set to ByteOrderedPartitioner because I want to be able to do some range queries. First, I set up Cassandra individually on each box (not in a cluster) and test counter increments performance (exclusively increments, no reads). For tests I use code that is intended to somewhat resemble the expected load pattern -- particularly the majority of increments create new counters with some updating (adding) to already existing counters. In this test each single node exhibits respectable performance - something on the order of 70k (seventy thousand) increments per second. I then join both of these nodes into a single cluster (using SimpleSnitch and SimpleStrategy, nothing fancy yet). I then run the same test using replication_factor=1. The performance is on the order of 120k increments per second -- which seems to be a reasonable increase over the single node performance. HOWEVER I then rerun the same test on the two-node cluster using replication_factor=2 -- which is the least I'll need for actual production for redundancy purposes.
And the performance I get is absolutely horrible -- much, MUCH worse than even single-node performance -- something on the order of less than 25k increments per second. In addition to clients not being able to push updates fast enough, I also see a lot of 'messages dropped' messages in the Cassandra log under this load. Could anyone advise what could be causing such drastic performance drop under replication_factor=2? I was expecting something on the order of single-node performance, not approximately 3x less. When testing replication_factor=2 on 1.1.6 I can see that CPU usage goes through the roof. On 1.0.x I think it looked more like disk overload, but I'm not sure (being on virtual server I apparently can't see true iostats). I do have Cassandra data on a separate disk, commit log and cache are currently on the same disk as the system. I experimented with commit log flush modes and even with disabling commit log at all -- but it doesn't seem to have noticeable impact on the performance when under replication_factor=2. Any suggestions and hints will be much
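For reference, the replicate_on_write attribute under discussion is set per column family; in cassandra-cli that looks like the following (keyspace and column family names are placeholders, and as the JIRA issue above notes, think carefully before turning it off):

```
[default@MyKeyspace] update column family counters with replicate_on_write = false;
```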
Re: counters + replication = awful performance?
Cassandra's counters read on increment. Additionally they are distributed, so there can be multiple reads on increment. If they are not fast enough and you have avoided all tuning options, add more servers to handle the load. In many cases incrementing the same counter n times can be avoided. Twitter's rainbird did just that. It avoided multiple counter increments by batching them. I have done a similar thing using cassandra and Kafka. https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java On Tuesday, November 27, 2012, Sergey Olefir solf.li...@gmail.com wrote: Hi, thanks for your suggestions. Regarding replicate=2 vs replicate=1 performance: I expected that the below configurations would have similar performance: - single node, replicate = 1 - two nodes, replicate = 2 (okay, this probably should be a bit slower due to additional overhead). However what I'm seeing is that the second option (replicate=2) is about THREE times slower than single node. Regarding replicate_on_write -- it is, in fact, a dangerous option. As the JIRA discusses, if you make changes to your ring (moving tokens and such) you will *silently* lose data. That is on top of whatever data you might end up losing if you run replicate_on_write=false and the only node that got the data fails. But what is much worse -- with replicate_on_write being false the data will NOT be replicated (in my tests) ever unless you explicitly request the cell. Then it will return the wrong result. And only on subsequent reads it will return adequate results. I haven't tested it, but documentation states that a range query will NOT do 'read repair' and thus will not force replication.
The test I did went like this: - replicate_on_write = false - write something to node A (which should in theory replicate to node B) - wait for a long time (longest was on the order of 5 hours) - read from node B (and here I was getting null / wrong result) - read from node B again (here you get what you'd expect after read repair) In essence, using replicate_on_write=false with rarely read data will practically defeat the purpose of having replication in the first place (failover, data redundancy). Or, in other words, this option doesn't look to be applicable to my situation. It looks like I will get much better performance by simply writing to two separate clusters rather than using a single cluster with replicate=2. Which is kind of stupid :) I think something's fishy with counters and replication. Edward Capriolo wrote I misspoke, really. It is not dangerous; you just have to understand what it means. This jira discusses it. https://issues.apache.org/jira/browse/CASSANDRA-3868 On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@...> wrote: We're having a similar performance problem. Setting 'replicate_on_write: false' fixes the performance issue in our tests. How dangerous is it? What exactly could go wrong? On 12-11-27 01:44 PM, Edward Capriolo wrote: The difference between replication factor = 1 and replication factor > 1 is significant. Also it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes. You may want to experiment with the very dangerous column family attribute: - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Edward On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman mkjellman@ wrote: Are you writing with QUORUM consistency or ONE? On 11/27/12 9:52 AM, Sergey Olefir <solf.lists@...> wrote: Hi Juan, thanks for your input!
In my case, however, I doubt this is the case -- clients are able to push many more updates than I need to saturate the replication_factor=2 case (e.g. I'm doing as many as 6x more increments when testing a 2-node cluster with replication_factor=1), so bandwidth between clients and server should be sufficient. Bandwidth between nodes in the cluster should also be quite sufficient since they are both in the same DC. But it is something to check, thanks! Best regards, Sergey Juan Valencia wrote Hi Sergey, I know I've had similar issues with counters which were bottle-necked by network throughput. You might be seeing a problem with throughput between the clients and Cass or between the two Cass nodes. It might not be your case, but that was what happened to me :-) Juan On Tue, Nov 27, 2012 at 8:48 AM, Sergey Olefir <solf.lists@...> wrote: Hi, I have a serious problem with counters performance and I can't seem to figure it out. Basically I'm building a system for accumulating some statistics on the fly via Cassandra distributed counters. For this I need counter updates to work really fast and herein lies my problem -- as soon as I enable replication_factor = 2
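Edward's rainbird-style batching suggestion above can be sketched as a small client-side coalescer: accumulate deltas in memory and issue one counter update per key per flush interval, instead of n updates for n events. The class and names below are illustrative only, not a real Cassandra client API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of client-side counter batching (rainbird-style): coalesce many
// +1s to the same counter in memory, then flush one increment per counter.
// Illustrative only; not a real Cassandra client API.
class CounterBatcher {
    private final Map<String, Long> pending = new HashMap<>();

    void increment(String counter, long delta) {
        pending.merge(counter, delta, Long::sum);
    }

    // Returns the coalesced deltas and clears the buffer; a real
    // implementation would issue one counter update per entry here.
    Map<String, Long> flush() {
        Map<String, Long> out = new HashMap<>(pending);
        pending.clear();
        return out;
    }
}
```

A periodic flush (say, once per second) trades a small window of potential loss on client crash for far fewer replicated counter writes.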
Re: counters + replication = awful performance?
By the way, the other issues you are seeing with replicate_on_write = false could be because you did not repair. You should do that when changing rf. On Tuesday, November 27, 2012, Edward Capriolo edlinuxg...@gmail.com wrote: Cassandra's counters read on increment. Additionally they are distributed, so there can be multiple reads on increment. If they are not fast enough and you have avoided all tuning options, add more servers to handle the load. In many cases incrementing the same counter n times can be avoided. Twitter's rainbird did just that. It avoided multiple counter increments by batching them. I have done a similar thing using cassandra and Kafka. https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java On Tuesday, November 27, 2012, Sergey Olefir solf.li...@gmail.com wrote: Hi, thanks for your suggestions. Regarding replicate=2 vs replicate=1 performance: I expected that the below configurations would have similar performance: - single node, replicate = 1 - two nodes, replicate = 2 (okay, this probably should be a bit slower due to additional overhead). However what I'm seeing is that the second option (replicate=2) is about THREE times slower than single node. Regarding replicate_on_write -- it is, in fact, a dangerous option. As the JIRA discusses, if you make changes to your ring (moving tokens and such) you will *silently* lose data. That is on top of whatever data you might end up losing if you run replicate_on_write=false and the only node that got the data fails. But what is much worse -- with replicate_on_write being false the data will NOT be replicated (in my tests) ever unless you explicitly request the cell. Then it will return the wrong result. And only on subsequent reads it will return adequate results. I haven't tested it, but documentation states that a range query will NOT do 'read repair' and thus will not force replication.
The test I did went like this: - replicate_on_write = false - write something to node A (which should in theory replicate to node B) - wait for a long time (longest was on the order of 5 hours) - read from node B (and here I was getting null / wrong result) - read from node B again (here you get what you'd expect after read repair) In essence, using replicate_on_write=false with rarely read data will practically defeat the purpose of having replication in the first place (failover, data redundancy). Or, in other words, this option doesn't look to be applicable to my situation. It looks like I will get much better performance by simply writing to two separate clusters rather than using a single cluster with replicate=2. Which is kind of stupid :) I think something's fishy with counters and replication. Edward Capriolo wrote: I misspoke really. It is not dangerous, you just have to understand what it means. This JIRA discusses it: https://issues.apache.org/jira/browse/CASSANDRA-3868 On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay scottm@... wrote: We're having a similar performance problem. Setting 'replicate_on_write: false' fixes the performance issue in our tests. How dangerous is it? What exactly could go wrong? On 12-11-27 01:44 PM, Edward Capriolo wrote: The difference between replication factor = 1 and replication factor > 1 is significant. Also it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes. You may want to experiment with the very dangerous column family attribute: - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Edward On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman mkjellman@... wrote: Are you writing with QUORUM consistency or ONE? On 11/27/12 9:52 AM, Sergey Olefir solf.lists@... wrote: Hi Juan, thanks for your input!
In my case, however, I doubt this is the case -- clients are able to push many more updates than I need to saturate the replication_factor=2 case (e.g. I'm doing as many as 6x more increments when testing a 2-node cluster with replication_factor=1), so bandwidth between clients and server should be sufficient. Bandwidth between nodes in the cluster should also be … -- *Scott McKay*, Sr. Software Developer MailChannels Tel: +1 604 685 7488 x 509 www.mailchannels.com -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584011.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
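Edward's batching suggestion can be put in a few lines. This is an illustrative sketch of the Rainbird/IronCount idea (accumulate deltas in memory, emit one increment per key per flush interval), not code from either project; `CounterBatcher` and `flush_fn` are invented names:

```python
from collections import defaultdict

class CounterBatcher:
    """Accumulate counter deltas in memory and flush them as one
    increment per key, so N application-level increments become a
    single Cassandra counter update (and a single read-on-increment)."""

    def __init__(self, flush_fn):
        # flush_fn(key, delta) would issue the actual counter update.
        self.flush_fn = flush_fn
        self.pending = defaultdict(int)

    def increment(self, key, delta=1):
        self.pending[key] += delta  # in-memory only, no I/O

    def flush(self):
        # Swap the pending map out, then emit one update per key.
        batch, self.pending = self.pending, defaultdict(int)
        for key, delta in batch.items():
            self.flush_fn(key, delta)
        return len(batch)
```

With this shape, 1,000 increments of one hot key cost one counter update per flush interval instead of 1,000, at the price of losing up to one interval of deltas if the process dies before flushing.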
Re: counters + replication = awful performance?
Say you are doing 100 inserts at RF=1 on two nodes. That is 50 inserts a node. If you go to RF=2 that is 100 inserts a node. If you were at 75% capacity on each node, you're now at 150%, which is not possible, so things bog down. To figure out what is going on we would need to see tpstats, iostat, and top information. I think you're looking at the performance the wrong way. Starting off at RF=1 is not the way to understand Cassandra performance. The benefits of scale-out do not appear until you fix your RF and increase your node count, i.e. 5 nodes at RF=3 is fast, 10 nodes at RF=3 even better. On Tuesday, November 27, 2012, Sergey Olefir solf.li...@gmail.com wrote: I already do a lot of in-memory aggregation before writing to Cassandra. The question here is what is wrong with Cassandra (or its configuration) that causes a huge performance drop when moving from 1-replication to 2-replication for counters -- and more importantly how to resolve the problem. A 2x-3x drop when moving from 1-replication to 2-replication on two nodes is reasonable. 6x is not. Like I said, with this kind of performance degradation it makes more sense to run two clusters with replication=1 in parallel rather than rely on Cassandra replication. And yes, Rainbird was the inspiration for what we are trying to do here :) Edward Capriolo wrote: Cassandra's counters read on increment. Additionally they are distributed, so there can be multiple reads on increment. If they are not fast enough and you have avoided all tuning options, add more servers to handle the load. In many cases incrementing the same counter n times can be avoided. Twitter's Rainbird did just that: it avoided multiple counter increments by batching them. I have done a similar thing using Cassandra and Kafka.
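The per-node load arithmetic Edward walks through (100 inserts at RF=1 on two nodes is 50 per node; RF=2 makes it 100 per node) reduces to one line; a minimal sketch, with an invented function name:

```python
def writes_per_node(total_writes, rf, nodes):
    """Each write is stored on `rf` replicas, so the per-node
    write load is total_writes * rf / nodes."""
    return total_writes * rf / nodes

# Two nodes, 100 writes/s total:
#   RF=1 -> 50 writes/s per node
#   RF=2 -> 100 writes/s per node (every node holds every write)
```

This is why a 2-node RF=2 cluster cannot be expected to match the throughput of a single RF=1 node: raising RF at fixed node count raises per-node load proportionally.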
Re: selective replication of keyspaces
My mistake, that is older CLI syntax; I was just showing the concept: set up 4 datacenters and selectively replicate keyspaces between them. On Tuesday, November 27, 2012, jer...@simpleartmarketing.com wrote: Thank you. This is a good start; I was beginning to think it couldn't be done. When I run the command I get the error syntax error at position 21: missing EOF at 'placement_strategy' -- that is probably because I still need to set the correct properties in the conf files On November 27, 2012 at 5:41 PM Edward Capriolo edlinuxg...@gmail.com wrote: You can do something like this: Divide your nodes up into 4 datacenters: art1, art2, art3, core [default@unknown] create keyspace art1 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art1:2,core:2}]; [default@unknown] create keyspace art2 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art2:2,core:2}]; [default@unknown] create keyspace art3 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art3:2,core:2}]; [default@unknown] create keyspace core placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{core:2}]; On Tue, Nov 27, 2012 at 5:02 PM, Artist jer...@simpleartmarketing.com wrote: I have 3 art-servers, each of which has a Cassandra cluster. Each of the art-servers has config/state information stored in keyspaces respectively called art-server-1-current-state, art-server-2-current-state, art-server-3-current-state. In my core server I have a separate Cassandra cluster. I would like to use Cassandra to replicate the current-state of each art-server on the core Cassandra server without sharing that information with any of the art-servers. Is there a way to replicate the keyspaces to a single Cassandra cluster (my core) without having any peer sharing between the 3 art-servers?
- Artist -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/selective-replication-of-keyspaces-tp7584007.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Generic questions over Cassandra 1.1/1.2
@Bill Are you saying that now Cassandra is less schema-less? :) Compact storage is the schemaless of old. On Tuesday, November 27, 2012, Bill de hÓra b...@dehora.net wrote: I'm not sure I always understand what people mean by 'schema less' exactly and I'm curious. For 'schema less', given this - {{{ cqlsh use example; cqlsh:example CREATE TABLE users ( ... user_name varchar, ... password varchar, ... gender varchar, ... session_token varchar, ... state varchar, ... birth_year bigint, ... PRIMARY KEY (user_name) ... ); }}} I expect this would not cause an unknown identifier error - {{{ INSERT INTO users (user_name, password, extra, moar) VALUES ('bob', 'secret', 'a', 'b'); }}} but definitions vary. Bill On 26/11/12 09:18, Sylvain Lebresne wrote: On Mon, Nov 26, 2012 at 8:41 AM, aaron morton aa...@thelastpickle.com wrote: Is there any noticeable performance difference between Thrift and CQL3? Off the top of my head it's within 5% (maybe 10%) under stress tests. See Eric's talk at the Cassandra SF conference for the exact numbers. Eric's benchmark results were that normal queries were slightly slower but prepared ones (and in real life, I see no good reason not to prepare statements) were actually slightly faster. CQL 3 requires a schema; however, altering the schema is easier, and in 1.2 it will support concurrent schema modifications. The Thrift API is still schema-less. Sorry to hijack this thread, but I'd be curious (like seriously, I'm not trolling) to understand what you mean by 'CQL 3 requires a schema but the Thrift API is still schema less'. Basically I'm not sure I always understand what people mean by schema less exactly and I'm curious. -- Sylvain
Re: counters + replication = awful performance?
I may be wrong, but during a bootstrap hints can be silently discarded if the node they are destined for leaves the ring. There are a large number of people using counters for 5-minute real-time statistics. On the back end they use ETL-based reporting to compute the true stats at an hourly or daily interval. A user like this might benefit from DANGER counters. They are not looking for perfection, only better performance, and the counter row keys themselves roll over in 5 minutes anyway. Options like this are also great for winning benchmarks. When some other NoSQL system (that is not as fast as C*) wants to win a benchmark, they turn off the WAL, or write acks, or something else that compromises their ACID/CAP story for the purpose of winning. We need our own secret awesome-sauce dangerous options too! jk On Wed, Nov 28, 2012 at 4:21 AM, Rob Coli rc...@palominodb.com wrote: On Tue, Nov 27, 2012 at 3:21 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I misspoke really. It is not dangerous, you just have to understand what it means. This JIRA discusses it: https://issues.apache.org/jira/browse/CASSANDRA-3868 Per Sylvain on the referenced ticket: I don't disagree about the efficiency of the valve, but at what price? 'Bootstrapping a node will make you lose increments (you don't know which ones, you don't know how many and this even if nothing goes wrong)' is a pretty bad drawback. That is pretty much why that option makes me uncomfortable: it does give you better performance, so people may be tempted to use it. Now if it was only a matter of replicating writes only through read-repair/repair, then ok, it's pretty dangerous but it's rather easy to explain/understand the drawback (if you don't lose a disk, you don't lose increments, and you'd better use CL.ALL or have read_repair_chance to 1). But the fact that it doesn't work with bootstrap/move makes me wonder if having the option at all is not making a disservice to users.
To me anything that can be described as will make you lose increments (you don't know which ones, you don't know how many and this even if nothing goes wrong) and which therefore doesn't work with bootstrap/move is correctly described as dangerous. :D =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: counters + replication = awful performance?
Just for reference, HBase's counters also do a local read. I am not saying they work better/worse/faster/slower, but I would not expect any system that reads on increment to be significantly faster than what Cassandra does. Just saying: your counter throughput is read-bound, and this is not unique to C*'s implementation. On Wed, Nov 28, 2012 at 2:41 PM, Sergey Olefir solf.li...@gmail.com wrote: Well, that is sad news then. I don't think I can consider 20k increments per second for a two-node cluster (with RF=2) reasonable performance (cost vs. benefit). I might have to look into other storage solutions or perhaps experiment with duplicate clusters with RF=1 or replicate_on_write=false. Although yes, I probably should try that row cache you mentioned -- I saw that the key cache was going unused (so saw no reason to try to enable the row cache), but I think that was on RF=1; it might be different on RF=2. Sylvain Lebresne-3 wrote: Counter replication works in a different way than that of normal writes. Namely, a counter update is written to a first replica, then a read is performed and the result of that is replicated to the other nodes. With RF=1, since there is only one replica, no read is involved, but in a way it's a degenerate case. So there are two reasons why RF=2 is much slower than RF=1: 1) it involves a read to replicate, and that read takes time. Especially if that read hits the disk, it may even dominate the insertion time. 2) the replication to the first replica and the one to the rest of the replicas are not done in parallel but sequentially. Note that this is only true for the first replica versus the others. In other words, from RF=2 to RF=3 you should not see a significant performance degradation. Note that while there is nothing you can do for 2), you can try to speed up 1) by using the row cache for instance (in case you weren't). In other words, with counters, it is expected that RF=1 be potentially much faster than RF>1. That is the way counters work.
And don't get me wrong, I'm not suggesting you should use RF=1 at all. What I am saying is that the performance you see with RF=2 is the performance of counters in Cassandra. -- Sylvain On Wed, Nov 28, 2012 at 7:34 AM, Sergey Olefir solf.lists@... wrote: I think there might be a misunderstanding as to the nature of the problem. Say I have test set T, and two identical servers A and B. - I tested that server A (singly) is able to handle the load of T. - I tested that server B (singly) is able to handle the load of T. - I then join A and B in a cluster and set replication=2 -- because there are two servers and replication=2, each server effectively has to handle all the data written to the cluster. Under these circumstances it is reasonable to assume that the cluster A+B shall be able to handle load T, because each server is able to do so individually. HOWEVER, this is not the case. In fact, A+B together are only able to handle less than 1/3 of T, DESPITE the fact that A and B individually are able to handle T just fine. I think there's something wrong with Cassandra replication (possibly as simple as me misconfiguring something) -- it shouldn't be three times faster to write to two separate nodes in parallel as compared to writing to a 2-node Cassandra cluster with replication=2.
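Sylvain's description of the counter write path (leader applies the write, performs a local read, and only then replicates to the remaining replicas, which proceed in parallel with each other) suggests a rough latency model. This is a toy sketch, not Cassandra's actual code path, and treating the replicate step as costing one more write is an assumption:

```python
def counter_update_latency(write_ms, read_ms, rf):
    """Toy latency model for one counter update:
    leader write, then a local read, then (sequentially after the
    read) replication to the rf-1 other replicas, modeled here as
    one more write since those replicas apply it in parallel."""
    if rf == 1:
        return write_ms                    # degenerate case: no read, no replication
    return write_ms + read_ms + write_ms   # leader write + read + parallel replicate
```

The model makes Sylvain's two points concrete: the read can dominate the update (especially if it hits disk), and going from RF=2 to RF=3 adds little, because only the first-replica-versus-the-rest step is sequential.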
Re: Java high-level client
Astyanax is a Hector fork. You can see many of the Hector authors' comments still in the Astyanax code. There is some nice stuff in there, but (IMHO) I do not see the fork as necessary. It has split up the community a bit, as there are now 3 high-level Java clients. I would advise following Josh's advice http://www.youtube.com/watch?v=nPG4sK_glls . Go to reddit and select whatever sexy technology is new and trending :) On Wed, Nov 28, 2012 at 2:51 PM, Michael Kjellman mkjell...@barracuda.com wrote: Lots of example code, a nice API, and good performance are the first things that come to mind for why I like Astyanax better than Hector. From: Andrey Ilinykh ailin...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, November 28, 2012 11:49 AM To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com Subject: Re: Java high-level client First of all, it is backed by Netflix. They have used it in production for a long time, so it is pretty solid. Also they have a nice tool (Priam) which makes Cassandra cloud (AWS) friendly. This is important for us. Andrey On Wed, Nov 28, 2012 at 11:53 AM, Wei Zhu wz1...@yahoo.com wrote: We are using Hector now. What is the major advantage of Astyanax over Hector? Thanks. -Wei -- *From:* Andrey Ilinykh ailin...@gmail.com *To:* user@cassandra.apache.org *Sent:* Wednesday, November 28, 2012 9:37 AM *Subject:* Re: Java high-level client +1 On Tue, Nov 27, 2012 at 10:10 AM, Michael Kjellman mkjell...@barracuda.com wrote: Netflix has a great client https://github.com/Netflix/astyanax
Re: Rename cluster
Since the cluster name is only cosmetic, people do not often change it. I would not do this on a production cluster for sure. On Thu, Nov 29, 2012 at 2:56 PM, Wei Zhu wz1...@yahoo.com wrote: Hi, I am trying to rename a cluster by following the instructions on the wiki: Cassandra says ClusterName mismatch: oldClusterName != newClusterName and refuses to start. To prevent operator errors, Cassandra stores the name of the cluster in its system table. If you need to rename a cluster for some reason, you can: Perform these steps on each node: 1. Start the cassandra-cli connected locally to this node. 2. Run the following: 1. use system; 2. set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('new cluster name'); 3. exit; 3. Run nodetool flush on this node. 4. Update the cassandra.yaml file for the cluster_name to be the same as 2b). 5. Restart the node. Once all nodes have had this operation performed and restarted, nodetool ring should show all nodes as UP. Get the following error: Connected to: Test Cluster on 10.200.128.151/9160 Welcome to Cassandra CLI version 1.1.6 Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit. [default@unknown] use system; Authenticated to keyspace: system [default@system] set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('General Services Cluster'); system keyspace is not user-modifiable. InvalidRequestException(why:system keyspace is not user-modifiable.)
at org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:15974) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:797) at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:781) at org.apache.cassandra.cli.CliClient.executeSet(CliClient.java:909) at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:222) at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:219) at org.apache.cassandra.cli.CliMain.main(CliMain.java:346) I have to remove the data directory in order to change the cluster name. Luckily it's my testing box, so no harm. Just wondering what has been changed not to allow the modification through cli? What is the way of changing the cluster name without wiping out all the data now? Thanks. -Wei
Re: Row caching + Wide row column family == almost crashed?
Row cache has to store the entire row. It is a very bad option for wide rows. On Sunday, December 2, 2012, Mike mthero...@yahoo.com wrote: Hello, We recently hit an issue within our Cassandra-based application. We have a relatively new column family with some very wide rows (tens of thousands of columns, or more in some cases). During a periodic activity, we scan ranges of columns to retrieve various pieces of information, a segment at a time. We run these same queries frequently at various stages of the process, and I thought the application could see a performance benefit from row caching. We have a small row cache (100MB per node) already enabled, and I enabled row caching on the new column family. The results were very negative. When performing range queries with a limit of 200 results, for a small minority of the rows in the new column family, performance plummeted. CPU utilization on the Cassandra node went through the roof, and it started chewing up memory. Some queries to this column family hung completely. According to the logs, we started getting frequent GCInspector messages. Cassandra started flushing the largest memtables due to hitting the flush_largest_memtables_at threshold of 75%, and scaling back the key/row caches. However, to Cassandra's credit, it did not die with an OutOfMemory error. Its emergency measures to conserve memory worked, and the cluster stayed up and running. No real errors showed in the logs, except for messages getting dropped, which I believe was caused by what was going on with CPU and memory. Disabling row caching on this new column family has resolved the issue for now, but is there something fundamental about row caching that I am missing? We are running Cassandra 1.1.2 with a 6-node cluster, with a replication factor of 3. Thanks, -Mike
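A back-of-envelope calculation shows why whole-row caching and wide rows interact badly; the column counts and sizes below are illustrative assumptions, not numbers from Mike's cluster:

```python
def row_cache_rows(cache_bytes, columns_per_row, bytes_per_column):
    """The row cache holds entire rows, so the number of rows that
    fit is the cache size divided by the (whole) row size."""
    row_bytes = columns_per_row * bytes_per_column
    return cache_bytes // row_bytes

# 100 MiB cache, wide rows of 50,000 columns at ~100 bytes each:
# each row is ~5 MB, so only ~20 rows fit -- and every partial
# read of such a row still pulls the entire row into the cache.
```

With rows that large, a handful of wide rows evicts everything else, and materializing each one on a cache miss burns CPU and heap, which matches the GC pressure and emergency flushing Mike describes.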
Re: What is substituting keys_cached column family argument
Rob, have you played with this? I have many CFs, some big, some small, some using large caches, some using small ones, some that take many requests, some that take a few. Over time I have cooked up a strategy for how to share the cache love; even though it may not be the best solution to the problem, I feel it makes sense. I cannot figure out how I am going to be happy with global caches whose size I do not control. What is your take on this? Edward On Wed, Dec 5, 2012 at 2:05 PM, Rob Coli rc...@palominodb.com wrote: On Wed, Dec 5, 2012 at 9:06 AM, Roman Yankin ro...@cognitivematch.com wrote: In Cassandra v0.7 there was a column family property called keys_cached; now it's gone and I'm struggling to understand which of the below properties has substituted it (if substituted at all)? Key and row caches are global in modern Cassandra. You opt CFs out of the key cache, not opt in, because the default setting is keys_only on a per-CF basis. http://www.datastax.com/docs/1.1/configuration/node_configuration#row-cache-keys-to-save http://www.datastax.com/docs/1.1/configuration/node_configuration#key-cache-keys-to-save http://www.datastax.com/docs/1.1/configuration/storage_configuration#caching =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.
http://wiki.apache.org/cassandra/LargeDataSetConsiderations On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L wade.l.poziom...@intel.com wrote: “Having so much data on each node is a potential bad day.” Is this discussed somewhere in the Cassandra documentation (limits, practices etc)? We are also trying to load up quite a lot of data and have hit memory issues (bloom filter etc.) in 1.0.10. I would like to read up on big-data usage of Cassandra, meaning terabyte-size databases. I do get your point about the amount of time required to recover a downed node. But this 300-400GB business is interesting to me. Thanks in advance. Wade *From:* aaron morton [mailto:aa...@thelastpickle.com] *Sent:* Wednesday, December 05, 2012 9:23 PM *To:* user@cassandra.apache.org *Subject:* Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction. Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected! I would recommend having up to 300GB to 400GB per node on a regular HDD with 1GB networking. But on the 3rd node, we suspect major compaction didn't actually finish its job… The file list looks odd. Check the timestamps on the files. You should not have files older than when compaction started. 8GB heap The default is 4GB max nowadays. 1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below? I cannot answer that. 2) Should we restart with leveled compaction next year? I would run some tests to see how it works for your workload. 4) Should we consider increasing the cluster capacity? IMHO yes. You may also want to do some experiments with turning compression on if it is not already enabled. Having so much data on each node is a potential bad day.
If instead you had to move or repair one of those nodes, how long would it take for Cassandra to stream all the data over? (Or to rsync the data over.) How long does it take to run nodetool repair on the node? With RF=3, if you lose a node you have lost your redundancy. It's important to have a plan for how to get it back and how long it may take. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 6/12/2012, at 3:40 AM, Alexandru Sicoe adsi...@gmail.com wrote: Hi guys, Sorry for the late follow-up, but I waited to run major compactions on all 3 nodes at a time before replying with my findings. Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected! But on the 3rd node, we suspect major compaction didn't actually finish its job. First of all, nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly, from the 1.4TB initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node and disk occupancy. The situation is maybe not so dramatic for us because in less than 2 weeks we will have downtime till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron, and an 8GB heap as suggested by Alain - thanks). Questions: 1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below? [Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file and thus not needing much temporary extra disk space].
2) Should we restart with leveled compaction next year? [Note: Aaron was right, we have 1-week rows which get deleted after 1 month, which means older rows end up in big files, so to free up space with SizeTiered we will have no choice but to run major compactions, which we don't know will work given that we take on ~1TB / node / month. You can see we are at the limit!] 3) In case we keep SizeTiered: - How can we improve the performance of our major compactions? (we left all config parameters at default). Would increasing compaction throughput interfere with writes and reads? What about multi-threaded compaction? - Do we still need to run regular repair operations as well? Do these also do a major compaction
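Aaron's question about how long it would take to stream a node's data over can be estimated roughly. This is a back-of-envelope sketch; the 50% link-efficiency figure is an assumption, not a measurement, and it ignores compaction and repair overhead:

```python
def stream_hours(data_gb, link_gbps=1.0, efficiency=0.5):
    """Rough time to move `data_gb` gigabytes over a network link of
    `link_gbps` gigabits/s that achieves `efficiency` of its nominal
    rate. Pure transfer time only."""
    seconds = (data_gb * 8) / (link_gbps * efficiency)
    return seconds / 3600

# ~900 GB over 1 Gb/s at 50% efficiency -> ~4 hours of raw transfer,
# which is why 1TB+ per node makes node replacement a long operation.
```

The point of the exercise: the per-node data cap Aaron recommends is driven less by steady-state performance than by how long recovery of a dead or moving node takes.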
Re: Virtual Nodes, lots of physical nodes and potentially increasing outage count?
Good point. Hadoop sprays its blocks around randomly, thus if 'replication factor' many nodes are down, some blocks are not found. The larger the cluster, the higher the chance nodes are down. To deal with this, increase RF once the cluster gets to be very large. On Wednesday, December 5, 2012, Eric Parusel ericparu...@gmail.com wrote: Hi all, I've been wondering about virtual nodes and how cluster uptime might change as cluster size increases. I understand clusters will benefit from increased reliability due to faster rebuild time, but does that hold true for large clusters? It seems that since (and correct me if I'm wrong here) every physical node will likely share some small amount of data with every other node, that as the count of physical nodes in a Cassandra cluster increases (let's say into the triple digits) the probability of at least one failure to Quorum read/write occurring in a given time period would *increase*. Would this hold true, at least until the physical node count becomes greater than num_tokens per node? I understand that the window of failure for affected ranges would probably be small, but we do Quorum reads of many keys, so we'd likely hit every virtual range with our queries, even if num_tokens was 256. Thanks, Eric
Re: Virtual Nodes, lots of physical nodes and potentially increasing outage count?
Assuming you need to work with quorum in a non-vnode scenario: if 2 nodes in a row in the ring are down, some number of quorum operations will fail with UnavailableException (TimeoutException right after the failures). This is because for a given range of tokens quorum will be impossible, while quorum will still be possible for others. In a vnode world, if any two nodes are down, then the intersection of vnode token ranges they share is unavailable. I think it is two sides of the same coin. On Mon, Dec 10, 2012 at 7:41 AM, Richard Low r...@acunu.com wrote: Hi Tyler, You're right, the math does assume independence, which is unlikely to be accurate. But if you do have correlated failure modes e.g. same power, racks, DC, etc. then you can still use Cassandra's rack-aware or DC-aware features to ensure replicas are spread around so your cluster can survive the correlated failure mode. So I would expect vnodes to improve uptime in all scenarios, but haven't done the math to prove it. Richard.
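The failure-probability intuition in this thread can be sketched numerically. This is my own back-of-the-envelope toy model, not from the thread: it assumes independent node failures and that with vnodes any two simultaneously-down nodes share some token range, so quorum at RF=3 fails somewhere whenever 2 or more nodes are down at once:

```python
def p_some_quorum_range_unavailable(n_nodes, p_down):
    """Probability that, at some instant, at least 2 nodes are down.
    Under the vnode assumption above, that is roughly the probability
    that some token range cannot reach quorum at RF=3."""
    p_zero_down = (1 - p_down) ** n_nodes
    p_one_down = n_nodes * p_down * (1 - p_down) ** (n_nodes - 1)
    return 1 - (p_zero_down + p_one_down)

# Same per-node reliability, ten times the nodes: the chance that *some*
# quorum range is unavailable grows roughly a hundredfold.
small = p_some_quorum_range_unavailable(10, 0.001)
large = p_some_quorum_range_unavailable(100, 0.001)
assert large > 50 * small
```

This is the sense in which Eric's worry holds: each affected range's outage window is small, but the chance that at least one range is affected grows quickly with node count.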
Re: Why Secondary indexes is so slowly by my test?
Until the change making secondary indexes not do a read-before-write is in a release and stabilized, you should follow Ed Anuff's blog and do your indexing yourself with composites. On Thursday, December 13, 2012, aaron morton aa...@thelastpickle.com wrote: The IndexClause for get_indexed_slices takes a start key. You can page the results from your secondary index query by making multiple calls with a sane count and including a start key. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/12/2012, at 6:34 PM, Chengying Fang cyf...@ngnsoft.com wrote: You are right, Dean. It's due to the heavy result returned by the query, not the index itself. According to my test, if the result is less than 5000 rows, it's very quick. But how to limit the result? It seems a row limit is a good choice. But if I do so, some rows I wanted may be missed because the row order does not fulfill the query conditions. For example: CF User{I1,C1} with Index I1. Query conditions: I1=foo, order by C1. If I1=foo return 1 limit 100, I can't get the right result of C1. Also we can not always set the row range to fulfill the query conditions when doing a query. Maybe I should redesign the CF model to fix it. -- Original -- From: Hiller, Dean dean.hil...@nrel.gov; Date: Wed, Dec 12, 2012 10:51 PM To: user@cassandra.apache.org; Subject: Re: Why Secondary indexes is so slowly by my test? You could always try PlayOrm's query capability on top of cassandra ;)….it works for us. Dean From: Chengying Fang cyf...@ngnsoft.com Reply-To: user@cassandra.apache.org Date: Tuesday, December 11, 2012 8:22 PM To: user user@cassandra.apache.org Subject: Re: Why Secondary indexes is so slowly by my test? Thanks to Low. We use CompositeColumn to substitute it in single not-equality and definite equality queries.
And we will give up cassandra because of the weak query ability and instability. Many times we found our data in confusion without a definite cause in our cluster. For example, only two rows in one CF, row1-columnname1-columnvalue1, row2-columnname2-columnvalue2, but sometimes it becomes row1-columnname1-columnvalue2, row2-columnname2-columnvalue1. Notice the wrong column values. -- Original -- From: Richard Low r...@acunu.com; Date: Tue, Dec 11, 2012 07:44 PM To: user user@cassandra.apache.org; Subject: Re: Why Secondary indexes is so slowly by my test? Hi, Secondary index lookups are more complicated than normal queries so will be slower. Items have to first be queried in the index, then retrieved from their actual location. Also, inserting into indexed CFs will be slower (but will get substantially faster in 1.2 due
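Aaron's paging advice above (repeat the query with a sane count, using the last key seen as the new start key) can be sketched generically. This is a hypothetical client-side loop — `fetch_page` is a stand-in for whatever call your client makes (e.g. get_indexed_slices with a start key set on the IndexClause); it is not real Thrift API code:

```python
def paged_query(fetch_page, page_size=1000):
    """Key-based paging: keep asking for `page_size` rows starting at the
    last key seen. `fetch_page(start_key, count)` is a stand-in for a call
    like get_indexed_slices with a start key on the IndexClause."""
    start_key = ''
    while True:
        rows = fetch_page(start_key, page_size)
        # Every page after the first repeats the start key as its first row.
        new_rows = rows if not start_key else rows[1:]
        for row in new_rows:
            yield row
        if len(rows) < page_size:
            break  # short page: no more results
        start_key = rows[-1][0]

# Toy backend over a sorted list of (key, value) pairs, standing in for the index.
data = [(f"k{i:03d}", i) for i in range(10)]
def fake_fetch(start, count):
    return [r for r in data if r[0] >= start][:count]

result = list(paged_query(fake_fetch, page_size=4))
assert [k for k, _ in result] == [f"k{i:03d}" for i in range(10)]
```

Paging this way also bounds the result size per call, which addresses Chengying's "heavy result" problem without losing rows.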
Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra
It should be good stuff. Brian eats this stuff for lunch. On Wednesday, December 12, 2012, Brian O'Neill b...@alumni.brown.edu wrote: FWIW -- I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series: http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html I hope to make CQL part of the presentation and show how it integrates with the Java APIs. If you are interested, drop in. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Help on MMap of SSTables
This issue has to be looked at from a micro and a macro level. On the micro level the best approach is workload specific. On the macro level this mostly boils down to data and memory size. Compactions are going to churn the cache; this is unavoidable. Imho solid state makes the micro optimization meaningless in the big picture. Not that we should not consider tweaking flags, but just saying it is hard to believe anything like that is a game changer. On Monday, December 10, 2012, Rob Coli rc...@palominodb.com wrote: On Thu, Dec 6, 2012 at 7:36 PM, aaron morton aa...@thelastpickle.com wrote: So for memory mapped files, compaction can do a madvise SEQUENTIAL instead of the current DONTNEED flag after detecting appropriate OS versions. Will this help? AFAIK Compaction does use memory mapped file access. The history: https://issues.apache.org/jira/browse/CASSANDRA-1470 =Rob -- =Robert Coli AIM&GTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Why Secondary indexes is so slowly by my test?
Here is a good start: http://www.anuff.com/2011/02/indexing-in-cassandra.html On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Edward, can you share the link to this blog ? Alain 2012/12/13 Edward Capriolo edlinuxg...@gmail.com Ed Anuff's
Re: Read operations resulting in a write?
Is there a way to turn this on and off through configuration? I am not necessarily sure I would want this feature. Also it is confusing if these writes show up in JMX and look like user generated write operations. On Mon, Dec 17, 2012 at 10:01 AM, Mike mthero...@yahoo.com wrote: Thank you Aaron, this was very helpful. Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed? In my simple example below, the same row is updated and multiple selects of the same row will result in multiple writes to the memtable. It seems it may be possible (although unlikely) that, if you go from a write-mostly to a read-mostly scenario, you could get into a state where you are stuck rewriting to the same memtable, and the memtable is not flushed because it absorbs the over-writes. I can foresee this especially if you are reading the same rows repeatedly. I also noticed from the codepaths that if row caching is enabled, this optimization will not occur. We made some changes this weekend to make this column family more suitable to row-caching and enabled row-caching with a small cache. Our initial results are that it seems to have corrected the write counts, and it has increased performance quite a bit. However, are there any hidden gotchas there because this optimization is not occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a compaction is behind problem. Any history on that? I couldn't find too much information on it. Thanks, -Mike On 12/16/2012 8:41 PM, aaron morton wrote: 1) Am I reading things correctly? Yes. If you do a read/slice by name and more than min compaction threshold sstables were read, the data is re-written so that the next read uses fewer SSTables. 2) What is really happening here? Essentially minor compactions can occur between 4 and 32 memtable flushes.
Looking through the code, this seems to only affect a couple of types of select statements (selecting a specific column on a specific key being one of them). During the time between these two values, every select statement will perform a write. Yup, only for reading a row where the column names are specified. Remember minor compaction when using SizeTiered Compaction (the default) works on buckets of the same size. Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold sstables. Say it is in 3 SSTables in the 2nd tier and 2 sstables in the 1st, so it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up. But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st tier, it will not be hoisted. There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket, your reads will get handled by that one bucket. It's not every select, just those that touched more than min compaction threshold sstables. 3) Is this desired behavior? Is there something else I should be looking at that could be causing this behavior? Yes. https://issues.apache.org/jira/browse/CASSANDRA-2503 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/12/2012, at 12:58 PM, Michael Theroux mthero...@yahoo.com wrote: Hello, We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment. I also think I see where this issue is occurring in the code. We have a specific column family that is under heavy read and write load on a nightly basis. For the purposes of this description, I'll refer to this column family as Bob. During this nightly processing, sometimes Bob is under very heavy write load, other times it is under very heavy read load.
The application is such that when something is written to Bob, a write is made to one of two other tables. We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3-10. This is based on the WriteCount available on the column family JMX MBean. We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables are what we expect. I brought up a test node to experiment, and see a situation where, when a select statement is executed, a write will occur. In my test, I perform the following (switching between nodetool and cqlsh): update bob set 'about'='coworker' where key='hex key'; nodetool flush update bob set 'about'='coworker' where key='hex key'; nodetool flush update bob set 'about'='coworker' where key='hex key'; nodetool flush update bob set 'about'='coworker' where key='hex
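The hoisting behavior Aaron describes — a by-name read that touches more than the min compaction threshold's worth of sstables writes the merged row back, so the read shows up as a write in JMX — can be illustrated with a toy model. This is my own sketch; the names and mechanics are simplified stand-ins, not Cassandra internals:

```python
MIN_COMPACTION_THRESHOLD = 4  # default min sstables for a minor compaction

class ToyStore:
    """Crude model: a memtable dict plus a list of flushed sstable dicts."""
    def __init__(self):
        self.memtable = {}
        self.sstables = []

    def write(self, key, cols):
        self.memtable.setdefault(key, {}).update(cols)

    def flush(self):
        if self.memtable:
            self.sstables.append(self.memtable)
            self.memtable = {}

    def read_by_name(self, key):
        touched = [t for t in self.sstables if key in t]
        merged = {}
        for t in touched:
            merged.update(t[key])
        merged.update(self.memtable.get(key, {}))
        if len(touched) > MIN_COMPACTION_THRESHOLD:
            # Hoist: the read itself performs a write of the merged row,
            # so later reads can be served from fewer places.
            self.write(key, merged)
        return merged, len(touched)

store = ToyStore()
for i in range(5):                 # fragment row 'bob' across 5 sstables
    store.write('bob', {'c%d' % i: i})
    store.flush()

row, touched = store.read_by_name('bob')
assert touched == 5                # the read had to merge 5 sstables...
assert 'bob' in store.memtable     # ...so it counted as a write (hoisted)
```

This matches Michael's observation: the write count on "Bob" climbs under read load because fragmented rows get rewritten on read.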
Re: rpc_timeout exception while inserting
CQL2 and CQL3 indexes are not compatible. I guess CQL2 is able to detect that the table was defined in CQL3 and probably should not allow it. Backwards compatibility is something the storage engines and interfaces have to account for; at least they should prevent you from hurting yourself. But do not try to defeat the system: just stick with one CQL version. On Tue, Dec 18, 2012 at 7:37 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: I was trying to mix CQL2 and CQL3 to check whether a columnfamily with compound keys can be further indexed, because using CQL3 secondary indexing on a table with a composite PRIMARY KEY is not possible. And surprisingly, by mixing the CQL versions I was able to do so. But when I want to insert anything in the column family it gives me a rpc_timeout exception. I personally found it quite abnormal, so thought of posting this thing in the forum. Best, On Mon, Dec 10, 2012 at 6:29 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Mon, Dec 10, 2012 at 12:36 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Hi All, I have a column family whose structure is CREATE TABLE practice ( id text, name text, addr text, pin text, PRIMARY KEY (id, name) ) WITH comment='' AND caching='KEYS_ONLY' AND read_repair_chance=0.10 AND gc_grace_seconds=864000 AND replicate_on_write='true' AND compaction_strategy_class='SizeTieredCompactionStrategy' AND compression_parameters:sstable_compression='SnappyCompressor'; CREATE INDEX idx_address ON practice (addr); Initially I have made the column family using CQL 3.0.0. Then for creating the index I have used CQL 2.0. Now when I want to insert any data in the column family it always shows a timeout exception. INSERT INTO practice (id, name, addr, pin) VALUES ( '1','AB','kolkata','700052'); Request did not complete within rpc_timeout. Please suggest me where I am getting wrong? That would be creating the index through CQL 2. Why did you use CQL 3 for the CF creation and CQL 2 for the index one?
If you do both in CQL 3, that should work as expected. That being said, you should probably not get timeouts (that won't do what you want though). If you look at the server log, do you have an exception there? -- Sylvain -- Abhijit Chanda Analyst VeHere Interactive Pvt. Ltd. +91-974395
Re: Monitoring the number of client connections
In the TCP MIB for SNMP (Simple Network Management Protocol) this information is available: http://www.simpleweb.org/ietf/mibs/mibSynHiLite.php?category=IETF&module=TCP-MIB On Wed, Dec 19, 2012 at 12:22 AM, Michael Kjellman mkjell...@barracuda.com wrote: netstat + cron is your friend at this point in time On Dec 18, 2012, at 8:25 PM, aaron morton aa...@thelastpickle.com wrote: AFAIK the count of connections is not exposed. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/12/2012, at 10:37 PM, Tomas Nunez tomas.nu...@groupalia.com wrote: Hi! I want to know how many client connections each one of my cluster nodes has (to check if my load balancing is spreading in a balanced way, to check if an increase in the cluster load can be related to an increase in the number of connections, and things like that). I was thinking about going with netstat, counting ESTABLISHED connections to port 9160, but then I thought maybe there is some way in cassandra to get that information (maybe a counter of connections in JMX?). I've tried installing MX4J and going over all MBeans, but I haven't found any with a promising name; they all seem unrelated to this information. And I can't find anything skimming the manual, so... Can you think of a better way than netstat to get this information? Better yet, is there anything similar to SHOW PROCESSLIST in mysql? Thanks! -- www.groupalia.com http://es.groupalia.com/ Tomàs Núñez IT-Sysprod Tel. + 34 93 159 31 00 Fax. + 34 93 396 18 52 Llull, 95-97, 2º planta, 08005 Barcelona Skype: tomas.nunez.groupalia tomas.nu...@groupalia.com Twitter http://twitter.com/#%21/groupaliaes Facebook https://www.facebook.com/GroupaliaEspana Linkedin http://www.linkedin.com/company/groupalia -- Join Barracuda Networks in the fight against hunger.
To learn how you can help in your community, please visit: http://on.fb.me/UAdL4f
Re: thrift client can't add a column back after it was deleted with cassandra-cli?
The cli uses microsecond precision; your client might be using something else, and inserts with lower timestamps are dropped. On Friday, December 21, 2012, Qiaobing Xie qiaobing@gmail.com wrote: Hi, I am developing a thrift client that inserts and removes columns from a column-family (using batch_mutate calls). Everything seems to be working fine - my thrift client can add/retrieve/delete/add back columns as expected... until I manually deleted a column with cassandra-cli. (I was trying to test an error scenario in which my client would discover a missing column and recreate it in the column-family). After I deleted a column from within cassandra-cli manually, my thrift client detected the column of that name missing when it tried to get it. So it tried to recreate a new column with that name along with a bunch of other columns with a batch_mutate call. The call returned normally and the other columns were added/updated, but the one that I manually deleted from cassandra-cli was not added/created in the column family. I tried to restart my client and cassandra-cli but it didn't help. It just seemed that my thrift client could no longer add a column with that name! Finally I destroyed and recreated the whole column-family and the problem went away. Any idea what I did wrong? -Qiaobing
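Edward's point explains the mystery: cassandra-cli stamps its delete in microseconds, so a client supplying timestamps in a coarser unit (milliseconds or seconds) will always lose to the tombstone, and its re-inserts are silently dropped. A sketch of the arithmetic, assuming the client picked the wrong unit:

```python
import time

def ts_seconds():
    return int(time.time())              # seconds since epoch, ~1e9 range
def ts_millis():
    return int(time.time() * 1_000)      # ~1e12 range
def ts_micros():
    return int(time.time() * 1_000_000)  # ~1e15 range -- what cassandra-cli uses

def resolve(write_ts, tombstone_ts):
    """A cell survives only if its timestamp beats the tombstone's."""
    return 'kept' if write_ts > tombstone_ts else 'dropped'

tombstone = ts_micros()  # delete issued from cassandra-cli

# Inserts stamped in seconds or milliseconds lose to the tombstone forever,
# no matter how much later (in wall-clock time) they actually happen.
assert resolve(ts_seconds(), tombstone) == 'dropped'
assert resolve(ts_millis(), tombstone) == 'dropped'
# Only a microsecond-stamped insert issued after the delete wins.
assert resolve(ts_micros() + 1, tombstone) == 'kept'
```

That is exactly the symptom Qiaobing saw: the column could "never" be re-added until the whole column-family (and its tombstone) was destroyed.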
Re: Correct way to design a cassandra database
You could store the order as the first part of a composite string: say the first picture is A and the second is B. To insert one between them, call it AA. If you shuffle a lot the strings could get really long. It might be better to store the order in a separate column. Neither solution mentioned deals with concurrent access well. On Friday, December 21, 2012, Adam Venturella aventure...@gmail.com wrote: One more link that might be helpful. It's a similar system to photos but instead of Photos/Albums it's Songs/Playlists: http://www.datastax.com/dev/blog/cql3-for-cassandra-experts. It's not exactly 1:1 but it covers related concepts in making it work. On Fri, Dec 21, 2012 at 8:02 AM, Adam Venturella aventure...@gmail.com wrote: Ok.. So here is my latest thinking... Including that index: CREATE TABLE Users ( user_name text, password text, PRIMARY KEY (user_name) ); ^ Same as before CREATE TABLE Photos ( user_name text, photo_id uuid, created_time timestamp, data text, PRIMARY KEY (user_name, photo_id, created_time) ) WITH CLUSTERING ORDER BY (created_time DESC); ^ Note the addition of a photo id and using that in the PK def with the created_time. Data is JSON like this: { thumbnail: url, standard_resolution: url } CREATE TABLE PhotosAlbums ( user_name text, album_name text, poster_image_url text, data text, PRIMARY KEY (user_name, album_name) ); ^ Same as before, data represents a JSON array of the photos: [{photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}] CREATE TABLE PhotosAlbumsIndex ( user_name text, photo_id uuid, album_name text, created_time timestamp, PRIMARY KEY (user_name, photo_id, album_name) ); The created_time column here is because you need to have at least 1 column that is not part of the PK. Or that's what it looks like in my quick test.
^ Each photo added to an album needs to be added to this index row. As before, your application will need to keep the order of the array intact as your users modify the order of things. Now however if they delete a photo you need to fetch the PhotoAlbums the photo existed in and update them accordingly: SELECT * FROM PhotosAlbumsIndex WHERE user_name='the_user' AND photo_id=uuid This should return to you all of the albums that the photo was a part of. Now you need to: SELECT * FROM PhotosAlbums where user_name = the_user and album_name IN
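Edward's ordering-string idea from this thread (first photo A, second B, insert between them as AA) generalizes to computing a key that sorts between any two existing keys. A hypothetical sketch of one way to do it, treating keys as base-26 fractions; it assumes a key strictly between lo and hi exists, which holds as long as keys never end in 'A' (and the keys this function generates never do):

```python
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def between(lo, hi):
    """Return a key sorting strictly between lo and hi (requires lo < hi).
    Keys grow by at most one character per call, which is the "strings
    could get really long" drawback mentioned above."""
    out = []
    i = 0
    while True:
        a = ALPHABET.index(lo[i]) if i < len(lo) else 0   # digit 0 past the end
        b = ALPHABET.index(hi[i]) if i < len(hi) else 26  # one past 'Z'
        if b - a > 1:
            out.append(ALPHABET[(a + b) // 2])  # room for a middle digit
            return ''.join(out)
        out.append(ALPHABET[a])  # copy the shared digit and go one level deeper
        i += 1

first, second = 'A', 'B'
mid = between(first, second)
assert first < mid < second
# Repeated inserts before `mid` keep producing valid, slowly-growing keys.
k = mid
for _ in range(10):
    k = between(first, k)
    assert first < k < mid
```

As Edward notes, a separate order column avoids the key growth, and neither approach handles concurrent reorders well without some coordination.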
Re: State of Cassandra and Java 7
Which versions are supported is kinda up to you; for example, earlier versions of the JDK now have bugs. I have a version of java, 1.6.0_23 I believe, that will not even start with the latest cassandra releases. Likewise people suggest not running the newest ones (1.7.0) because they have not tested them. So there is not a definitive version that is the best. If you're having problems and your version is older, someone will say upgrade; if your newest version is not working, someone will say downgrade. No one trusts a just-released version. Generally this means try to keep a few months behind the curve. As with most things in c* you can run different versions on different nodes; you are not forced into an all-at-once upgrade. On Sun, Dec 23, 2012 at 4:37 AM, Fabrice Facorat fabrice.faco...@gmail.com wrote: At Orange portails we are presently testing Cassandra 1.2.0 beta/rc with Java 7, and presently we have no issues 2012/12/22 Brian Tarbox tar...@cabotresearch.com: What I saw in all cases was a) set JAVA_HOME to java7, run program fail b) set JAVA_HOME to java6, run program success I should have better notes but I'm at a 6 person startup so working tools get used and failing tools get deleted. Brian On Fri, Dec 21, 2012 at 3:54 PM, Bryan Talbot btal...@aeriagames.com wrote: Brian, did any of your issues with java 7 result in corrupting data in cassandra? We just ran into an issue after upgrading a test cluster from Cassandra 1.1.5 and Oracle JDK 1.6.0_29-b11 to Cassandra 1.1.7 and 7u10. What we saw is values in columns with validation Class=org.apache.cassandra.db.marshal.LongType that were proper integers becoming corrupted so that they became stored as strings. I don't have a reproducible test case yet but will work on making one over the holiday if I can.
For example, a column with a long type that was originally written and stored properly (say with value 1200) was somehow changed during cassandra operations (compaction seems the only possibility) to be the value '1200' with quotes. The data was written using the phpcassa library and that application and library haven't been changed. This has only happened on our test cluster which was upgraded and hasn't happened on our live cluster which was not upgraded. Many of our column families were affected and all affected columns are Long (or bigint for cql3). Errors when reading using CQL3 command client look like this: Failed to decode value '1356441225' (for column 'expires') as bigint: unpack requires a string argument of length 8 and when reading with cassandra-cli the error is [default@cf] get token['fbc1e9f7cc2c0c2fa186138ed28e5f691613409c0bcff648c651ab1f79f9600b']; = (column=client_id, value=8ec4c29de726ad4db3f89a44cb07909c04f90932d, timestamp=1355836425784329, ttl=648000) A long is exactly 8 bytes: 10 -Bryan On Mon, Dec 17, 2012 at 7:33 AM, Brian Tarbox tar...@cabotresearch.com wrote: I was using jre-7u9-linux-x64 which was the latest at the time. I'll confess that I did not file any bugs...at the time the advice from both the Cassandra and Zookeeper lists was to stay away from Java 7 (and my boss had had enough of my reporting that the problem was Java 7 for me to spend a lot more time getting the details). Brian On Sun, Dec 16, 2012 at 4:54 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Sat, Dec 15, 2012 at 7:12 PM, Michael Kjellman mkjell...@barracuda.com wrote: What issues have you ran into? Actually curious because we push 1.1.5-7 really hard and have no issues whatsoever. A related question is which which version of java 7 did you try? The first releases of java 7 were apparently famous for having many issues but it seems the more recent updates are much more stable. 
-- Sylvain On Dec 15, 2012, at 7:51 AM, Brian Tarbox tar...@cabotresearch.com wrote: We've reverted all machines back to Java 6 after running into numerous Java 7 issues...some running Cassandra, some running Zookeeper, others just general problems. I don't recall any other major language release being such a mess. On Fri, Dec 14, 2012 at 5:07 PM, Bill de hÓra b...@dehora.net wrote: At least that would be one way of defining officially supported. Not quite, because, Datastax is not Apache Cassandra. the only issue related to Java 7 that I know of is CASSANDRA-4958, but that's osx specific (I wouldn't advise using osx in production anyway) and it's not directly related to Cassandra anyway so you can easily use the beta version of snappy-java as a workaround if you want to. So that non blocking issue aside, and as far as we know, Cassandra supports Java 7. Is it rock-solid in production? Well, only repeated use in production can tell, and that's not really in the hand of the project. Exactly right. If enough people use Cassandra on
Re: how to create a keyspace in CQL3
Unfortunately one of the first commands everyone needs to work with cassandra changes very often. You can use cqlsh: help create_keyspace; But sometimes even the documentation is not in line. Using this permutation of goodness: cqlsh 2.3.0 | Cassandra 1.2.0-beta2-SNAPSHOT | CQL spec 3.0.0 | Thrift protocol 19.35.0 the syntax is as follows: cqlsh> create keyspace a with replication = {'class':'SimpleStrategy', 'replication_factor':3}; On Sun, Dec 23, 2012 at 10:15 AM, Manu Zhang owenzhang1...@gmail.com wrote: I'm wondering why the following command to create a keyspace in CQL3 fails. It is the same as the sample in the doc http://cassandra.apache.org/doc/cql3/CQL.html CREATE KEYSPACE demodb WITH strategy_class = SimpleStrategy AND strategy_options:replication_factor = 1; I'm using Cassandra 1.2-beta2
Re: Force data to a specific node
There is a crazy, very bad, don't-do-it way to do this. You can set RF=1 and hack the LocalPartitioner (because the LocalPartitioner has been made not to do this). Then the node you connect to and write to is the node the data will get stored on. It's like memcache-style do-it-yourself sharding. Did I say it's not suggested? If not: not suggested. On Wed, Jan 2, 2013 at 2:54 PM, Aaron Turner synfina...@gmail.com wrote: You'd have to use the ordered partitioner or something like that and choose your row key according to the node you want it placed on. But that's in general a really bad idea because you end up with unbalanced nodes and hot spots. That said, are your nodes on a LAN? I have my 9+3 node cluster (two datacenters) on 100Mbps ports (which everyone says not to do) and it's working just fine. Even node rebuilds haven't been that bad so far. If you're trying to avoid WAN replication, then use a dedicated cluster. On Wed, Jan 2, 2013 at 10:20 AM, Everton Lima peitin.inu...@gmail.com wrote: We need to do this to minimize the network I/O. We have our own load data balance algorithm. We have some data that is best to process on a local machine. Is it possible? How? 2013/1/2 Edward Sargisson edward.sargis...@globalrelay.net Why would you want to? From: Everton Lima peitin.inu...@gmail.com To: Cassandra-User user@cassandra.apache.org Sent: Wed Jan 02 18:03:49 2013 Subject: Force data to a specific node Is it possible to force data to stay on a specific node? -- Everton Lima Aleixo Bacharel em Ciência da Computação pela UFG Mestrando em Ciência da Computação pela UFG Programador no LUPA -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin carpe diem quam minimum credula postero
Re: RandomPartitioner to Murmur3Partitioner
By the way, 10% faster does not necessarily mean 10% more requests. https://issues.apache.org/jira/browse/CASSANDRA-2975 https://issues.apache.org/jira/browse/CASSANDRA-3772 Also if you follow the tickets: My tests show that Murmur3Partitioner actually is worse than MD5 with high cardinality indexes; here is what I did (kernel 3.0.0-19, 2.2Ghz quad-core Opteron, 2GB RAM): For each test: wiped all of the data directories and re-compiled with 'clean', ran stress with -c 50 -C 500 -S 512 -n 5 (where -c is number of columns, -C values cardinality and -S is value size in bytes) 4 times (to make it hot). RandomPartitioner: average op rate is 845. Murmur3Partitioner: average op rate is 721. Then later: I have removed the ThreadLocal declaration from the M3P (and cleaned whitespace errors) which was the bottleneck; after re-running tests with that modification M3P beats RP 903 to 847. 847/903 = 0.937984496. I think that is 6 or 7%, right? Not 10%. And other things in cassandra are orders of magnitude slower than computing hashes: network, disk IO. Also, is this test only testing when using 2ndary indexes? What about people who do not care about 2ndary indexes? I am sure it is faster and better, but I am not going to lose sleep until I rebuild all my clusters just to change the partitioner. So for new clusters I will probably use the default, but I'm not going to upgrade existing ones. Let them stay RP. Edward On Thu, Jan 3, 2013 at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hello, I have read the following from the changes.txt file: The default partitioner for new clusters is Murmur3Partitioner, which is about 10% faster for index-intensive workloads. Partitioners cannot be changed once data is in the cluster, however, so if you are switching to the 1.2 cassandra.yaml, you should change this to RandomPartitioner or whatever your old partitioner was. Does this mean that there is absolutely no way to switch to the new partitioner for people that are already using Cassandra? Alain
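For what it's worth, Edward's percentage math above checks out; this is just the quoted stress numbers (847 vs 903 ops from the ticket, after the ThreadLocal fix) spelled out:

```python
# Patched Murmur3Partitioner vs RandomPartitioner stress results quoted above.
rp_ops, m3p_ops = 847, 903

shortfall = 1 - rp_ops / m3p_ops   # how far RP falls short of M3P
speedup = m3p_ops / rp_ops - 1     # how much faster M3P is than RP

assert round(shortfall * 100) == 6   # ~6.2%, matching the "6 or 7%" estimate
assert round(speedup * 100) == 7     # ~6.6% faster, i.e. under the quoted 10%
```

Either way of slicing the ratio lands in the 6-7% range rather than 10%, which is Edward's point.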
Re: Error after 1.2.0 upgrade
Just a shot in the dark, but I would try setting -Xss higher than the default. It's probably like 180, but I can't even start at that level; I bumped it up to 256 for JDK 7. On Thu, Jan 3, 2013 at 12:02 PM, Michael Kjellman mkjell...@barracuda.com wrote: :) yes, I'm crazy. The assertion appears to be compiled code which is why I was guessing JNA. Biggest issue right now is that upgraded 1.2.0 nodes only see other 1.2.0 nodes in the ring. 1.1.7 nodes don't see the 1.2.0 nodes.. Upgrading every node to 1.2.0 now lists all nodes in the ring... On Jan 3, 2013, at 8:57 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Wow, so you're going live with 1.2.0, good luck with that. When it's done, would you mind letting me know if everything went fine or if you have some advice or feedback? This looks related to JNA? Does it? The only thing logged about JNA is the following: JNA mlockall successful. What does this line *** java.lang.instrument ASSERTION FAILED ***: !errorOutstanding with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line: 806 mean? 2013/1/3 Michael Kjellman mkjell...@barracuda.com I'm having huge upgrade issues from 1.1.7 - 1.2.0 atm, but in a 12 node cluster which I am slowly massaging into a good state I haven't seen this in 15+ hours of operation… This looks related to JNA? From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 8:42 AM To: user@cassandra.apache.org Subject: Error after 1.2.0 upgrade In a dev env, C* 1.1.7 - 1.2.0, 1 node. I run Cassandra in an 8GB memory environment.
The upgrade went well, but I sometimes have the following error: INFO 17:31:04,143 Node /192.168.100.201 state jump to normal INFO 17:31:04,149 Enqueuing flush of Memtable-local@1654799672(32/32 serialized/live bytes, 2 ops) INFO 17:31:04,149 Writing Memtable-local@1654799672(32/32 serialized/live bytes, 2 ops) INFO 17:31:04,371 Completed flushing /home/stockage/cassandra/data/system/local/system-local-ia-12-Data.db (91 bytes) for commitlog position ReplayPosition(segmentId=1357230649515, position=49584) INFO 17:31:04,376 Startup completed! Now serving reads. INFO 17:31:04,798 Compacted to [/var/lib/cassandra/data/system/local/system-local-ia-13-Data.db,]. 950 to 471 (~49% of original) bytes for 1 keys at 0,000507MB/s. Time: 886ms. INFO 17:31:04,889 mx4j successfuly loaded HttpAdaptor version 3.0.2 started on port 8081 INFO 17:31:04,967 Not starting native transport as requested. Use JMX (StorageService-startNativeTransport()) to start it INFO 17:31:04,980 Binding thrift service to /0.0.0.0:9160 INFO 17:31:05,007 Using TFramedTransport with a max frame size of 15728640 bytes. INFO 17:31:09,964 Using synchronous/threadpool thrift server on 0.0.0.0 : 9160 INFO 17:31:09,965 Listening for thrift clients... 
*** java.lang.instrument ASSERTION FAILED ***: !errorOutstanding with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line: 806 ERROR 17:33:56,002 Exception in thread Thread[Thrift:1702,5,main] java.lang.StackOverflowError at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(Unknown Source) at java.io.BufferedInputStream.fill(Unknown Source) at java.io.BufferedInputStream.read1(Unknown Source) at java.io.BufferedInputStream.read(Unknown Source) at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)
Re: Error after 1.2.0 upgrade
The only true drain is: 1) turn on iptables to stop all incoming traffic 2) flush 3) wait 4) delete files 5) upgrade 6) restart. On Thu, Jan 3, 2013 at 2:59 PM, Michael Kjellman mkjell...@barracuda.com wrote: That's why I didn't create a ticket, as I knew there was one. But I thought this had been fixed in 1.1.7? From: Edward Capriolo edlinuxg...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 11:57 AM To: user@cassandra.apache.org Subject: Re: Error after 1.2.0 upgrade There is a bug on this; drain has been in a weird state for a long time. In 1.0 it did not work and was labeled as a known limitation. https://issues.apache.org/jira/browse/CASSANDRA-4446 On Thu, Jan 3, 2013 at 2:49 PM, Michael Kjellman mkjell...@barracuda.com wrote: Another thing: for those that use counters this might be a problem. I always do a nodetool drain before upgrading a node (as is good practice, btw). However, in every case on every one of my nodes, the commit log was replayed on each node and mutations were created. Could lead to double counting of counters… No bug for that yet. Best, Michael From: Michael Kjellman mkjell...@barracuda.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 11:42 AM To: user@cassandra.apache.org Subject: Re: Error after 1.2.0 upgrade Tracking issues: https://issues.apache.org/jira/browse/CASSANDRA-5101 https://issues.apache.org/jira/browse/CASSANDRA-5104 (which was created because of https://issues.apache.org/jira/browse/CASSANDRA-5103) https://issues.apache.org/jira/browse/CASSANDRA-5102 Also a friendly reminder to all that CQL2-created indexes will not work with CQL3. You need to drop them and recreate them in CQL3; otherwise you'll see rpc_timeout issues. I'll update with more issues as I see them.
The fun bugs never happen in your dev environment, do they :) From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 11:38 AM To: user@cassandra.apache.org Subject: Re: Error after 1.2.0 upgrade Michael, could you share some of your problems? May be of help for others. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/01/2013, at 5:45 AM, Michael Kjellman mkjell...@barracuda.com wrote: I'm having huge upgrade issues from 1.1.7 - 1.2.0 atm, but in a 12 node cluster which I am slowly massaging into a good state I haven't seen this in 15+ hours of operation… This looks related to JNA? From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 8:42 AM To: user@cassandra.apache.org Subject: Error after 1.2.0 upgrade In a dev env, C* 1.1.7 - 1.2.0, 1 node. I run Cassandra in an 8GB memory environment. The upgrade went well, but I sometimes have the error quoted above.
Re: Specifying initial token in 1.2 fails
Yes. They were really just introduced, and if you are ready to hitch your wagon to every new feature you put yourself at considerable risk. That is true with any piece of software, not just Cassandra. On Fri, Jan 4, 2013 at 11:59 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: But I don't really get the point of starting a new cluster without vnodes... Is there some disadvantage to using vnodes? Alain 2013/1/4 Nick Bailey n...@datastax.com If you are planning on using murmur3 without vnodes (specifying your own tokens) there is a quick python script in the datastax docs you can use to generate balanced tokens. http://www.datastax.com/docs/1.2/initialize/token_generation#calculating-tokens-for-the-murmur3partitioner On Fri, Jan 4, 2013 at 10:53 AM, Michael Kjellman mkjell...@barracuda.com wrote: To be honest I haven't run a cluster with Murmur3. You can still use indexing with RandomPartitioner (all us old folk are stuck on Random, btw). And there was a thread floating around yesterday where Edward did some benchmarks and found that Murmur3 was actually slower than RandomPartitioner. http://www.mail-archive.com/user@cassandra.apache.org/msg26789.html http://permalink.gmane.org/gmane.comp.db.cassandra.user/30182 I do know that with vnodes token allocation is now 100% dynamic, so no need to manually assign tokens to nodes anymore. Best, Michael From: Dwight Smith dwight.sm...@genesyslab.com Reply-To: user@cassandra.apache.org Date: Friday, January 4, 2013 8:48 AM To: 'user@cassandra.apache.org' Subject: RE: Specifying initial token in 1.2 fails Michael Yes indeed – my mistake. Thanks. I can specify RandomPartitioner, since I do not use indexing – yet. Just for informational purposes – with Murmur3 – to achieve a balanced cluster – is the initial token method supported? If so, how should these be generated? The token-generator seems to only apply to RandomPartitioner.
Thanks again From: Michael Kjellman [mailto:mkjell...@barracuda.com] Sent: Friday, January 04, 2013 8:39 AM To: user@cassandra.apache.org Subject: Re: Specifying initial token in 1.2 fails Murmur3 != MD5 (RandomPartitioner) From: Dwight Smith dwight.sm...@genesyslab.com Reply-To: user@cassandra.apache.org Date: Friday, January 4, 2013 8:36 AM To: 'user@cassandra.apache.org' Subject: Specifying initial token in 1.2 fails Hi Just started evaluating 1.2 – starting a clean Cassandra node – the usual practice is to specify the initial token – but when I attempt to start the node the following is observed:

INFO [main] 2013-01-03 14:08:57,774 DatabaseDescriptor.java (line 203) disk_failure_policy is stop
DEBUG [main] 2013-01-03 14:08:57,774 DatabaseDescriptor.java (line 205) page_cache_hinting is false
INFO [main] 2013-01-03 14:08:57,774 DatabaseDescriptor.java (line 266) Global memtable threshold is enabled at 339MB
DEBUG [main] 2013-01-03 14:08:58,008 DatabaseDescriptor.java (line 381) setting auto_bootstrap to true
ERROR [main] 2013-01-03 14:08:58,024 DatabaseDescriptor.java (line 495) Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: For input string: 85070591730234615865843651857942052863
    at org.apache.cassandra.dht.Murmur3Partitioner$1.validate(Murmur3Partitioner.java:180)
    at org.apache.cassandra.config.DatabaseDescriptor.loadYaml(DatabaseDescriptor.java:433)
    at org.apache.cassandra.config.DatabaseDescriptor.clinit(DatabaseDescriptor.java:121)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:178)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:397)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:440)

This looks like a bug. Thanks
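Nick's pointer to the token-generation script boils down to simple arithmetic. A minimal sketch of the formulas (this is not the DataStax script itself, just the math behind it):

```python
# Murmur3Partitioner tokens live in [-2**63, 2**63 - 1], so evenly spaced
# initial tokens for a cluster of node_count nodes are:
def murmur3_tokens(node_count):
    return [(2**64 // node_count) * i - 2**63 for i in range(node_count)]

# RandomPartitioner (MD5) tokens live in [0, 2**127), hence a different formula:
def random_tokens(node_count):
    return [(2**127 // node_count) * i for i in range(node_count)]

# The failing token from Dwight's log is a RandomPartitioner-range value,
# far outside Murmur3's range, which is why validation rejects it.
bad_token = 85070591730234615865843651857942052863
print(bad_token > 2**63 - 1)  # True
print(murmur3_tokens(3))
```

Pasting a token generated for one partitioner into a cluster configured with the other fails exactly as shown in the log above.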
Re: help tuning compaction..hours of run to get 0% compaction....
There is some point where you simply need more machines. On Mon, Jan 7, 2013 at 5:02 PM, Michael Kjellman mkjell...@barracuda.com wrote: Right, I guess I'm saying that you should try loading your data with leveled compaction and see how your compaction load is. Your workload sounds like leveled will fit much better than size tiered. From: Brian Tarbox tar...@cabotresearch.com Reply-To: user@cassandra.apache.org Date: Monday, January 7, 2013 1:58 PM To: user@cassandra.apache.org Subject: Re: help tuning compaction..hours of run to get 0% compaction The problem I see is that it already takes me more than 24 hours just to load my data... during which time the logs say I'm spending tons of time doing compaction. For example, in the last 72 hours I've consumed *20 hours* per machine on compaction. Can I conclude from that that I should be (perhaps drastically) increasing my compaction_throughput_mb_per_sec on the theory that I'm getting behind? The fact that it takes me 3 days or more to run a test means it's hard to just play with values and see what works best, so I'm trying to understand the behavior in detail. Thanks. Brian On Mon, Jan 7, 2013 at 4:13 PM, Michael Kjellman mkjell...@barracuda.com wrote: http://www.datastax.com/dev/blog/when-to-use-leveled-compaction If you perform at least twice as many reads as you do writes, leveled compaction may actually save you disk I/O, despite consuming more I/O for compaction. This is especially true if your reads are fairly random and don't focus on a single, hot dataset. From: Brian Tarbox tar...@cabotresearch.com Reply-To: user@cassandra.apache.org Date: Monday, January 7, 2013 12:56 PM To: user@cassandra.apache.org Subject: Re: help tuning compaction..hours of run to get 0% compaction I have not specified leveled compaction so I guess I'm defaulting to size tiered?
My data (in the column family causing the trouble) is insert-once, read-many, update-never. Brian On Mon, Jan 7, 2013 at 3:13 PM, Michael Kjellman mkjell...@barracuda.com wrote: Size tiered or leveled compaction? From: Brian Tarbox tar...@cabotresearch.com Reply-To: user@cassandra.apache.org Date: Monday, January 7, 2013 12:03 PM To: user@cassandra.apache.org Subject: help tuning compaction..hours of run to get 0% compaction I have a column family where I'm doing 500 inserts/sec for 12 hours or so at a time. At some point my performance falls off a cliff due to time spent doing compactions. I'm seeing row after row of logs saying that after 1 or 2 hours of compacting it reduced to 100% or 99% of the original. I'm trying to understand what direction this data points me to in terms of configuration change. a) increase my compaction_throughput_mb_per_sec because I'm falling behind (am I falling behind?) b) enable multi-threaded compaction? Any help is appreciated. Brian
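A quick sanity check on the numbers in this thread (the rates and durations are quoted from the messages above; the rest is arithmetic):

```python
# Rough arithmetic on the figures quoted in this thread.
inserts_per_sec = 500
run_hours = 12
rows_per_run = inserts_per_sec * 3600 * run_hours
print(rows_per_run)  # 21,600,000 inserts per 12-hour run

# "20 hours per machine on compaction in the last 72 hours":
compaction_fraction = 20 / 72
print(round(compaction_fraction, 2))  # roughly 28% of wall time spent compacting
```

Spending over a quarter of wall-clock time on compaction during a pure-insert load is one concrete sign of falling behind, which is what makes the throughput and strategy questions above worth asking.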
Re: about validity of recipe A node join using external data copy methods
Basically this recipe is from the old days when we had anti-compaction. Now streaming is very efficient, rarely fails, and there is no need to do it this way anymore. This recipe will be abolished from the second edition. It still likely works, except when using counters. Edward On Tue, Jan 8, 2013 at 7:27 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Hi, Edward Capriolo described in his Cassandra book a faster way [1] to start new nodes if the cluster size doubles, from N to 2*N. It's about splitting each token range into 2 parts, each taken in charge, after the split, by 2 nodes: the existing one and a new one. And for starting a new node, one needs to: - copy the data records from the corresponding node (without the system records) - start the new node with auto_bootstrap: false This raises 2 questions: A) is this recipe still valid with v1.1 and v1.2? B) do we still need to start the new node with auto_bootstrap: false? My guess is yes, as the happening of the bootstrap phase is not recorded into the data records. Thanks. Dominique [1] see recipe A node join using external data copy methods, page 165
Re: about validity of recipe A node join using external data copy methods
It has been true since about 0.8. In the old days ANTI-COMPACTION stunk and many weird errors would cause node joins to have to be retried N times. Now node moves/joins seem to work near 100% of the time (in 1.0.7), and they are also very fast and efficient. If you want to move a node to new hardware you can do it with rsync, but I would not use the technique for growing the cluster. It is error prone, and ends up being more work. On Tue, Jan 8, 2013 at 10:57 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Now streaming is very efficient rarely fails and there is no need to do it this way anymore I guess it's true in v1.2. Is it true also in v1.1? Thanks. Dominique
Re: Wide rows in CQL 3
I ask myself this every day. CQL3 is a new way to do things, including wide rows with collections. There is no upgrade path. You adopt CQL3's sparse tables as soon as you start creating column families from CQL. There is not much backwards compatibility. CQL3 can query compact tables, but you may have to remove the metadata from them so they can be transposed. Thrift can not write into CQL tables easily, because of how the primary keys and column names are encoded into the key column, and compact metadata is not equal to CQL3's metadata. http://www.datastax.com/dev/blog/thrift-to-cql3 For a large swath of problems I like how CQL3 deals with them. For example, you do not really need CQL3 to store a collection in a column family alongside other data. You can use wide rows for this, but the integrated solution with CQL3 metadata is interesting. My biggest beefs are:
1) column names are UTF8 (seems wasteful in most cases)
2) the sparse empty row marker for ghosts (seems like tiny rows with one column have much overhead now)
3) using composites (with compound primary keys in some table designs) is wasteful. Composite adds two unsigned bytes for size and one unsigned byte as 0 per part.
4) many lines of code between user/request and actual disk (tracing a CQL select vs. a slice, young gen, etc.)
5) not sure if collections can be used in REALLY wide row scenarios, aka a 1,000,000 entry set?
I feel that in an effort to be newbie-friendly, sparse+CQL is presented as the best default option. However, the 5 above items are not minor, and in several use cases they could make CQL's sparse tables a bad choice for certain applications. Those users would get better performance from compact storage. I feel that message sometimes gets washed away in all the CQL coolness. What is that you say? This is not actually the most efficient way to store this data? Well who cares, I can do an IN CLAUSE! WooHoo!
On Wed, Jan 9, 2013 at 12:10 PM, Ben Hood 0x6e6...@gmail.com wrote: I'm currently in the process of porting my app from Thrift to CQL3 and it seems to me that the underlying storage layout hasn't really changed fundamentally. The difference appears to be that CQL3 offers a neater abstraction on top of the wide row format. For example, in CQL3, your query results are bound to a specific schema, so you get named columns back - previously you had to process the slices procedurally. The insert path appears to be tighter as well - you don't seem to get away with leaving out key attributes. I'm sure somebody more knowledgeable can explain this better though. Cheers, Ben On Wed, Jan 9, 2013 at 4:51 PM, mrevilgnome mrevilgn...@gmail.com wrote: We use the thrift bindings for our current production cluster, so I haven't been tracking the developments regarding CQL3. I just discovered when speaking to another potential DSE customer that wide rows, or rather columns not defined in the metadata aren't supported in CQL 3. I'm curious to understand the reasoning behind this, whether this is an intentional direction shift away from the big table paradigm, and what's supposed to happen to those of us who have already bought into C* specifically because of the wide row support. What is our upgrade path?
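Edward's point 3 about composite overhead can be made concrete. A sketch of the CompositeType on-disk layout as he describes it (two bytes of length, the component bytes, then a one-byte end-of-component 0; this mirrors his description, not an authoritative spec):

```python
import struct

def encode_composite(*components: bytes) -> bytes:
    """Lay out components composite-style:
    <2-byte big-endian length><value><1-byte end-of-component (0)> per part."""
    out = b""
    for c in components:
        out += struct.pack(">H", len(c)) + c + b"\x00"
    return out

encoded = encode_composite(b"pw")
# 3 bytes of overhead per component on top of the value itself:
print(len(encoded) - len(b"pw"))  # 3
```

For a one-part column name like the terse 'pw' from the discussion above, the wrapper more than doubles the stored name, which is the cost being argued about.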
Re: Wide rows in CQL 3
By no upgrade path I mean to say that if I have a table with compact storage I can not upgrade it to sparse storage. If I have an existing COMPACT table and I want to add a Map to it, I can not. This is what I mean by no upgrade path. Column families that mix static and dynamic columns are pretty common. In fact it is pretty much the default case: you have a default validator, then some columns have specific validators. In the old days people used to say you only need one column family; you would subdivide your row key into parts: username=username, password=password, friend-friene = friends, pet-pets = pets. It's very efficient and very easy if you understand what a slice is. Is everyone else just adding a column family every time they have new data? :) Sounds very un-no-sql-like. Most people are probably going to store column names as tersely as possible. You're not going to store password as a multibyte UTF8(password). You store it as ascii(password) (or really ascii('pw')). Also, for the rest of my comment, I meant that the comparator of any sparse table always seems to be a COMPOSITE even if it is only one part (last I checked). Everything is -COMPOSITE(UTF-8(colname))- at minimum, when in a compact table it is -colname-. My overarching point is that the 5 things I listed do have a cost; the user by default gets sparse storage unless they are smart enough to know they do not want it. This is naturally going to force people away from compact storage. Basically for any column family there are two possible decision paths: 1) use compact 2) use sparse. Other than ease of use, why would I choose sparse? Why should it be the default? On Wed, Jan 9, 2013 at 5:14 PM, Sylvain Lebresne sylv...@datastax.com wrote: c way. Now I can't pretend knowing what every user is doing, but from my experience and what I've seen, this is not such a common thing and CF are either static or dynamic in nature, not both.
Re: Wide rows in CQL 3
Also I have to say I do not get that blank sparse column. Ghost ranges are a little weird but they don't bother me. 1) It's a row of nothing. The definition of a waste. 2) Suppose I have 1 billion rows and my distribution is mostly rows of 1 or 2 columns. My database is now significantly bigger. That stinks. 3) Suppose I write columns frequently. Do I constantly need to keep writing this sparse empty row? It seems like I would. Worst case, each sstable with a write to a rowkey also has this sparse column, meaning multiple blank empty wasteful columns on disk to solve ghosts, which do not bother me anyway. 4) Are these sparse columns also taking memtable space? These questions would give me serious pause to use sparse tables.
Re: Starting Cassandra
I think 1.6.0_24 is too low and 1.7.0 is too high. Try a more recent 1.6. I just had problems with 1.6.0_23; see here: https://issues.apache.org/jira/browse/CASSANDRA-4944 On Thu, Jan 10, 2013 at 10:29 AM, Sloot, Hans-Peter hans-peter.sl...@atos.net wrote: I have 4 VMs with 1024M memory, 1 CPU. From: Andrea Gazzarini Sent: 10-01-2013, 16:24 To: user@cassandra.apache.org Subject: Re: Starting Cassandra Hi, I'm running Cassandra with 1.6_24 and it's all working, so probably the problem is elsewhere. What about your hardware / OS configuration? On 01/10/2013 04:19 PM, Sloot, Hans-Peter wrote: The java version is 1.6_24. The manual said that 1.7 was not the best choice. But I will try it. From: adeel.ak...@panasiangroup.com Sent: 10-01-2013, 16:08 To: user@cassandra.apache.org; Sloot, Hans-Peter CC: user@cassandra.apache.org Subject: Re: Starting Cassandra Hi, Please check the java version with the (java -version) command and install Java 7 to resolve this issue. Regards, Adeel Akbar Quoting Sloot, Hans-Peter hans-peter.sl...@atos.net: Hello, Can someone help me out? I have installed Cassandra enterprise and followed the cookbook: - Configured the cassandra.yaml file - Configured the cassandra-topology.properties file But when I try to start the cluster with 'service dse start' nothing starts. With cassandra -f I get: /usr/sbin/cassandra -f xss = -ea -javaagent:/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms495M -Xmx495M -Xmn100M -XX:+HeapDumpOnOutOfMemoryError -Xss180k Segmentation fault With the command cassandra -v I get: xss = -ea -javaagent:/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms495M -Xmx495M -Xmn100M -XX:+HeapDumpOnOutOfMemoryError -Xss180k 1.1.6-dse-p1 Regards Hans-Peter
Re: trying to use row_cache (b/c we have hot rows) but nodetool info says zero requests
You have to change the column family caching setting from keys_only to rows_only (or all); otherwise the row cache will not be on for this CF. On Wednesday, January 16, 2013, Brian Tarbox tar...@cabotresearch.com wrote: We have quite wide rows and do a lot of concentrated processing on each row... so I thought I'd try the row cache on one node in my cluster to see if I could detect an effect of using it. The problem is that nodetool info says that even with a two gig row_cache we're getting zero requests. Since my client program is actively processing, and since the keycache shows lots of activity, I'm puzzled. Shouldn't any read of a column cause the entire row to be loaded? My entire data file is only 32 gig right now, so it's hard to imagine the 2 gig is too small to hold even a single row? Any suggestions on how to proceed are appreciated. Thanks. Brian Tarbox
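Assuming the 1.1/1.2-era per-column-family caching attribute, the change described above would look something like this from cassandra-cli (the column family name is illustrative):

```
UPDATE COLUMN FAMILY wide_rows WITH caching = 'rows_only';
```

Use 'all' instead of 'rows_only' to keep the key cache enabled as well.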
Re: Starting Cassandra
I think at this point Cassandra startup scripts should reject unsupported JVM versions, since Cassandra won't even start with many JVMs at this point. On Tuesday, January 15, 2013, Michael Kjellman mkjell...@barracuda.com wrote: Do yourself a favor and get a copy of the Oracle 7 JDK (now with more security patches too!) On Jan 15, 2013, at 1:44 AM, Sloot, Hans-Peter hans-peter.sl...@atos.net wrote: I managed to install apache-cassandra-1.2.0-bin.tar.gz With java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64 I still get the segmentation fault. However with java-1.7.0-openjdk-1.7.0.3-2.1.0.1.el6.7.x86_64 everything runs fine. Regards Hans-Peter From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, 15 January 2013 1:20 To: user@cassandra.apache.org Subject: Re: Starting Cassandra DSE includes hadoop files. It looks like the installation is broken. I would start again if possible and/or ask the peeps at DataStax about your particular OS / JVM configuration. In the past I've used this to set a particular JVM when multiple ones are installed… update-alternatives --set java /usr/lib/jvm/java-6-sun/jre/bin/java Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/01/2013, at 10:55 PM, Sloot, Hans-Peter hans-peter.sl...@atos.net wrote: Hi, I removed the open-jdk packages, which caused the dse* packages to be uninstalled too, and installed jdk6u38. But when I installed the dse packages, yum also downloaded and installed the open-jdk packages.
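The version gate suggested above could be sketched like this (the minimum 1.6 update chosen here is a hypothetical threshold for illustration, not an official Cassandra requirement):

```python
import re

def jvm_acceptable(version_string, min_update=32):
    """Return True if a 'java -version'-style string like '1.6.0_24' is 1.7+
    or meets a hypothetical minimum of 1.6.0_<min_update>."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)(?:_(\d+))?", version_string)
    if not m:
        return False
    major, minor, _micro, update = (int(g) if g else 0 for g in m.groups())
    if (major, minor) > (1, 6):
        return True
    if (major, minor) < (1, 6):
        return False
    return update >= min_update

print(jvm_acceptable("1.6.0_24"))  # False, as with the build that segfaulted above
print(jvm_acceptable("1.7.0_3"))   # True
```

A startup script could run this check against the output of `java -version` and refuse to launch rather than segfault.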
Re: Cassandra Consistency problem with NTP
If you have 40ms NTP drift, something is VERY VERY wrong. You should have a local NTP server on the same subnet; do not try to use one on the moon. On Thu, Jan 17, 2013 at 4:42 AM, Sylvain Lebresne sylv...@datastax.com wrote: So what I want is, Cassandra provide some information for client, to indicate A is stored before B, e.g. global unique timestamp, or row order. The row order is determined by 1) the comparator you use for the column family and 2) the column names you, the client, choose for A and B. So what are the column names you use for A and B? Now what you could do is use a TimeUUID comparator for that column family and use a time uuid for A and B column names. In that case, provided A and B are sent from the same client node and B is sent after A on that client (which you said is the case), then any non-buggy time uuid generator will guarantee that the uuid generated for A will be smaller than the one for B, and thus that in Cassandra, A will be sorted before B. In any case, the point I want to make is that Cassandra itself cannot do anything for your problem, because by design the row ordering is something entirely controlled client side (and just so there is no misunderstanding, I make that point not to suggest you were wrong to ask this mailing list, but because we can't suggest a proper solution unless we clearly understand what the problem is). -- Sylvain 2013/1/17 Sylvain Lebresne sylv...@datastax.com I'm not sure I fully understand your problem. You seem to be talking of ordering the requests in the order they are generated. But in that case, you will rely on the ordering of columns within whatever row you store requests A and B in, and that order depends on the column names, which in turn are client provided and don't depend at all on the time synchronization of the cluster nodes. And since you are able to say that request A comes before B, I suppose this means said requests are generated from the same source.
In which case you just need to make sure that the column names storing each request respect the correct ordering. The column timestamps Cassandra uses are there to decide which update *to the same column* is the more recent one. So they only come into play if your requests A and B update the same column and you're interested in knowing which one of the updates will win when you read. But even if that's your case (which doesn't sound like it at all from your description), the column timestamp is only generated server side if you use CQL. And even in that latter case, it's a convenience and you can force a timestamp client side if you really wish. In other words, Cassandra's dependency on time synchronization is not a strong one even in that case. But again, that doesn't seem at all to be the problem you are trying to solve. -- Sylvain On Thu, Jan 17, 2013 at 2:56 AM, Jason Tang ares.t...@gmail.com wrote: Hi I am using Cassandra in a message bus solution; the major responsibility of Cassandra is recording the incoming requests for later consuming. One strategy is First In First Out (FIFO), so I need to get the stored requests in reversed order. I use NTP to synchronize the system time for the nodes in the cluster (4 nodes). But the local times of the nodes still have some inaccuracy, around 40 ms. The consistency level is write ALL and read ONE, and the replication factor is 3. But here is the problem: Request A comes to node One at local time PM 10:00:01.000. Request B comes to node Two at local time PM 10:00:00.980. The correct order is A -> B, but the timestamp order is B -> A. So is there any way for Cassandra to keep the correct order for read operations? (e.g. a logical timestamp?) Or does Cassandra strongly depend on a time synchronization solution? BRs //Tang
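Sylvain's TimeUUID suggestion can be sketched in a few lines. This is a minimal illustration, not a Cassandra client: it only shows that TimeUUIDs generated in sequence on one client sort in generation order when compared by their embedded timestamp, which is what a TimeUUID comparator does.

```python
import uuid

# Name the request columns with client-generated TimeUUIDs. A TimeUUID
# comparator sorts columns by the embedded 60-bit timestamp, so two
# uuids generated in sequence on the same client sort in generation
# order. (CPython's uuid1 even bumps the timestamp when two calls land
# in the same 100ns interval, so ordering is strict within one process.)
col_a = uuid.uuid1()   # column name for request A
col_b = uuid.uuid1()   # column name for request B, generated afterwards

# Emulate the comparator: order by the time component, not raw bytes.
ordered = sorted([col_b, col_a], key=lambda u: u.time)
assert ordered == [col_a, col_b]   # A sorts before B regardless of node clocks
```

The key point of the thread survives in the sketch: the ordering comes from the client-chosen column names, so NTP drift between cluster nodes never enters into it.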
Re: Cassandra Performance Benchmarking.
Wow, you managed to do a load test through the cassandra-cli. There should be a merit badge for that. You should use the built-in stress tool or YCSB. The CLI has to do much more string conversion than a normal client would and it is not built for performance. You will definitely get better numbers through other means. On Thu, Jan 17, 2013 at 2:10 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote: Hi, I am trying to maximize the number of read queries executed per second. Here is my cluster configuration. Replication - Default. 12 Data Nodes. 16 Client Nodes - used for querying. Each client node executes 32 threads - each thread executes 76896 read queries using the cassandra-cli tool, i.e. all the read queries are stored in a file and that file is given to the cassandra-cli tool (using the -f option) which is executed by a thread. So the total number of queries for 16 client nodes is 16 * 32 * 76896. The read queries on each client node are submitted at the same time. The time taken for 16 * 32 * 76896 read queries is nearly 742 seconds - which is nearly 53k transactions/second. I would like to know if there is any other way/tool through which I can improve the number of transactions/second. Is the performance affected by the cassandra-cli tool? thanks pradeep
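A quick back-of-the-envelope check of the throughput figure quoted in the post:

```python
# Reproduce the "nearly 53k transactions/second" figure from the post.
clients = 16
threads_per_client = 32
queries_per_thread = 76896

total_queries = clients * threads_per_client * queries_per_thread
elapsed_seconds = 742

throughput = total_queries / elapsed_seconds
print(total_queries, int(throughput))   # ~39.4M reads at roughly 53,000/s
```

So the 53k/s number checks out arithmetically; the point of the reply is that the bottleneck is the CLI's string handling, not the cluster.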
Re: Key-hash based node selection
You can not be /mostly/ consistent read, like you can not be half-pregnant or half-transactional. You either are or you are not. If you do not have enough nodes for a QUORUM, the read fails. Thus you never get stale reads, you only get failed reads. The dynamic snitch makes reads sticky at READ.ONE. Until a node crosses the badness_threshold, reads should be routed to the same node (first natural endpoint). This is not a guarantee, as each node keeps its own snitch scores and routes requests based on its view of the scores. So at READ.ONE you could argue that Cassandra is mostly consistent based on your definition. On Fri, Jan 18, 2013 at 7:23 PM, Timothy Denike ti...@circuitboy.org wrote: /mostly/ consistent reads
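The arithmetic behind "not enough nodes for a QUORUM means the read fails" is just a majority calculation, sketched here for illustration:

```python
# QUORUM requires a strict majority of the replicas for a key.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

# With RF=3 a QUORUM read needs 2 live replicas; if 2 of the 3 are down
# the read fails outright instead of returning possibly stale data.
assert quorum(3) == 2
assert quorum(5) == 3
assert quorum(1) == 1
```

This is why QUORUM gives failed reads rather than stale reads: any two quorums of the same replica set must overlap in at least one node.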
Re: Is this how to read the output of nodetool cfhistograms?
This was described in good detail here: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/ On Tue, Jan 22, 2013 at 9:41 AM, Brian Tarbox tar...@cabotresearch.com wrote: Thank you! Since this is a very non-standard way to display data it might be worth a better explanation in the various online documentation sets. Thank you again. Brian On Tue, Jan 22, 2013 at 9:19 AM, Mina Naguib mina.nag...@adgear.com wrote: On 2013-01-22, at 8:59 AM, Brian Tarbox tar...@cabotresearch.com wrote: The output of this command seems to make no sense unless I think of it as 5 completely separate histograms that just happen to be displayed together. Using this example output, should I read it as: my reads all took either 1 or 2 sstables. And separately, I had write latencies of 3, 7, 19. And separately I had read latencies of 2, 8, 69, etc.? In other words, each row isn't really a row, i.e. on those 16033 reads from a single SSTable I didn't have 0 write latency, 0 read latency, 0 row size and 0 column count. Is that right? Correct. A number in any of the metric columns is a count value bucketed in the offset on that row. There are no relationships between other columns on the same row. So your first row says 16033 reads were satisfied by 1 sstable. The other metrics (for example, the latency of these reads) are reflected in the histogram under Read Latency, under various other bucketed offsets.

Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
1          16033              0             0         0             0
2            303              0             0         0             1
3              0              0             0         0             0
4              0              0             0         0             0
5              0              0             0         0             0
6              0              0             0         0             0
7              0              0             0         0             0
8              0              0             2         0             0
10             0              0             0         0          6261
12             0              0             2         0           117
14             0              0             8         0             0
17             0              3            69         0           255
20             0              7           163         0             0
24             0             19          1369         0             0
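The "five independent histograms sharing one Offset column" reading can be made concrete with a small sketch, modeling two of the columns from the example as separate dicts keyed by offset:

```python
# Each metric column of cfhistograms is its own histogram keyed by the
# shared Offset column; numbers on the same row are unrelated.
sstables_per_read = {1: 16033, 2: 303}   # 16033 reads hit 1 sstable, 303 hit 2
read_latency = {8: 2, 12: 2, 14: 8, 17: 69, 20: 163, 24: 1369}

# "My reads all took either 1 or 2 sstables":
assert set(sstables_per_read) == {1, 2}

# The latencies of those same reads live in a *different* histogram,
# bucketed under different offsets, e.g. 1369 reads landed in the
# latency bucket at offset 24.
assert read_latency[24] == 1369
```

The mistake the output invites is reading across a row; the correct mental model is reading down each column independently, as above.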
Re: Large commit log reasons
By default Cassandra uses 1/3rd of the heap size for memtable storage. If you make the memtables smaller they should flush faster and your commit logs should not grow large. Large commit logs are not a problem in themselves; some use cases that write to some Column Families more than others can make the commit log directory grow. Basically a commit log segment does not get removed until everything in it is flushed. We have a nagios alarm on ours; if it hits 8GB something is wrong, but again a large commit log is normal and I would not worry. Edward On Wed, Jan 23, 2013 at 10:42 AM, vhmolinar vhmoli...@gmail.com wrote: Hi fellows. I currently have a 3 node cluster running with a replication factor of 1. It's a pretty simple deployment and all my enforcements are focused on writes rather than reads. Actually I'm noticing that my commit log size is always very big compared to the amount of data being persisted (which varies around 5gb). So, that leads me to three doubts: 1- When a commit log gets bigger, does it mean that cassandra hasn't processed those writes yet? 2- How could I speed up my flushes to sstables? 3- Does my commit log decrease as much as my sstable increases? Is it a rule? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Large-commit-log-reasons-tp7584964.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
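The two knobs Edward alludes to lived in cassandra.yaml in the 1.x versions current at the time of this thread; the values below are illustrative, not recommendations, and names/defaults should be checked against your version's shipped yaml:

```yaml
# cassandra.yaml (Cassandra 1.x-era setting names)

# Cap total memtable space; smaller memtables flush sooner, so commit
# log segments become reclaimable sooner. Default is 1/3 of the heap.
memtable_total_space_in_mb: 1024

# Cap the total commit log size; when the cap is hit, Cassandra flushes
# the oldest dirty column families so old segments can be recycled.
commitlog_total_space_in_mb: 4096
```

As the reply notes, shrinking memtables to tame the commit log is usually the wrong trade: you pay for it with more frequent flushes and more compaction.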
Re: Large commit log reasons
1. The commit log is only read on startup: if writes are unflushed then the commit logs need to be replayed. 2. Shrink the memtable settings - but you don't want to do this. 3. Commit log size is not directly related to sstable size. E.g. if you write the same row a billion times the commit log will be large but the sstable will be 1 row. On Wed, Jan 23, 2013 at 11:10 AM, vhmolinar vhmoli...@gmail.com wrote:
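Point 3 can be illustrated with a toy model of the write path, assuming nothing beyond "commit log appends every write, memtable keeps the last value per cell":

```python
# Overwriting one row many times appends one commit log entry per write,
# but the memtable (and hence the flushed sstable) keeps only the final
# value for that cell.
writes = [("rowA", "col1", "v%d" % i) for i in range(1000)]

commitlog_entries = len(writes)      # every write is appended to the log
memtable = {}
for row, col, value in writes:
    memtable[(row, col)] = value     # last write wins, in place

assert commitlog_entries == 1000     # log grew with every write
assert len(memtable) == 1            # only one cell survives to the sstable
```

So a commit log much larger than the data on disk usually just means heavy overwriting, not a backlog of unprocessed writes.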
Re: Issue when deleting Cassandra rowKeys.
Make sure the timestamp on your delete is not older than the timestamp of the data; a delete with an older timestamp will not remove anything. On Sat, Jan 26, 2013 at 1:33 PM, Kasun Weranga kas...@wso2.com wrote: Hi all, When I delete some rowkeys programmatically I can see two rowkeys remain in the column family. I think it is due to tombstones. Is there a way to remove them when deleting rowkeys? Can I run compaction programmatically after deletion? Will it remove all these remaining rowkeys? Thanks, Kasun.
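The timestamp rule behind Edward's one-liner can be sketched as a tiny reconciliation function. This is a simplification for illustration: the higher timestamp wins, and Cassandra resolves a tie in favor of the delete.

```python
# Simplified cell-vs-tombstone reconciliation: data survives a delete
# only if it carries a strictly newer timestamp than the tombstone.
def cell_survives(data_ts: int, delete_ts: int) -> bool:
    return data_ts > delete_ts

assert not cell_survives(100, 100)   # delete at the data's timestamp removes it
assert not cell_survives(100, 150)   # newer delete removes older data
assert cell_survives(200, 100)       # data re-written later shadows the tombstone
```

So if your client generates delete timestamps from a clock that lags the one used for the writes, the deletes silently do nothing, which looks exactly like "rowkeys remain after delete".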
Re: Denormalization
One technique is that on the client side you build a tool that takes the event and produces N mutations. In C* writes are cheap, so essentially, re-write everything on all changes. On Sun, Jan 27, 2013 at 4:03 PM, Fredrik Stigbäck fredrik.l.stigb...@sitevision.se wrote: Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g. if we have a USER cf with name, email etc. and denormalize user data into many other CFs and then update the information about a user (name, email...). What is the best way to handle updating those user data properties which might be spread out over many CFs and many rows? Regards /Fredrik
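The "one event, N mutations" technique can be sketched as follows. All table and column names here are invented for illustration, not from any real schema in the thread:

```python
# Hypothetical fan-out: one "user updated" event becomes N mutations,
# one for the canonical row plus one per place the data is denormalized.
def mutations_for_user_update(user_id, name, email, group_ids):
    muts = [("users", user_id, {"name": name, "email": email})]
    for gid in group_ids:
        # re-write the denormalized copy held in each group's row
        muts.append(("group_members", gid, {user_id + ":name": name,
                                            user_id + ":email": email}))
    return muts

batch = mutations_for_user_update("u42", "Ada", "ada@example.com", ["g1", "g2"])
assert len(batch) == 3   # one canonical write plus two denormalized copies
```

The fan-out factor is what Dean's follow-up warns about: with 2-10 copies per event this is cheap, but with 30,000 copies the "writes are cheap" rule of thumb breaks down.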
Re: Denormalization
When I said that writes were cheap, I meant that in a normal case people are making 2-10 inserts for what in a relational database might be one. 30K inserts is certainly not cheap. Your use case with 30,000 inserts is probably a special case. Most directory services that I am aware of (OpenLDAP, Active Directory, Sun Directory Server) do eventually consistent master/slave and multi-master replication, so no worries about having to background something. You just want the replication to be fast enough so that when you call the employee about to be fired into the office, by the time he leaves and gets home he can not VPN in and rm -rf / your main file server :) On Sun, Jan 27, 2013 at 7:57 PM, Hiller, Dean dean.hil...@nrel.gov wrote: Sometimes this is true, sometimes not. We have a use case with an admin tool where we choose to do this denorm for ACL permission checks to make them extremely fast. That said, we have one issue with one object that has too many children (30,000), so when someone gives a user access to this one object with 30,000 children, we end up with a bad 60 second wait and users ended up getting frustrated and trying to cancel (our plan, since admin activity hardly ever happens, is to do it on a background thread and return immediately to the user, telling him his changes will take effect in 1 minute). After all, admin changes are infrequent anyway. This example demonstrates how sometimes it can almost burn you. I guess my real point is it really depends on your use cases ;). In a lot of cases denorm can work, but in some cases it burns you, so you have to balance it all. In 90% of our cases our denorm is working great, and for this one case we need to background the permission change as we still LOVE the performance of our ACL checks. Ps. 30,000 writes in cassandra is not cheap when done from one server ;) but in general parallelized writes are very fast for something like 500.
Later, Dean From: Edward Capriolo edlinuxg...@gmail.com Reply-To: user@cassandra.apache.org Date: Sunday, January 27, 2013 5:50 PM To: user@cassandra.apache.org Subject: Re: Denormalization One technique is that on the client side you build a tool that takes the event and produces N mutations. In C* writes are cheap, so essentially, re-write everything on all changes. On Sun, Jan 27, 2013 at 4:03 PM, Fredrik Stigbäck fredrik.l.stigb...@sitevision.se wrote: Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g. if we have a USER cf with name, email etc. and denormalize user data into many other CFs and then update the information about a user (name, email...). What is the best way to handle updating those user data properties which might be spread out over many CFs and many rows? Regards /Fredrik