Re: yet a couple more questions on composite columns

2012-02-05 Thread R. Verlangen
Yiming, I am using 2 CF's. Performance-wise this should not be an issue; I
use them as a small-files data store. My 2 CF's are:

FilesMeta
FilesData
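A minimal in-memory sketch of this two-CF split (plain Python dicts standing in for the column families, not a real Cassandra client; the FilesMeta/FilesData names come from the message, the column names are hypothetical). The point of the split is that metadata reads never have to touch the rows holding the larger file bodies:

```python
# Toy stand-in for the two column families: one row per file key.
# FilesMeta holds small metadata columns; FilesData holds the payload,
# so a metadata lookup never drags the (larger) file body along.
files_meta = {}   # row key -> {column name -> value}
files_data = {}   # row key -> {column name -> value}

def store_file(key, content, mime_type):
    files_meta[key] = {
        "size": len(content),
        "mime_type": mime_type,
    }
    files_data[key] = {"body": content}

def stat_file(key):
    # Metadata-only read: touches FilesMeta, not FilesData.
    return files_meta[key]

store_file("report.pdf", b"%PDF-1.4 ...", "application/pdf")
print(stat_file("report.pdf")["size"])  # 12
```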

2012/2/5 Yiming Sun yiming@gmail.com

 Interesting idea, Jim.  Is there a reason you don't use
 metadata:{accountId} instead?  For performance reasons?


 On Sat, Feb 4, 2012 at 6:24 PM, Jim Ancona j...@anconafamily.com wrote:

 I've used special values which still comply with the Composite
 schema for the metadata columns, e.g. a column of
 1970-01-01:{accountId} for a metadata column where the Composite is
 DateType:UTF8Type.

 Jim
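Jim's trick relies on composite columns sorting component by component, so a 1970-01-01 date component sorts before any real date and the metadata columns cluster at the head of the row. A toy sketch with Python tuples, which sort the same way (the epoch sentinel is from the message; the other column names and values are hypothetical):

```python
from datetime import date

EPOCH = date(1970, 1, 1)  # sentinel date marking metadata columns

# Composite column names as (DateType, UTF8Type) tuples within one row.
row = {
    (date(2012, 2, 4), "evt-17"): "login",
    (EPOCH, "acct-42"): "metadata for account 42",
    (date(2012, 1, 9), "evt-03"): "purchase",
}

# Cassandra keeps columns sorted by the composite; Python tuples sort the
# same way, so the epoch-dated metadata column always comes first in the row.
ordered = sorted(row)
print(ordered[0])  # (datetime.date(1970, 1, 1), 'acct-42')
```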

 On Sat, Feb 4, 2012 at 2:13 PM, Yiming Sun yiming@gmail.com wrote:
  Thanks Andrey and Chris.  It sounds like we don't necessarily have to
 use
  composite columns.  From what I understand about dynamic CF, each row
 may
  have completely different data from other rows;  but in our case, the
 data
  in each row is similar to other rows; my concern was more about the
  homogeneity of the data between columns.
 
  In our original supercolumn-based schema, one special supercolumn is
 called
  metadata which contains a number of subcolumns to hold metadata
 describing
  each collection (e.g. number of documents, etc.), then the rest of the
  supercolumns in the same row are all IDs of documents belonging to the
  collection, and for each document supercolumn, the subcolumns contain
 the
  document content as well as metadata on individual document (e.g.
 checksum
  of each document).
 
  To move away from the supercolumn schema, I could either create two
 CFs, one
  to hold metadata, the other document content; or I could create just
 one CF
  mixing metadata and doc content in the same row, and using composite
 column
  names to identify if the particular column is metadata or a document.
  I am
  just wondering if you have any inputs on the pros and cons of each
 schema.
 
  -- Y.
 
 
  On Fri, Feb 3, 2012 at 10:27 PM, Chris Gerken 
 chrisger...@mindspring.com
  wrote:
 
 
 
 
  On 4 February 2012 06:21, Yiming Sun yiming@gmail.com wrote:
 
  I cannot have one composite column name with 3 components while
 another
  with 4 components?
 
   Just use 4 components and leave the last one empty (if it is the same type).
 
  Another question I have is how flexible composite columns actually
 are.
   If my data model has a CF containing US zip codes with the following
  composite columns:
 
  {OH:Spring Field} : 45503
  {OH:Columbus} : 43085
  {FL:Spring Field} : 32401
  {FL:Key West}  : 33040
 
  I know I can ask cassandra to give me the zip codes of all cities in
  OH.  But can I ask it to give me the zip codes of all cities named
 Spring
  Field using this model?  Thanks.
 
  No. You must set the first composite component first.
 
 
  I'd use a dynamic CF:
  row key = state abbreviation
  column name = city name
  column value = zip code (or a complex object, one of whose properties
 is
  zip code)
 
  you can iterate over the columns in a single row to get a state's city
  names and their zip codes, and you can do a get_range_slices over all keys,
  with the column slice starting and ending on the city name, to find the zip
  codes for cities with the given name.
 
  I think
 
  - Chris
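The model Chris describes can be sketched with plain dicts (a toy in-memory stand-in, not the real Thrift API): one row per state is cheap to slice, but finding every city with a given name forces a pass over all rows, which is what get_range_slices across all keys amounts to.

```python
# row key = state abbreviation, column name = city, column value = zip code
zips = {
    "OH": {"Spring Field": "45503", "Columbus": "43085"},
    "FL": {"Spring Field": "32401", "Key West": "33040"},
}

# Cheap: one row read gives every city in a state, in column order.
print(sorted(zips["OH"].items()))

# Expensive: a city-name lookup must touch every row (all keys), like
# get_range_slices with a column slice of [city, city].
def zips_for_city(city):
    return {state: cols[city] for state, cols in zips.items() if city in cols}

print(zips_for_city("Spring Field"))  # {'OH': '45503', 'FL': '32401'}
```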
 
 





Re: yet a couple more questions on composite columns

2012-02-05 Thread Yiming Sun
Thanks R.V.!! We are also dealing with many small files, so this sounds
really promising.

-- Y.

On Sun, Feb 5, 2012 at 9:59 AM, R. Verlangen ro...@us2.nl wrote:

 Yiming, I am using 2 CF's. Performance-wise this should not be an issue; I
 use them as a small-files data store. My 2 CF's are:

 FilesMeta
 FilesData








Re: Restart cassandra every X days?

2012-02-05 Thread aaron morton
Close enough for me :)
A

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/02/2012, at 8:39 PM, R. Verlangen wrote:

 Well, it seems it's balancing itself, 24 hours later the ring looks like this:
 
 ***.89datacenter1 rack1   Up Normal  7.36 GB 50.00%  0
 ***.135datacenter1 rack1   Up Normal  8.84 GB 50.00%  
 85070591730234615865843651857942052864
 
 Looks pretty normal, right?
 
 2012/2/2 aaron morton aa...@thelastpickle.com
 Speaking technically, that ain't right.
 
 I would:
 * Check if node .135 is holding a lot of hints. 
 * Take a look on disk and see what is there.
 * Go through a repair and compact on each node.
 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 2/02/2012, at 9:55 PM, R. Verlangen wrote:
 
 Yes, I already did a repair and cleanup. Currently my ring looks like this:
 
 Address DC  RackStatus State   LoadOwns  
   Token
 ***.89datacenter1 rack1   Up Normal  2.44 GB 50.00%  0
 ***.135datacenter1 rack1   Up Normal  6.99 GB 50.00%  
 85070591730234615865843651857942052864
 
 It's not really a problem, but I'm still wondering why this happens.
 
 2012/2/1 aaron morton aa...@thelastpickle.com
 Do you mean the load in nodetool ring is not even, despite the tokens being
 evenly distributed?
 
 I would assume this is not the case given the difference, but it may be 
 hints given you have just done an upgrade. Check the system using nodetool 
 cfstats to see. They will eventually be delivered and deleted. 
 
 More likely you will want to:
 1) nodetool repair to make sure all data is distributed then
 2) nodetool cleanup if you have changed the tokens at any point finally
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 31/01/2012, at 11:56 PM, R. Verlangen wrote:
 
 After running 3 days on Cassandra 1.0.7 it seems the problem has been 
 solved. One weird thing remains, on our 2 nodes (both 50% of the ring), the 
 first's usage is just over 25% of the second. 
 
 Anyone got an explanation for that?
 
 2012/1/29 aaron morton aa...@thelastpickle.com
 Yes but…
 
 For every upgrade, read NEWS.txt; it goes through the upgrade
 procedure in detail. If you want to feel extra smart, scan through
 CHANGES.txt to get an idea of what's going on.
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 29/01/2012, at 4:14 AM, Maxim Potekhin wrote:
 
 Sorry if this has been covered, I was concentrating solely on 0.8x --
 can I just d/l 1.0.x and continue using same data on same cluster?
 
 Maxim
 
 
 On 1/28/2012 7:53 AM, R. Verlangen wrote:
 
 Ok, seems that it's clear what I should do next ;-)
 
 2012/1/28 aaron morton aa...@thelastpickle.com
 There are no blockers to upgrading to 1.0.X.
 
 A 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 28/01/2012, at 7:48 AM, R. Verlangen wrote:
 
 Ok. Seems that an upgrade might fix these problems. Is Cassandra 1.x.x 
 stable enough to upgrade for, or should we wait for a couple of weeks?
 
 2012/1/27 Edward Capriolo edlinuxg...@gmail.com
 I would not say that issuing a restart every X days is a good idea. You
 are mostly developing a superstition; you should find the source of the
 problem. It could be JMX or Thrift clients not closing connections. We
 don't restart nodes on a regimen; they work fine.
 
 
 On Thursday, January 26, 2012, Mike Panchenko m...@mihasya.com wrote:
  There are two relevant bugs (that I know of), both resolved in 
  somewhat recent versions, which make somewhat regular restarts 
  beneficial
  https://issues.apache.org/jira/browse/CASSANDRA-2868 (memory leak in 
  GCInspector, fixed in 0.7.9/0.8.5)
  https://issues.apache.org/jira/browse/CASSANDRA-2252 (heap 
  fragmentation due to the way memtables used to be allocated, 
  refactored in 1.0.0)
  Restarting daily is probably too frequent for either one of those 
  problems. We usually notice degraded performance in our ancient 
  cluster after ~2 weeks w/o a restart.
  As Aaron mentioned, if you have plenty of disk space, there's no 
  reason to worry about cruft sstables. The size of your active set is 
  what matters, and you can determine if that's getting too big by 
  watching for iowait (due to reads from the data partition) and/or 
  paging activity of the java process. When you hit that problem, the 
  solution is to 1. try to tune your caches and 2. add more nodes to 
  spread the load. I'll reiterate - looking at raw disk space usage 
  should not be your guide for that.
  Forcing a gc generally works, but should not be relied upon (note 
  suggest in 
  

Re: Consurrent compactors

2012-02-05 Thread aaron morton
Not sure I understand the question. Do you have an example where a CF is not 
getting compacted ? 

The compaction tasks will be processed in the order they are submitted. If you 
have concurrent_compactors > 1 then the thread pool for compactions (excluding 
validation compactions) will be able to process multiple compaction tasks in 
parallel. 

If you have a CF that gets a lot more traffic than the other CF's it will require 
more compaction. But by running with concurrent_compactors > 1, smaller CF's should 
still be able to get through. 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/02/2012, at 9:15 PM, Viktor Jevdokimov wrote:

 My concern is not about cleanup, but about the supposed „tendency of small 
 sstables to accumulate during a single long running compactions“. When the next 
 task is for the same column family as a currently long-running compaction, 
 other column families' compactions are frozen and the concurrent_compactors > 1 
 setting just does not work.
  
  
 Best regards/ Pagarbiai
  
 Viktor Jevdokimov
 Senior Developer
  
 Email:  viktor.jevdoki...@adform.com
 Phone: +370 5 212 3063. Fax: +370 5 261 0453
 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
  
  
 
 Disclaimer: The information contained in this message and attachments is 
 intended solely for the attention and use of the named addressee and may be 
 confidential. If you are not the intended recipient, you are reminded that 
 the information remains the property of the sender. You must not use, 
 disclose, distribute, copy, print or rely on this e-mail. If you have 
 received this message in error, please contact the sender immediately and 
 irrevocably delete this message and any copies.
 
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: Wednesday, February 01, 2012 21:51
 To: user@cassandra.apache.org
 Subject: Re: Consurrent compactors
  
 (Assuming 1.0* release)
 From the comments in cassandra.yaml
  
 # Number of simultaneous compactions to allow, NOT including
 # validation compactions for anti-entropy repair.  Simultaneous
 # compactions can help preserve read performance in a mixed read/write
 # workload, by mitigating the tendency of small sstables to accumulate
 # during a single long running compactions. The default is usually
 # fine and if you experience problems with compaction running too
 # slowly or too fast, you should look at
 # compaction_throughput_mb_per_sec first.
 #
 # This setting has no effect on LeveledCompactionStrategy.
 #
 # concurrent_compactors defaults to the number of cores.
 # Uncomment to make compaction mono-threaded, the pre-0.8 default.
 #concurrent_compactors: 1
  
 If you set it to 1 then only 1 compaction should run at a time, excluding 
 validation. 
  
 How often do you run a cleanup compaction ? They are only necessary when you 
 perform a token move.
  
 Cheers
  
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
  
 On 1/02/2012, at 9:48 PM, Viktor Jevdokimov wrote:
 
 
 Hi,
  
When concurrent_compactors is set to more than 1, it's rare that more than one 
compaction runs in parallel.
  
I didn't check the source code, but it looks like when the next compaction task 
(any of minor, major, or cleanup) is for the same CF, it will not start in 
parallel, and the tasks after it are not checked.
  
Would it be possible to check all pending tasks, not only the next one, to find 
which of them can be started?
  
This is especially relevant when the nightly cleanup is running: a lot of cleanup 
tasks are pending, and regular minor compactions wait until all cleanup 
compactions are finished.
  
  
  
 Best regards/ Pagarbiai
  
 Viktor Jevdokimov
 Senior Developer
  
 Email:  viktor.jevdoki...@adform.com
 Phone: +370 5 212 3063. Fax: +370 5 261 0453
 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
  
  
  
  



Re: Write latency of counter updates across multiple rows

2012-02-05 Thread aaron morton
I'm not thinking about counters specifically here, and assuming you are sending 
batch mutations of the same size… 

The mutations (inserts, counter increments) for a row are turned into a single 
task server side, and are then processed in a serial fashion. If you send a 
mutation for 2 rows it will be turned into two tasks, which can then be 
processed in parallel. 

There is a point of diminishing returns here. Each row you write to or read 
from will become a task; if you write to 1,000 rows at once you will put 1,000 
tasks in a thread pool which typically has 32 concurrent threads. This may 
block / add latency to other requests. It's more of an issue with reads than 
writes. 

Does that apply to your situation ? 

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 4/02/2012, at 1:19 AM, Amit Chavan wrote:

 
 Hi,
 
 In our use case, we maintain minute-wise roll ups for different metrics. 
 These are stored in a counter column family where the row key is a composite 
 containing the timestamp rounded to the last minute and an integer between 
 0-9 (This integer is calculated as the MD5 hash of the metric mod 10). The 
 column names are the metrics we wish to track. Typically, each row has about 
 100,000 counters.
 
 We tested two scenarios. The first one is as mentioned above. In this case we 
 got a per write latency of about 80 micro-seconds to 100 micro-seconds.
 
 In the other scenario, we calculated the integer in the row key as mod 100. 
 In this case we observed a per write latency of 50 micro-seconds to 70 
 micro-seconds.
 
 I wish to understand why counter updates become faster as they are spread
 across multiple rows.
 
 Cluster summary : 4 nodes running Cassandra 1.0.5. Each with 8 cores, 32G 
 RAM, 10G Cassandra heap. We are using replication factor of 2.
 
 
 -- 
 Thanks!
 Amit Chavan
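The sharded row-key scheme Amit describes (minute bucket plus MD5 of the metric mod N) can be sketched as below. Spreading the same increments over more rows means more independent row-level mutation tasks server side, which matches Aaron's explanation of why mod 100 beat mod 10. Function and variable names are hypothetical:

```python
import hashlib

def counter_row_key(metric, minute_ts, num_shards):
    """Composite row key: (minute bucket, md5(metric) % num_shards)."""
    digest = hashlib.md5(metric.encode()).hexdigest()
    shard = int(digest, 16) % num_shards
    return (minute_ts, shard)

metrics = ["page.views", "api.calls", "cache.hits", "signups"]
minute = 1328400000  # timestamp rounded down to the minute

# mod 10 packs all metrics for a minute into at most 10 rows;
# mod 100 spreads the same increments over up to 100 rows, so more
# row-level mutation tasks can be processed in parallel.
for n in (10, 100):
    rows = {counter_row_key(m, minute, n) for m in metrics}
    print(n, len(rows))
```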
 



Re: nodetool hangs and didn't print anything with firewall

2012-02-05 Thread Mohit Anchlia
Does it work with iptables disabled?

You could add log to your firewall rules to see if firewall is
dropping the packets.

On Sun, Feb 5, 2012 at 5:35 PM, Roshan codeva...@gmail.com wrote:
 Hi

 I have a 2-node Cassandra cluster, and each Linux box is configured with a
 firewall. The ports 7000, 7199 and 9160 are open in the firewall and I can
 telnet to the ports from both ends without any issue. But if I try to run
 nodetool from one of the Cassandra nodes, it hangs and doesn't print anything.

 $ sh nodetool -h app1 info

 Could someone please help me on this? Thanks.

 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/nodetool-hangs-and-didn-t-print-anything-with-firewall-tp7257286p7257286.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.


Best way to know the cluster status

2012-02-05 Thread Tamil selvan R.S
Hi,
 What is the best way to know the cluster status via php?
 Currently we are trying to connect to individual cassandra instance with a
specified timeout and if it fails we report the node to be down.
 But this test remains faulty. What are the other ways to test availability
of nodes in cassandra cluster?
 How does datastax opscenter manage to  do that?
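The timeout-based probe described above can be sketched as follows (in Python rather than PHP, to keep the list's examples in one language; the node list is hypothetical). Note a bare TCP connect only proves the Thrift port answers, not that the node is healthy; tools that run a per-node agent can query Cassandra's own view of ring state instead.

```python
import socket

# Hypothetical node list: (host, Thrift port) pairs.
NODES = [("cass1.example.com", 9160), ("cass2.example.com", 9160)]

def node_is_up(host, port, timeout=2.0):
    """TCP connect check against the Thrift port with a hard timeout.
    Only proves the port answers; it does not prove Cassandra is healthy."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

status = {host: node_is_up(host, port) for host, port in NODES}
print(status)
```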

Regards,
Tamil Selvan