Re: Why does `now()` produce different times within the same query?

2016-11-30 Thread Todd Fast
FWIW I'd suggest opening a bug--this behavior is certainly quite unexpected
and more than just a documentation issue. In general I can't imagine any
desirable properties of the current implementation, and there are likely a
bunch of latent bugs sitting out there, so it should be fixed.

Todd

On Wed, Nov 30, 2016 at 12:37 PM Terry Liu  wrote:

> Sorry for my typo. Obviously, I meant:
> "It appears that a single query that calls Cassandra's`now()` time
> function *multiple times *may actually cause a query to write or return
> different times."
>
> Less of a surprise now that I understand more about the implementation,
> but I agree that more explicit documentation around when exactly the
> "execution" of each now() call happens, and what implications it has for
> the resulting timestamps, would be helpful when running into this.
>
> Thanks for the quick responses!
>
> -Terry
>
>
>
> On Tue, Nov 29, 2016 at 2:45 PM, Marko Švaljek  wrote:
>
> Every now() call in a statement is, under the hood, "replaced" with a
> newly generated UUID.
>
> It can happen that they belong to different milliseconds in time.
>
> If you need to have the same timestamps, you need to set them on the
> client side.
>
>
> @msvaljek 
>
> 2016-11-29 22:49 GMT+01:00 Terry Liu :
>
> It appears that a single query that calls Cassandra's `now()` time
> function may actually cause a query to write or return different times.
>
> Is this the expected or defined behavior, and if so, why does it behave
> like this rather than evaluating `now()` once across an entire statement?
>
> This really affects UPDATE statements but to test it more easily, you
> could try something like:
>
> SELECT toTimestamp(now()) as a, toTimestamp(now()) as b
> FROM keyspace.table
> LIMIT 100;
>
> If you run that a few times, you should eventually see that the timestamp
> returned moves onto the next millisecond mid-query.
>
> --
> *Software Engineer*
> Turnitin - http://www.turnitin.com
> t...@turnitin.com
>
>
>
>
>
> --
> *Software Engineer*
> Turnitin - http://www.turnitin.com
> t...@turnitin.com
>


Dynamic schema modification an anti-pattern?

2014-10-06 Thread Todd Fast
There is a team at my work building an entity-attribute-value (EAV) store
using Cassandra. There is a column family, called Entity, where the
partition key is the UUID of the entity, and the columns are the attribute
names with their values. Each entity will contain hundreds to thousands of
attributes, out of a list of up to ten thousand known attribute names.

However, instead of using wide rows with dynamic columns (and serializing
type info with the value), they are trying to use a static column family
and modifying the schema dynamically as new named attributes are created.

(I believe one of the main drivers of this approach is to use collection
columns for certain attributes, and perhaps to preserve type metadata for a
given attribute.)

This approach goes against everything I've seen and done in Cassandra, and
is generally an anti-pattern for most persistence stores, but I want to
gather feedback before taking the next step with the team.

Do others consider this approach an anti-pattern, and if so, what are the
practical downsides?

For one, this means that the Entity schema would contain the superset of
all columns for all rows. What is the impact of having thousands of column
names in the schema? And what are the implications of modifying the schema
dynamically on a decent-sized cluster (5 nodes now, growing to 10s later)
under load?

Thanks,
Todd
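
For contrast with the dynamic-schema design, here is a minimal sketch of
the wide-row alternative in CQL3; the table and column names are
hypothetical. There is one partition per entity and one row per attribute
within it, with the value serialized by the application and its type
carried alongside it.

CREATE TABLE entity_attributes (
    entity_id  uuid,
    attr_name  text,
    attr_type  text,   -- e.g. 'int', 'text', 'list<text>'
    attr_value text,   -- value serialized by the application
    PRIMARY KEY (entity_id, attr_name)
);

-- Introducing a brand-new attribute name is just an INSERT; no schema
-- change or cluster-wide migration is needed:
INSERT INTO entity_attributes (entity_id, attr_name, attr_type, attr_value)
VALUES (5bd8c586-ae44-11e4-ab27-0800200c9a66, 'color', 'text', 'blue');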


Re: abusing cassandra's multi DC abilities

2014-02-24 Thread Todd Fast
Hi Jonathan--

First, best wishes for success with your platform.

Frankly, I think the architecture you described is only going to cause
you major trouble. I'm left wondering why you don't either use something
like XMPP (of which several implementations can handle this kind of
federated scenario) or simply have internal (REST) APIs to send a message
from the backend in one DC to the backend in another DC.

There are a bunch of ways to approach this problem: You could also use
Redis pubsub (though a bit brittle), SQS, or any number of other approaches
that would be simpler and more robust than what you described. I'd urge you
to really consider another approach.

Best,
Todd

On Saturday, February 22, 2014, Jonathan Haddad j...@jonhaddad.com wrote:

 Upfront TLDR: We want to do stuff (reindex documents, bust cache) when
 changed data from DC1 shows up in DC2.

 Full Story:
 We're planning on adding data centers throughout the US.  Our platform is
 used for business communications.  Each DC currently utilizes elastic
 search and redis.  A message can be sent from one user to another, and the
 intent is that it would be seen in near-real-time.  This means that 2
 people may be using different data centers, and the messages need to
 propagate from one to the other.

 On the plus side, we know we get this with Cassandra (fist pump) but the
 other pieces, not so much.  Even if they did work, there's all sorts of
 race conditions that could pop up from having different pieces of our
 architecture communicating over different channels.  From this, we've
 arrived at the idea that since Cassandra is the authoritative data source,
 we might be able to trigger events in DC2 based on activity coming through
 either the commit log or some other means.  One idea was to use a CF with a
 low gc time as a means of transporting messages between DCs, and watching
 the commit logs for deletes to that CF in order to know when we need to do
 things like reindex a document (or a new document), bust cache, etc.
  Facebook did something similar with their modifications to MySQL to
 include cache keys in the replication log.

 Assuming this is sane, I'd want to avoid having the same event register on
 3 servers, thus registering 3 items in the queue when only one should be
 there.  So, for any piece of data replicated from the other DC, I'd need a
 way to determine if it was supposed to actually trigger the event or not.
  (Maybe it looks at the token and determines if the current server falls in
 the token range?)  Or is there a better way?

 So, my questions to all ye Cassandra users:

 1. Is this is even sane?
 2. Is anyone doing it?

 --
 Jon Haddad
 http://www.rustyrazorblade.com
 skype: rustyrazorblade



Re: schema management

2013-07-01 Thread Todd Fast
Franc--

I think you will find Mutagen Cassandra very interesting; it is similar to
schema management tools like Flyway for SQL databases:

Mutagen Cassandra is a framework (based on Mutagen) that provides schema
 versioning and mutation for Apache Cassandra.

 Mutagen is a lightweight framework for applying versioned changes (known
 as mutations) to a resource, in this case a Cassandra schema. Mutagen takes
 into account the resource's existing state and only applies changes that
 haven't yet been applied.

 Schema mutation with Mutagen helps you make manageable changes to the
 schema of live Cassandra instances as you update your software, and is
 especially useful when used across development, test, staging, and
 production environments to automatically keep schemas in sync.



https://github.com/toddfast/mutagen-cassandra

Todd


On Mon, Jul 1, 2013 at 5:23 PM, sankalp kohli kohlisank...@gmail.com wrote:

 You can generate schema through the code. That is also one option.


 On Mon, Jul 1, 2013 at 4:10 PM, Franc Carter franc.car...@sirca.org.au wrote:


 Hi,

 I've been giving some thought to the way we deploy schemas and am looking
 for something better than our current approach, which is to use
 cassandra-cli scripts.

 What do people use for this ?

 cheers

 --

 *Franc Carter* | Systems architect | Sirca Ltd

 franc.car...@sirca.org.au | www.sirca.org.au

 Tel: +61 2 8355 2514

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215






Re: Announcing Mutagen

2013-05-17 Thread Todd Fast
Hi Blair--

Thanks for digging into the code. I did indeed experiment with longer
timeouts and the result was that trying to obtain the lock hung for
whatever amount of time I set the timeout for. I am not an expert on
Astyanax and haven't debugged my use of that recipe yet; I don't even know
if I've configured it correctly. Perhaps you have some guidance?

(Funny you mention your own migration framework--Mutagen is the second one
I've done for Cassandra. The first one, a plugin for Mokol, also had schema
rollbacks and some other features, but was only command-line.)


On Thu, May 16, 2013 at 11:06 PM, Blair Zajac bl...@orcaware.com wrote:

 On 5/16/13 10:22 PM, Todd Fast wrote:

 Mutagen Cassandra is a framework providing schema versioning and
 mutation for Apache Cassandra. It is similar to Flyway for SQL databases.

 https://github.com/toddfast/mutagen-cassandra

 Mutagen is a lightweight framework for applying versioned changes (known
 as mutations) to a resource, in this case a Cassandra schema. Mutagen
 takes into account the resource's existing state and only applies
 changes that haven't yet been applied.


 Hi Todd,

 Looking at your code, you have the ColumnPrefixDistributedRowLock
 commented out.  Could it be that the mutation is taking longer than a
 second to run?  Are the timeouts only happening when testing simultaneous
 updates?  Maybe the locks aren't being cleaned up?

 Funny timing, I'm working on porting Scala Migrations [1] to Cassandra and
 have a working implementation.  It's not as fancy as Scala Migrations (it
 doesn't scan a package for migration subclasses and it currently doesn't do
 rollbacks) but it gets the basics done.  Hoping to release code in the near
 future.

 Differences from Mutagen:

 1) Mutations are written only in Scala.
 2) Since it's a new project, it uses a Java Driver session instead of an
 Astyanax connection, since I only intend to use CQL3 tables.

 Blair

 [1] http://code.google.com/p/scala-migrations/



Announcing Mutagen

2013-05-16 Thread Todd Fast
Mutagen Cassandra is a framework providing schema versioning and mutation
for Apache Cassandra. It is similar to Flyway for SQL databases.

https://github.com/toddfast/mutagen-cassandra

Mutagen is a lightweight framework for applying versioned changes (known as
mutations) to a resource, in this case a Cassandra schema. Mutagen takes
into account the resource's existing state and only applies changes that
haven't yet been applied.

Schema mutation with Mutagen helps you make manageable changes to the
schema of live Cassandra instances as you update your client software, and
is especially useful when used across development, test, staging, and
production environments to automatically keep schemas updated.

This is a minimal but functional initial release, and I appreciate bug
reports, suggestions and pull requests.

Best,
Todd
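
To make the idea concrete, here is a hedged sketch, in CQL3, of how a tool
like this typically tracks which mutations have already been applied. This
is not Mutagen's actual internal schema; the table and column names are
illustrative only.

-- One partition per change set; the tool reads the highest recorded
-- version, applies any bundled mutations above it, and records each one.
CREATE TABLE schema_version (
    change_set text,
    version    int,
    applied_at timestamp,
    PRIMARY KEY (change_set, version)
);

SELECT version FROM schema_version
WHERE change_set = 'myapp';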


Differences in row iteration behavior

2012-09-14 Thread Todd Fast

Hi--

We are iterating rows in a column family two different ways and are 
seeing radically different row counts. We are using 1.0.8 and 
RandomPartitioner on a 3-node cluster.


In the first case, we have a trivial Hadoop job that counts 29M rows 
using the standard MR pattern for counting (mapper outputs a single key 
with a value of 1, reducer adds up all the values).


In the second case, we have a simple Quartz batch job which counts only 
10M rows. We are iterating using chained calls to get_range_slices, as 
described on the wiki: http://wiki.apache.org/cassandra/FAQ#iter_world 
We've also implemented the batch job using Pelops, with and without 
chaining. In all cases, the job counts just 10M rows, and it is not 
encountering any errors.


We are confident that we are doing everything right in both cases (no 
bugs), yet the results are baffling. Tests in smaller, single-node 
environments result in consistent counts between the two methods, but 
we don't have the same amount of data nor the same topology.


Is the right answer 29M or 10M? Any clues to what we're seeing?

Todd


Keyspace missing on restart

2012-02-10 Thread Todd Fast
My single-node cluster was working fine yesterday. I ctrl+c'd it last 
night, as I typically do, and restarted it this morning.


Now, inexplicably, it doesn't know anything about my keyspace. The 
SSTable files are in the same directory as always and seem to be the 
expected size. I can't seem to do anything with nodetool, since the 
keyspace isn't known.


1. How can I recover the node?
2. What the heck happened that caused this?

Here is the console log (I'm on Windows 7):

Starting Cassandra Server
 INFO 09:33:41,803 Logging initialized
 INFO 09:33:41,809 JVM vendor/version: Java HotSpot(TM) Client VM/1.6.0_17
 INFO 09:33:41,809 Heap size: 1065484288/1065484288
 INFO 09:33:41,810 Classpath: ...
 INFO 09:33:41,815 JNA not found. Native methods will be disabled.
 INFO 09:33:41,826 Loading settings from 
file:/D:/Java/apache-cassandra-1.0.7/conf/cassandra.yaml
 INFO 09:33:41,930 DiskAccessMode 'auto' determined to be standard, 
indexAccessMode is standard

 INFO 09:33:41,939 Global memtable threshold is enabled at 338MB
 INFO 09:33:42,232 Opening 
\Data\cassandra\node1\system\LocationInfo-hc-1 (234 bytes)
 INFO 09:33:42,232 Opening 
\Data\cassandra\node1\system\LocationInfo-hc-2 (163 bytes)

 INFO 09:33:42,287 Couldn't detect any schema definitions in local storage.
 INFO 09:33:42,288 Found table data in data directories. Consider using 
the CLI to define your schema.
 INFO 09:33:42,307 Creating new commitlog segment 
/Data/cassandra/node1/commitlog\CommitLog-1328895222306.log
 INFO 09:33:42,316 Replaying 
\Data\cassandra\node1\commitlog\CommitLog-1328894913967.log
 INFO 09:33:42,356 Finished reading 
\Data\cassandra\node1\commitlog\CommitLog-1328894913967.log
 INFO 09:33:42,362 Enqueuing flush of Memtable-Versions@22744620(83/103 
serialized/live bytes, 3 ops)
 INFO 09:33:42,364 Writing Memtable-Versions@22744620(83/103 
serialized/live bytes, 3 ops)
 INFO 09:33:42,399 Completed flushing 
\Data\cassandra\node1\system\Versions-hc-1-Data.db (247 bytes)

 INFO 09:33:42,410 Log replay complete, 3 replayed mutations
 INFO 09:33:42,415 Cassandra version: 1.0.7
 INFO 09:33:42,416 Thrift API version: 19.20.0
 INFO 09:33:42,416 Loading persisted ring state
 INFO 09:33:42,420 Starting up server gossip
 INFO 09:33:42,429 Enqueuing flush of 
Memtable-LocationInfo@18721294(29/36 serialized/live bytes, 1 ops)
 INFO 09:33:42,430 Writing Memtable-LocationInfo@18721294(29/36 
serialized/live bytes, 1 ops)
 INFO 09:33:42,450 Completed flushing 
\Data\cassandra\node1\system\LocationInfo-hc-3-Data.db (80 bytes)

 INFO 09:33:42,459 Starting Messaging Service on port 7000
 INFO 09:33:42,469 Using saved token 
133677729504783243750441433892785690257
 INFO 09:33:42,471 Enqueuing flush of 
Memtable-LocationInfo@15427560(53/66 serialized/live bytes, 2 ops)
 INFO 09:33:42,471 Writing Memtable-LocationInfo@15427560(53/66 
serialized/live bytes, 2 ops)
 INFO 09:33:42,490 Completed flushing 
\Data\cassandra\node1\system\LocationInfo-hc-4-Data.db (163 bytes)

 INFO 09:33:42,494 Node localhost/127.0.0.1 state jump to normal
 INFO 09:33:42,511 Bootstrap/Replace/Move completed! Now serving reads.
 INFO 09:33:42,512 Will not load MX4J, mx4j-tools.jar is not in the 
classpath
 INFO 09:33:42,523 Compacting 
[SSTableReader(path='\Data\cassandra\node1\system\LocationInfo-hc-4-Data.db'), 
SSTableReader(path='\Data\cassandra\node1\system\LocationInfo-hc-2-Data.db'), 
SSTableReader(path='\Data\cassandra\node1\system\Loca
tionInfo-hc-3-Data.db'), 
SSTableReader(path='\Data\cassandra\node1\system\LocationInfo-hc-1-Data.db')]

 INFO 09:33:42,567 Binding thrift service to localhost/127.0.0.1:9160
 INFO 09:33:42,571 Using TFastFramedTransport with a max frame size of 
15728640 bytes.
 INFO 09:33:42,576 Using synchronous/threadpool thrift server on 
localhost/127.0.0.1 : 9160

 INFO 09:33:42,577 Listening for thrift clients...

Todd


Re: Keyspace missing on restart

2012-02-10 Thread Todd Fast
I found the problem; it was my fault. I made an accidental change to my 
cassandra.yaml file sometime between restarts and ended up pointing the 
node data directory to a different disk. Check your paths!


Todd


On 2/10/2012 10:22 AM, Todd Fast wrote:
My single-node cluster was working fine yesterday. I ctrl+c'd it last 
night, as I typically do, and restarted it this morning.




Delete doesn't remove row key?

2012-01-31 Thread Todd Fast

I added a row with a single column to my 1.0.8 single-node cluster:

RowKey: ----
= (column=test, value=hi, timestamp=...)

I immediately deleted the row using both the CLI and CQL:

del Foo[lexicaluuid('----')];
delete from Foo using consistency all where 
KEY=----


In either case, the column test is gone but the empty row key still 
remains, and the row count reflects the presence of this phantom row.


I've tried nodetool compact/repair/flush/cleanup/scrub/etc. and nothing 
removes the row key.


How do I get rid of it?

BTW, I saw this little tidbit in the describe output:

Row cache size / save period in seconds / keys to save : 0.0/0/all

Does all here mean to keep the keys for empty rows? If so, how do I 
change that behavior?


ColumnFamily: Foo
...
  Key Validation Class: org.apache.cassandra.db.marshal.UUIDType
  Default column value validator: 
org.apache.cassandra.db.marshal.UTF8Type

  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period in seconds / keys to save : 0.0/0/all
  Row Cache Provider: 
org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider

  Key cache size / save period in seconds: 20.0/14400
  GC grace seconds: 86400
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.1
  Replicate on write: true
  Bloom Filter FP chance: default
  Built indexes: []
  Compaction Strategy: 
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy


Todd


Re: Delete doesn't remove row key?

2012-01-31 Thread Todd Fast
First, thanks! I'd read that before, but didn't associate doing a range 
scan with using the CLI, much less doing select count(*) in CQL. Now I 
know what to call the phenomenon.


Second, a followup question: So the row keys will be deleted after 1) 
the GC grace period expires, and 2) I do a compaction?


Third: Assuming the answer is yes, is there any way to manually force GC 
of the deleted keys without doing the full GC shuffle (setting the GC 
grace period artificially low, restarting, compacting, setting grace 
period back to normal, restarting)?


Todd

On 1/31/2012 5:03 PM, Benjamin Hawkes-Lewis wrote:

On Wed, Feb 1, 2012 at 12:58 AM, Todd Fast t...@conga.com  wrote:

I added a row with a single column to my 1.0.8 single-node cluster:

RowKey: ----
=  (column=test, value=hi, timestamp=...)

I immediately deleted the row using both the CLI and CQL:

del Foo[lexicaluuid('----')];
delete from Foo using consistency all where
KEY=----

In either case, the column test is gone but the empty row key still
remains, and the row count reflects the presence of this phantom row.

I've tried nodetool compact/repair/flush/cleanup/scrub/etc. and nothing
removes the row key.

http://wiki.apache.org/cassandra/FAQ#range_ghosts

--
Benjamin Hawkes-Lewis
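
On the third question, a hedged sketch of the usual approach, shown in
modern CQL3 syntax (on a 1.0.x cluster the equivalent cassandra-cli
"update column family" command would be used; the Foo table and the 86400
default come from the describe output above). Schema changes take effect
without a restart, so the GC grace period can be lowered temporarily, the
column family compacted, and the setting restored. This is only advisable
on a single-node cluster like the one described here; with multiple nodes,
a short GC grace period risks deleted data reappearing if a replica misses
the tombstone.

-- Temporarily shorten the tombstone GC grace period:
ALTER TABLE Foo WITH gc_grace_seconds = 3600;
-- Then flush and major-compact (from the shell):
--   nodetool flush <keyspace> Foo
--   nodetool compact <keyspace> Foo
-- Finally restore the previous value:
ALTER TABLE Foo WITH gc_grace_seconds = 86400;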


Mixed random ordered partitioning?

2012-01-25 Thread Todd Fast
I want to do ranged row queries for a few of my column families, but 
best practice seems to be to use the random partitioner. Splitting my 
column families between two clusters (one random, one ordered) seems 
like a pretty expensive compromise.


Instead, I'm thinking of using the order-preserving partitioner in my 
cluster, but distributing load for most of my column families by hashing 
the row keys in my application code. Then, for the few column families 
whose rows I need to slice, I can just use unhashed keys.


What is the effective difference between hashing the keys myself and 
letting the random partitioner do it? Is this advisable?


Thanks,
Todd



Rename a column family on 1.0.x

2011-11-01 Thread Todd Fast
Any advice for renaming a column family in version 1.0.x? Like most of 
the docs, the FAQ 
(http://wiki.apache.org/cassandra/FAQ#modify_cf_config) is out of date.


Todd