Well, I suppose I should say how we "solved" the problem (but not really)
in case someone runs into a similar issue in the future...
Because writing a UDF wasn't really an option and there wasn't much data
in Cassandra anyway, we decided to generate our own time-based unique ids
and use them as (super)column names in Cassandra, stored in UTF8Type format:
before each store, our server running Ruby takes the current timestamp in
YYMMDDHHMMSS format and appends a hyphen plus a randomly generated
8-character string, so that each super column name looks something like this
120326134516-DLFPASDF
and is pretty much guaranteed to be unique.
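For anyone who wants to try the same naming scheme, here is a minimal sketch of how such a name could be built in Ruby. The helper name timestamped_name and the choice of uppercase letters for the suffix are assumptions made for illustration, not our actual server code.

```ruby
require 'securerandom'

# Hypothetical helper: builds a name such as "120326134516-DLFPASDF".
# Assumption: the random suffix is 8 uppercase ASCII letters.
def timestamped_name(now = Time.now)
  letters = ('A'..'Z').to_a
  suffix = Array.new(8) { letters[SecureRandom.random_number(letters.size)] }.join
  # %y%m%d%H%M%S gives the 12-digit YYMMDDHHMMSS prefix.
  "#{now.strftime('%y%m%d%H%M%S')}-#{suffix}"
end

puts timestamped_name
```

The timestamp only has one-second resolution, so uniqueness within the same second rests entirely on the random suffix.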
Since the name is a string whose sort order follows the time it was written,
I can then slice the columns when importing the last hour's data into Pig
for analysis every hour or so:
%DECLARE now `date`;
%DECLARE S `date -d "$now - 1 hour" "+%y%m%d%H%M%S-00000000"`;
data = LOAD 'cassandra://Keyspace/ColumnFamily?slice_start=$S&comparator=AsciiType'
    USING CassandraStorage()
    AS (key, columns: bag{(t:chararray, subcolumns: bag{(name, value)})});
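The slice works because the fixed-width names compare as plain strings in the same order as their timestamps. The sketch below illustrates that reasoning in Ruby, assuming the bound uses the same two-digit-year layout as the stored names and that names are compared byte by byte; the helper slice_start and the sample names are made up for the example.

```ruby
# Hypothetical helper: lower bound for names written during the last hour.
# "-00000000" sorts before any letter suffix, since '0' precedes 'A' in ASCII.
def slice_start(now = Time.now)
  (now - 3600).strftime('%y%m%d%H%M%S') + '-00000000'
end

now   = Time.utc(2012, 3, 26, 13, 45, 16)
bound = slice_start(now) # "120326124516-00000000"

# Made-up sample names in the stored format.
names = ['120326110000-QWERTYUI', '120326130102-ABCDEFGH', '120326134516-DLFPASDF']

# Plain string comparison keeps only the names at or after the bound.
recent = names.select { |n| n >= bound }
puts recent
```

Only the last two sample names are at or after the bound, which is the same selection the slice_start parameter is meant to make on the Cassandra side.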
Clearly, this is an ugly, ugly "fix": as traffic on our website grows, I'm
pretty sure our Rails server won't be able to handle concurrent calls to
`date` as well as Cassandra could... BUT, as it turns out, in a production
environment an ugly fix now, as opposed to a better fix later, is the only
way to go.
Dan F.
On Fri, Mar 23, 2012 at 6:24 PM, Dan Feldman <[email protected]> wrote:
> Hi everyone,
>
> I have a Cassandra SCF where each super column has a name which is
> dynamically assigned as TimeUUID at the time that that super column was
> inserted into the database:
>
> create column family CF
> with key_validation_class = UTF8Type
> and comparator = TimeUUIDType
> and subcomparator = UTF8Type
> and column_type = 'Super';
>
> Now, I'm trying to write a Pig script that would automatically calculate
> the number of new super columns added to the database during specified
> period of time (let's say, in the last hour). For that, I thought it would
> be nice to be able to do something along the lines of:
>
> last_hour_data = LOAD
> 'cassandra://Keyspace/ColumnFamily&slice_start=Time(one hour
> ago)&slice_end=Time(now)' USING CassandraStorage()...
>
> However,
> 1) I'm not sure what that "Time(one hour ago)" and "Time(now)" syntax is
> (so that it would translate those times into TimeUUIDs that cassandra
> understands) and
> 2) The LOAD line above that I took from the bottom of
> http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/README.txt
> produces an error thinking that 'CF&slice_start...' is one gigantic column
> family name (which of course does not exist).
>
>
> Alternatively, I could try generating my specified range of columns in Pig
> after loading the whole database. But looking at the data, the super column
> names look like 'S.?,uF? ?B#q' or ' ??VuI??-gFd?' instead of
> "normal-looking" UUIDs like '275564bc4f52f81573b4cfe0ea615ae0', even when I
> try to load the super column names as chararrays. I'm thinking it's because
> the latter representation of UUID differs from its string representation,
> but is there a way to load it into Pig the "normal-looking" way?
>
>
> Thank you in advance for your time!
> Dan F.
>
>