Hmmm, for wide rows, I believe you can page through the columns using some changes that recently went into the 0.7 branch as part of https://issues.apache.org/jira/browse/CASSANDRA-1618. Specifically, with the 0.7 branch version of CassandraStorage, you can specify the column slice using this basic template:

cassandra://<keyspace>/<columnfamily>[?slice_start=<start>&slice_end=<end>[&reversed=true][&limit=1]]

That goes in your Pig LOAD statement. So I would imagine it's a pain to do what you're doing, but paging is possible with the latest on the 0.7 branch.
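
As a rough sketch of how that template might be filled in for the counting script from earlier in the thread (untested; the slice bounds 'a' and 'm' and the limit of 4096 are placeholder values I made up, and how the bounds are interpreted presumably depends on the column family's comparator):

```pig
-- Sketch only: assumes the 0.7 branch CassandraStorage with the CASSANDRA-1618 changes.
-- 'Keyspace' and 'ColumnFamily' are the names used earlier in this thread;
-- the slice bounds (a, m) and limit (4096) are hypothetical placeholders.
rows = LOAD 'cassandra://Keyspace/ColumnFamily?slice_start=a&slice_end=m&limit=4096' USING CassandraStorage();

-- Same counting step as the original script, now over one slice of each row.
rowct = FOREACH rows GENERATE $0, COUNT($1);
DUMP rowct;
```

To page, you would presumably repeat the LOAD with slice_start moved to where the previous slice left off and add up the per-slice counts for each row.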

On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote:

> It looks like this functionality is not in the 0.7.3 version of
> CassandraStorage. I tried to add the constructor which takes the limit to the
> class, but I ran into some Pig parsing errors, so I had to make the parameter
> a string. How did you get around this for the version of CassandraStorage in
> trunk? I'm running Pig 0.8.0.
>
> Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra
> starts eating up huge amounts of memory, maxing out my 16GB heap size. I
> suspect this is because of the get_range_slices() call from
> ColumnFamilyRecordReader. Are there plans to make this streaming/paged?
>
> -Jeffrey
>
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
> Sent: Thursday, March 24, 2011 11:34 AM
> To: user@cassandra.apache.org
> Subject: Re: pig counting question
>
> The limit defaults to 1024 but you can set it when you use CassandraStorage
> in pig, like so:
> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096);
> or whatever value you wish.
>
> Give that a try and see if it gives you more of what you're looking for.
>
> On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote:
>
>> Hey all,
>>
>> I'm trying to run a very simple Pig script against my Cassandra cluster (5
>> nodes, 0.7.3). I've gotten it all set up and working, but the script is
>> giving me some strange results. Here is my script:
>>
>> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage();
>> rowct = FOREACH rows GENERATE $0, COUNT($1);
>> dump rowct;
>>
>> If I understand Pig correctly, this should output (row name, column count)
>> tuples, but I'm always seeing 1024 for the column count even though the rows
>> have highly variable number of columns. Am I missing something? Thanks.
>>
>> -Jeffrey