Another tip:
If you parameterize your load statements, it becomes easy to switch
between loading from something like Cassandra and reading from HDFS
or the local fs directly.
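
For example, with Pig's parameter substitution (the path and loader
class below are placeholders -- pass whichever backend you want at
invocation time, e.g. pig -param INPUT=... -param LOADER=... script.pig):

-- e.g. INPUT=cassandra://MyKeyspace/MyCF  LOADER='CassandraStorage()'
-- or   INPUT=/data/mycf                   LOADER='PigStorage()'
raw = LOAD '$INPUT' USING $LOADER;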

Also:
Try using Pig's ILLUSTRATE command when working through your flows
-- it does some clever things that go well beyond simple random
sampling of the source data to make sure the sample is actually
useful: you can see the effects of your filters, joins get (possibly
artificial) matching keys even if the sample didn't happen to produce
any, etc.
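
For example (relation and field names made up):

grunt> users  = LOAD 'users'  AS (id:int, name:chararray);
grunt> events = LOAD 'events' AS (user_id:int, action:chararray);
grunt> joined = JOIN users BY id, events BY user_id;
grunt> ILLUSTRATE joined;

Even if a plain random sample of events had no rows matching the
sampled users, illustrate will synthesize matching keys so you can
see the join actually doing something.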

D

On Wed, Jun 15, 2011 at 10:35 AM, Jeremy Hanna
<[email protected]> wrote:
> We started doing this recently and thought it might be useful to others.
>
> Pig (and Hive) has a sample operation that allows you to sample data from 
> your data store.
>
> In Pig it looks something like this:
> mysample = SAMPLE myrelation 0.01;
>
> One possible use for this, with Pig and Cassandra, is to solve the conundrum of 
> testing locally.  We'd wondered how to do this, so we decided to sample 
> a column family (or set of CFs), store the sample into HDFS (or CFS), download 
> it locally, then import it into your local Cassandra node.  That gives you real 
> data to test against with pig/hive or for other purposes.
>
> That way, when you're flying out to the Hadoop Summit or the Cassandra SF 
> event, you can play with real data :).
>
> Maybe others have been doing this for years, but if not, we're finding it 
> handy.
>
> Jeremy
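
(For anyone following along, the sample-and-store step Jeremy describes
might look roughly like this -- the keyspace, column family, and output
path are placeholders, and it assumes the Cassandra pig jars are
registered:)

raw = LOAD 'cassandra://MyKeyspace/MyCF' USING CassandraStorage();
mysample = SAMPLE raw 0.01;
STORE mysample INTO '/samples/mycf' USING PigStorage();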
