Re: Pig not storing/loading Cassandra data properly

Dmitriy Ryaboy Thu, 29 Mar 2012 18:21:19 -0700

What happens when you run in MR mode instead of local mode?


On Thu, Mar 29, 2012 at 11:24 AM, Dan Feldman wrote:
> Hi,
>
> I'm loading a bunch of data into Pig using CassandraStorage. When I do a
> dump and/or store, the output contains only 2-3% of the data in the
> Cassandra database.
>
> My Cassandra data consists of (for now) 4-5 wide rows where each data entry
> is a super column ordered by TimeUUID.
>
> My script currently looks like this:
>
> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage() AS
> (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name,
> value)})});
> store rows into 'directory/test';
>
> The output that I get when I run the script looks like this (I highlighted
> the warnings):
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> *2012-03-29 11:10:58,063 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /directory/pig_1333044658058.log
> 2012-03-29 11:10:58,105 [main] INFO
> org.apache.pig.tools.parameters.PreprocessorContext - Executing command :
> date "+%y%m%d%H%M%S"
> 2012-03-29 11:10:58,268 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
> to hadoop file system at: file:///
> 2012-03-29 11:10:59,018 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: UNKNOWN
> 2012-03-29 11:10:59,182 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
> File concatenation threshold: 100 optimistic? false
> 2012-03-29 11:10:59,211 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size before optimization: 1
> 2012-03-29 11:10:59,211 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size after optimization: 1
> 2012-03-29 11:10:59,251 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added
> to the job
> 2012-03-29 11:10:59,269 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-03-29 11:10:59,292 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - Setting up single store job
> 2012-03-29 11:10:59,334 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map-reduce job(s) waiting for submission.
> 2012-03-29 11:10:59,361 [Thread-1] WARN  org.apache.hadoop.mapred.JobClient
> - No job jar file set.  User classes may not be found. See JobConf(Class)
> or JobConf#setJar(String).
> 2012-03-29 11:10:59,437 [Thread-1] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
> paths (combined) to process : 1
> 2012-03-29 11:10:59,836 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - HadoopJobId: job_local_0001
> 2012-03-29 11:10:59,836 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 0% complete
> 2012-03-29 11:11:01,185 [Thread-2] INFO  org.apache.hadoop.mapred.Task -
> Task:attempt_local_0001_m_000000_0 is done. And is in the process of
> commiting
> 2012-03-29 11:11:01,189 [Thread-2] INFO
> org.apache.hadoop.mapred.LocalJobRunner -
> 2012-03-29 11:11:01,189 [Thread-2] INFO  org.apache.hadoop.mapred.Task -
> Task attempt_local_0001_m_000000_0 is allowed to commit now
> 2012-03-29 11:11:01,192 [Thread-2] INFO
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output
> of task 'attempt_local_0001_m_000000_0' to file:/root/root/test
> 2012-03-29 11:11:02,714 [Thread-2] INFO
> org.apache.hadoop.mapred.LocalJobRunner -
> 2012-03-29 11:11:02,714 [Thread-2] INFO  org.apache.hadoop.mapred.Task -
> Task 'attempt_local_0001_m_000000_0' done.
> 2012-03-29 11:11:04,842 [main] WARN
> org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for
> job job_local_0001
> 2012-03-29 11:11:04,845 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 100% complete
> 2012-03-29 11:11:04,845 [main] INFO
> org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats
> reported below may be incomplete
> 2012-03-29 11:11:04,847 [main] INFO
> org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
> 0.20.203.0    0.9.3-SNAPSHOT    root    2012-03-29 11:10:59    2012-03-29
> 11:11:04    UNKNOWN
>
> Success!
>
> Job Stats (time in seconds):
> JobId    Alias    Feature    Outputs
> job_local_0001    rows    MAP_ONLY    file:///root/directory/test,
>
> Input(s):
> Successfully read records from: "cassandra://Keyspace/ColumnFamily"
>
> Output(s):
> Successfully stored records in: "file:///root/directory/test"
>
> Job DAG:
> job_local_0001
>
>
> 2012-03-29 11:11:04,849 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Success!*
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> Now, I don't know whether this is related to the problem, but I
> recently noticed that ILLUSTRATE dumps the data to the terminal before
> actually illustrating the schema. It outputs the same amount of data (about
> 2-3% of the total) as DUMP or STORE does.
>
> I'm using Pig 0.9.3 in local mode with Cassandra 1.0.8.
>
>
> P.S. I tried setting -Dpig.splitCombination=false as was suggested by Matt
> in
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Pig-not-reading-all-cassandra-data-td5982283.html,
> but it didn't help...
>
>
> Thanks for your help!
> Dan F.
