What happens when you run in MR mode instead of local mode?
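
To make the comparison concrete, here is a rough sketch of the two invocations. The script name test.pig is a placeholder I made up, and this assumes the plain pig launcher is on your PATH. If you start Pig through a Cassandra wrapper script, pass the same -x option through it. The log you quoted shows "file:///" and "job_local_0001", which is what local mode looks like.

```shell
#!/bin/sh
# Hypothetical script name; replace with the file that holds the LOAD/STORE statements.
SCRIPT="test.pig"

# "local" runs everything in a single JVM against the local file system;
# "mapreduce" submits the job to the Hadoop cluster Pig is configured for.
MODE="mapreduce"

CMD="pig -x $MODE $SCRIPT"
echo "$CMD"

# Only launch Pig when it is installed and the script file exists.
if command -v pig >/dev/null 2>&1 && [ -f "$SCRIPT" ]; then
    $CMD
fi
```

In mapreduce mode Pig has to be able to find your Hadoop configuration. The relative output path 'directory/test' will then most likely resolve against HDFS rather than the local disk.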

On Thu, Mar 29, 2012 at 11:24 AM, Dan Feldman <[email protected]> wrote:
> Hi,
>
> I'm loading a bunch of data into Pig using CassandraStorage. When I do a
> dump and/or store, the amount of data that is outputted is actually only
> 2-3% of the amount of data in Cassandra database.
>
> My Cassandra data consists of (for now) 4-5 wide rows where each data entry
> is a super column ordered by TimeUUID.
>
> So, my script now looks like
>
> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage() AS
> (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name,
> value)})});
> store rows into 'directory/test';
>
> The output that I get when I run the script looks like this (I highlighted
> the warnings):
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> *2012-03-29 11:10:58,063 [main] INFO org.apache.pig.Main - Logging error
> messages to: /directory/pig_1333044658058.log
> 2012-03-29 11:10:58,105 [main] INFO
> org.apache.pig.tools.parameters.PreprocessorContext - Executing command :
> date "+%y%m%d%H%M%S"
> 2012-03-29 11:10:58,268 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
> to hadoop file system at: file:///
> 2012-03-29 11:10:59,018 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> script: UNKNOWN
> 2012-03-29 11:10:59,182 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
> File concatenation threshold: 100 optimistic? false
> 2012-03-29 11:10:59,211 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size before optimization: 1
> 2012-03-29 11:10:59,211 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> - MR plan size after optimization: 1
> 2012-03-29 11:10:59,251 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added
> to the job
> 2012-03-29 11:10:59,269 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> 2012-03-29 11:10:59,292 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> - Setting up single store job
> 2012-03-29 11:10:59,334 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 1 map-reduce job(s) waiting for submission.
> 2012-03-29 11:10:59,361 [Thread-1] WARN org.apache.hadoop.mapred.JobClient
> - No job jar file set. User classes may not be found. See JobConf(Class)
> or JobConf#setJar(String).
> 2012-03-29 11:10:59,437 [Thread-1] INFO
> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
> paths (combined) to process : 1
> 2012-03-29 11:10:59,836 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - HadoopJobId: job_local_0001
> 2012-03-29 11:10:59,836 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 0% complete
> 2012-03-29 11:11:01,185 [Thread-2] INFO org.apache.hadoop.mapred.Task -
> Task:attempt_local_0001_m_000000_0 is done. And is in the process of
> commiting
> 2012-03-29 11:11:01,189 [Thread-2] INFO
> org.apache.hadoop.mapred.LocalJobRunner -
> 2012-03-29 11:11:01,189 [Thread-2] INFO org.apache.hadoop.mapred.Task -
> Task attempt_local_0001_m_000000_0 is allowed to commit now
> 2012-03-29 11:11:01,192 [Thread-2] INFO
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output
> of task 'attempt_local_0001_m_000000_0' to file:/root/root/test
> 2012-03-29 11:11:02,714 [Thread-2] INFO
> org.apache.hadoop.mapred.LocalJobRunner -
> 2012-03-29 11:11:02,714 [Thread-2] INFO org.apache.hadoop.mapred.Task -
> Task 'attempt_local_0001_m_000000_0' done.
> 2012-03-29 11:11:04,842 [main] WARN
> org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for
> job job_local_0001
> 2012-03-29 11:11:04,845 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - 100% complete
> 2012-03-29 11:11:04,845 [main] INFO
> org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats
> reported below may be incomplete
> 2012-03-29 11:11:04,847 [main] INFO
> org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>
> HadoopVersion PigVersion UserId StartedAt FinishedAt Features
> 0.20.203.0 0.9.3-SNAPSHOT root 2012-03-29 11:10:59 2012-03-29
> 11:11:04 UNKNOWN
>
> Success!
>
> Job Stats (time in seconds):
> JobId Alias Feature Outputs
> job_local_0001 rows MAP_ONLY file:///root/directory/test,
>
> Input(s):
> Successfully read records from: "cassandra://Keyspace/ColumnFamily"
>
> Output(s):
> Successfully stored records in: "file:///root/directory/test"
>
> Job DAG:
> job_local_0001
>
>
> 2012-03-29 11:11:04,849 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Success!*
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> Now, I don't know whether it's related or not to the problem, but I
> recently noticed that ILLUSTRATE dumps the data to the terminal before
> actually illustrating the schema. It outputs the same amount of data (about
> 2-3% of the total) as it would if I just ran DUMP or STORE.
>
> I'm using Pig 0.93 in local mode with Cassandra 1.0.8
>
>
> P.S. I tried setting -Dpig.splitCombination=false as was suggested by Matt
> in
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Pig-not-reading-all-cassandra-data-td5982283.html,
> but it didn't help...
>
>
> Thanks for your help!
> Dan F.
