Managed to get MR mode running by setting HADOOP_CLASSPATH in $HADOOP_HOME/conf/hadoop-env.sh and restarting Hadoop afterwards.
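For anyone hitting the same ClassNotFoundException, here is a minimal sketch of that kind of hadoop-env.sh entry. It assumes the Thrift jar lives in Cassandra's lib directory; the path and jar name are placeholders, not the actual values from this setup.

```shell
# Placeholder path (an assumption): point this at the libthrift jar
# that matches your Cassandra version.
THRIFT_JAR="/path/to/cassandra/lib/libthrift.jar"

# Prepend the jar to whatever HADOOP_CLASSPATH already holds,
# without leaving a trailing ':' when it was previously empty.
export HADOOP_CLASSPATH="${THRIFT_JAR}${HADOOP_CLASSPATH:+:${HADOOP_CLASSPATH}}"
```

The Hadoop daemons read this file only at startup, which is why the restart is needed before the new classpath takes effect.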
In any case, it seems that Pig continues to misbehave in both MR and local modes: counting rows returns 2 while we know there are 5 of them, ILLUSTRATE dumps recent data to grunt, and STORE only saves some subset of recent data.

On Mon, Apr 2, 2012 at 9:54 AM, Dmitriy Ryaboy <[email protected]> wrote:
> Looks like you don't have Thrift on your classpath, or the wrong
> version of thrift.
>
> Pig may be doing something weird with splits in local mode. It would
> be great if you could determine whether (assuming you fix the
> classpath) the problem happens in local mode only, or both in local
> and MR modes.
>
> D
>
> On Sun, Apr 1, 2012 at 9:49 PM, Dan Feldman <[email protected]> wrote:
> > Hi Dmitriy,
> >
> > Apologies for the delay - our server was misbehaving so it took a while to
> > get everything set up on a new one. In any case, we basically cloned
> > Cassandra from the old one to the new one - running Pig in local mode still
> > produces wrong number of results. Now, we never ran the scripts in MR mode,
> > so I don't know whether this is related to the original problem or not, but
> > this is the error I get when running on top of hadoop:
> >
> > ===============================================================================
> > ....
> > *2012-04-01 21:32:38,781 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201203301228_0003 has failed! Stop running all dependent jobs
> > 2012-04-01 21:32:38,782 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> > 2012-04-01 21:32:38,790 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: java.lang.ClassNotFoundException: org.apache.thrift.TException
> > 2012-04-01 21:32:38,790 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> > 2012-04-01 21:32:38,791 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics*:
> > ....
> > ================================================================================
> >
> > Thanks,
> > Dan F
> >
> > On Thu, Mar 29, 2012 at 6:20 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> >> What happens when you run in MR mode instead of local mode?
> >>
> >> On Thu, Mar 29, 2012 at 11:24 AM, Dan Feldman <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > I'm loading a bunch of data into Pig using CassandraStorage. When I do a
> >> > dump and/or store, the amount of data that is outputted is actually only
> >> > 2-3% of the amount of data in Cassandra database.
> >> >
> >> > My Cassandra data consists of (for now) 4-5 wide rows where each data entry
> >> > is a super column ordered by TimeUUID.
> >> >
> >> > So, my script now looks like
> >> >
> >> > rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage() AS
> >> > (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name, value)})});
> >> > store rows into 'directory/test';
> >> >
> >> > The output that I get when I run the script looks like this (I highlighted
> >> > the warnings):
> >> >
> >> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> > *2012-03-29 11:10:58,063 [main] INFO org.apache.pig.Main - Logging error messages to: /directory/pig_1333044658058.log
> >> > 2012-03-29 11:10:58,105 [main] INFO org.apache.pig.tools.parameters.PreprocessorContext - Executing command : date "+%y%m%d%H%M%S"
> >> > 2012-03-29 11:10:58,268 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
> >> > 2012-03-29 11:10:59,018 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
> >> > 2012-03-29 11:10:59,182 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> >> > 2012-03-29 11:10:59,211 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> >> > 2012-03-29 11:10:59,211 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> >> > 2012-03-29 11:10:59,251 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
> >> > 2012-03-29 11:10:59,269 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
> >> > 2012-03-29 11:10:59,292 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
> >> > 2012-03-29 11:10:59,334 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
> >> > 2012-03-29 11:10:59,361 [Thread-1] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> >> > 2012-03-29 11:10:59,437 [Thread-1] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
> >> > 2012-03-29 11:10:59,836 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
> >> > 2012-03-29 11:10:59,836 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
> >> > 2012-03-29 11:11:01,185 [Thread-2] INFO org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
> >> > 2012-03-29 11:11:01,189 [Thread-2] INFO org.apache.hadoop.mapred.LocalJobRunner -
> >> > 2012-03-29 11:11:01,189 [Thread-2] INFO org.apache.hadoop.mapred.Task - Task attempt_local_0001_m_000000_0 is allowed to commit now
> >> > 2012-03-29 11:11:01,192 [Thread-2] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local_0001_m_000000_0' to file:/root/root/test
> >> > 2012-03-29 11:11:02,714 [Thread-2] INFO org.apache.hadoop.mapred.LocalJobRunner -
> >> > 2012-03-29 11:11:02,714 [Thread-2] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_000000_0' done.
> >> > 2012-03-29 11:11:04,842 [main] WARN org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for job job_local_0001
> >> > 2012-03-29 11:11:04,845 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> >> > 2012-03-29 11:11:04,845 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
> >> > 2012-03-29 11:11:04,847 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
> >> >
> >> > HadoopVersion PigVersion UserId StartedAt FinishedAt Features
> >> > 0.20.203.0 0.9.3-SNAPSHOT root 2012-03-29 11:10:59 2012-03-29 11:11:04 UNKNOWN
> >> >
> >> > Success!
> >> >
> >> > Job Stats (time in seconds):
> >> > JobId Alias Feature Outputs
> >> > job_local_0001 rows MAP_ONLY file:///root/directory/test,
> >> >
> >> > Input(s):
> >> > Successfully read records from: "cassandra://Keyspace/ColumnFamily"
> >> >
> >> > Output(s):
> >> > Successfully stored records in: "file:///root/directory/test"
> >> >
> >> > Job DAG:
> >> > job_local_0001
> >> >
> >> > 2012-03-29 11:11:04,849 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!*
> >> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> >
> >> > Now, I don't know whether it's related or not to the problem, but I
> >> > recently noticed that ILLUSTRATE dumps the data to the terminal before
> >> > actually illustrating the schema. It outputs the same amount of data (about
> >> > 2-3% of the total) as it would if I just ran DUMP or STORE.
> >> >
> >> > I'm using Pig 0.93 in local mode with Cassandra 1.0.8
> >> >
> >> > P.S. I tried setting -Dpig.splitCombination=false as was suggested by Matt in
> >> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Pig-not-reading-all-cassandra-data-td5982283.html,
> >> > but it didn't help...
> >> >
> >> > Thanks for your help!
> >> > Dan F.
> >> >
