Still no success in figuring out how to make Pig load and/or store all of the data in the Cassandra db. Maybe I just need to tweak some Pig config settings somewhere? But I don't really know where (the files in conf/ don't seem to be doing anything, and I can't really find other info online). I've included the messages I get when running the script. Just to restate the problem: when I load and then immediately dump/store data from a Cassandra row, I get only about 1500 columns, while there are about 7500 of them in the actual db.
Thanks!

PIG MESSAGES (this is for a slightly different script, but the problem is the same):

2012-04-06 17:27:14,689 [main] INFO org.apache.pig.tools.parameters.PreprocessorContext - Executing command : date "+%y%m%d%H%M%S"
2012-04-06 17:27:14,907 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
2012-04-06 17:27:15,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
2012-04-06 17:27:16,061 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
2012-04-06 17:27:16,287 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-04-06 17:27:16,321 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-04-06 17:27:16,321 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-04-06 17:27:16,412 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-04-06 17:27:16,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-04-06 17:27:16,430 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job3074512946558271384.jar
2012-04-06 17:27:25,849 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job3074512946558271384.jar created
2012-04-06 17:27:25,869 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-04-06 17:27:25,928 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-04-06 17:27:26,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-04-06 17:27:26,588 [Thread-4] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-04-06 17:27:27,371 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201204061725_0001
2012-04-06 17:27:27,371 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201204061725_0001
2012-04-06 17:27:46,992 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-04-06 17:27:57,079 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-04-06 17:27:57,081 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
1.0.1          0.9.3-SNAPSHOT  root    2012-04-06 17:27:16  2012-04-06 17:27:57  FILTER

Success!

Job Stats (time in seconds):
JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
job_201204061725_0001  1  0  9  9  9  0  0  0  cols,filtered,filtered_rows,rows,super_cols  MAP_ONLY  hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144,

Input(s):
Successfully read 4 records (397 bytes) from: "cassandra://KeySpace/ColumnFamily"

Output(s):
Successfully stored 271 records (31076 bytes) in: "hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144"

Counters:
Total records written : 271
Total bytes written : 31076
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201204061725_0001

2012-04-06 17:27:57,092 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
****hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144
2012-04-06 17:27:57,109 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-04-06 17:27:57,109 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
...

On Mon, Apr 2, 2012 at 2:11 PM, Dan Feldman wrote:
> Managed to get MR mode running by specifying HADOOP_CLASSPATH in $HADOOP_HOME/conf/hadoop-env.sh and restarting hadoop after that.
>
> In any case, it seems that Pig continues to misbehave both in MR and local modes: counting rows produces 2 results while we know there are 5 of them, ILLUSTRATING dumps recent data to grunt, and STORING only saves some subset of recent data.
>
> On Mon, Apr 2, 2012 at 9:54 AM, Dmitriy Ryaboy wrote:
>
>> Looks like you don't have Thrift on your classpath, or the wrong version of thrift.
>>
>> Pig may be doing something weird with splits in local mode.
>> It would be great if you could determine whether (assuming you fix the classpath) the problem happens in local mode only, or both in local and MR modes.
>>
>> D
>>
>> On Sun, Apr 1, 2012 at 9:49 PM, Dan Feldman wrote:
>> > Hi Dmitriy,
>> >
>> > Apologies for the delay - our server was misbehaving so it took a while to get everything set up on a new one. In any case, we basically cloned Cassandra from the old one to the new one - running Pig in local mode still produces wrong number of results. Now, we never ran the scripts in MR mode, so I don't know whether this is related to the original problem or not, but this is the error I get when running on top of hadoop:
>> >
>> > ===============================================================================
>> > ....
>> > *2012-04-01 21:32:38,781 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201203301228_0003 has failed! Stop running all dependent jobs
>> > 2012-04-01 21:32:38,782 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>> > 2012-04-01 21:32:38,790 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: java.lang.ClassNotFoundException: org.apache.thrift.TException
>> > 2012-04-01 21:32:38,790 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
>> > 2012-04-01 21:32:38,791 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics*:
>> > ....
>> > ================================================================================
>> >
>> > Thanks,
>> > Dan F
>> >
>> > On Thu, Mar 29, 2012 at 6:20 PM, Dmitriy Ryaboy wrote:
>> >
>> >> What happens when you run in MR mode instead of local mode?
>> >>
>> >> On Thu, Mar 29, 2012 at 11:24 AM, Dan Feldman wrote:
>> >> > Hi,
>> >> >
>> >> > I'm loading a bunch of data into Pig using CassandraStorage. When I do a dump and/or store, the amount of data that is outputted is actually only 2-3% of the amount of data in the Cassandra database.
>> >> >
>> >> > My Cassandra data consists of (for now) 4-5 wide rows where each data entry is a super column ordered by TimeUUID.
>> >> >
>> >> > So, my script now looks like
>> >> >
>> >> > rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage() AS (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name, value)})});
>> >> > store rows into 'directory/test';
>> >> >
>> >> > The output that I get when I run the script looks like this (I highlighted the warnings):
>> >> >
>> >> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> > *2012-03-29 11:10:58,063 [main] INFO org.apache.pig.Main - Logging error messages to: /directory/pig_1333044658058.log
>> >> > 2012-03-29 11:10:58,105 [main] INFO org.apache.pig.tools.parameters.PreprocessorContext - Executing command : date "+%y%m%d%H%M%S"
>> >> > 2012-03-29 11:10:58,268 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
>> >> > 2012-03-29 11:10:59,018 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
>> >> > 2012-03-29 11:10:59,182 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
>> >> > 2012-03-29 11:10:59,211 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>> >> > 2012-03-29 11:10:59,211 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>> >> > 2012-03-29 11:10:59,251 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
>> >> > 2012-03-29 11:10:59,269 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
>> >> > 2012-03-29 11:10:59,292 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
>> >> > 2012-03-29 11:10:59,334 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
>> >> > 2012-03-29 11:10:59,361 [Thread-1] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> >> > 2012-03-29 11:10:59,437 [Thread-1] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
>> >> > 2012-03-29 11:10:59,836 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
>> >> > 2012-03-29 11:10:59,836 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>> >> > 2012-03-29 11:11:01,185 [Thread-2] INFO org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
>> >> > 2012-03-29 11:11:01,189 [Thread-2] INFO org.apache.hadoop.mapred.LocalJobRunner -
>> >> > 2012-03-29 11:11:01,189 [Thread-2] INFO org.apache.hadoop.mapred.Task - Task attempt_local_0001_m_000000_0 is allowed to commit now
>> >> > 2012-03-29 11:11:01,192 [Thread-2] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local_0001_m_000000_0' to file:/root/root/test
>> >> > 2012-03-29 11:11:02,714 [Thread-2] INFO org.apache.hadoop.mapred.LocalJobRunner -
>> >> > 2012-03-29 11:11:02,714 [Thread-2] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_000000_0' done.
>> >> > 2012-03-29 11:11:04,842 [main] WARN org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for job job_local_0001
>> >> > 2012-03-29 11:11:04,845 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>> >> > 2012-03-29 11:11:04,845 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode. Stats reported below may be incomplete
>> >> > 2012-03-29 11:11:04,847 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>> >> >
>> >> > HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
>> >> > 0.20.203.0     0.9.3-SNAPSHOT  root    2012-03-29 11:10:59  2012-03-29 11:11:04  UNKNOWN
>> >> >
>> >> > Success!
>> >> >
>> >> > Job Stats (time in seconds):
>> >> > JobId  Alias  Feature  Outputs
>> >> > job_local_0001  rows  MAP_ONLY  file:///root/directory/test,
>> >> >
>> >> > Input(s):
>> >> > Successfully read records from: "cassandra://Keyspace/ColumnFamily"
>> >> >
>> >> > Output(s):
>> >> > Successfully stored records in: "file:///root/directory/test"
>> >> >
>> >> > Job DAG:
>> >> > job_local_0001
>> >> >
>> >> > 2012-03-29 11:11:04,849 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!*
>> >> > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> >
>> >> > Now, I don't know whether it's related to the problem or not, but I recently noticed that ILLUSTRATE dumps the data to the terminal before actually illustrating the schema. It outputs the same amount of data (about 2-3% of the total) as it would if I just ran DUMP or STORE.
>> >> >
>> >> > I'm using Pig 0.9.3 in local mode with Cassandra 1.0.8.
>> >> >
>> >> > P.S. I tried setting -Dpig.splitCombination=false as was suggested by Matt in http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Pig-not-reading-all-cassandra-data-td5982283.html, but it didn't help...
>> >> >
>> >> > Thanks for your help!
>> >> > Dan F.
