Re: Pig not storing/loading Cassandra data properly

Dan Feldman Mon, 02 Apr 2012 14:12:04 -0700

Managed to get MR mode running by specifying HADOOP_CLASSPATH in
$HADOOP_HOME/conf/hadoop-env.sh and restarting Hadoop after that.
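
For anyone hitting the same ClassNotFoundException, here is a minimal sketch of the kind of lines that could go into hadoop-env.sh. The original message does not say which jars were added, so the CASSANDRA_HOME variable, the /opt/cassandra default and the helper name add_jars_to_hadoop_classpath are all assumptions for illustration; Cassandra's lib directory is simply where the Thrift jar usually sits.

```shell
# Hypothetical helper: append every jar in a directory to HADOOP_CLASSPATH.
add_jars_to_hadoop_classpath() {
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue   # skip the unexpanded glob when no jars exist
    # Prepend a ':' separator only when the variable is already non-empty.
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$jar"
  done
  export HADOOP_CLASSPATH
}

# Assumed install location; point this at wherever the Thrift jar lives.
add_jars_to_hadoop_classpath "${CASSANDRA_HOME:-/opt/cassandra}/lib"
```

Hadoop reads hadoop-env.sh at daemon start-up, which is why the restart is needed before the change takes effect.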


In any case, it seems that Pig continues to misbehave in both MR and local
modes: counting rows produces 2 results while we know there are 5 of them,
ILLUSTRATE dumps recent data to grunt, and STORE only saves some subset
of recent data.
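
One way to double-check the row count independently of Pig's own statistics is to count lines in the stored output, since the default PigStorage writes one tuple per line. This is only a sketch: the function name count_stored_rows is made up here, and it assumes the output went to the local filesystem as part-* files.

```shell
# Hypothetical helper: count tuples written by STORE with default PigStorage.
# $1 is the output directory; every part-* file holds one tuple per line.
count_stored_rows() {
  total=0
  for part in "$1"/part-*; do
    [ -f "$part" ] || continue            # no part files: leave total at 0
    n=$(awk 'END { print NR }' "$part")   # number of lines in this part file
    total=$((total + n))
  done
  echo "$total"
}
```

Running `count_stored_rows directory/test` after the STORE should print the number of row keys that made it to disk, which can then be compared with the 5 rows expected from Cassandra. For output on HDFS, the files would first have to be fetched (for example with `hadoop fs -getmerge`).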

On Mon, Apr 2, 2012 at 9:54 AM, Dmitriy Ryaboy <[email protected]> wrote:

> Looks like you don't have Thrift on your classpath, or the wrong
> version of thrift.
>
> Pig may be doing something weird with splits in local mode. It would
> be great if you could determine whether (assuming you fix the
> classpath) the problem happens in local mode only, or both in local
> and MR modes.
>
> D
>
> On Sun, Apr 1, 2012 at 9:49 PM, Dan Feldman <[email protected]> wrote:
> > Hi Dmitriy,
> >
> > Apologies for the delay - our server was misbehaving so it took a while to
> > get everything set up on a new one. In any case, we basically cloned
> > Cassandra from the old one to the new one - running Pig in local mode still
> > produces the wrong number of results. Now, we never ran the scripts in MR mode,
> > so I don't know whether this is related to the original problem or not, but
> > this is the error I get when running on top of Hadoop:
> >
> >
> > ===============================================================================
> > ....
> > 2012-04-01 21:32:38,781 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201203301228_0003 has failed! Stop running all dependent jobs
> > 2012-04-01 21:32:38,782 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
> > 2012-04-01 21:32:38,790 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: java.lang.ClassNotFoundException: org.apache.thrift.TException
> > 2012-04-01 21:32:38,790 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
> > 2012-04-01 21:32:38,791 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
> > ....
> > ===============================================================================
> >
> > Thanks,
> > Dan F
> >
> >
> > On Thu, Mar 29, 2012 at 6:20 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> >> What happens when you run in MR mode instead of local mode?
> >>
> >> On Thu, Mar 29, 2012 at 11:24 AM, Dan Feldman <[email protected]>
> >> wrote:
> >> > Hi,
> >> >
> >> > I'm loading a bunch of data into Pig using CassandraStorage. When I do a
> >> > dump and/or store, the amount of data that is output is actually only
> >> > 2-3% of the amount of data in the Cassandra database.
> >> >
> >> > My Cassandra data consists of (for now) 4-5 wide rows where each data entry
> >> > is a super column ordered by TimeUUID.
> >> >
> >> > So, my script now looks like
> >> >
> >> > rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage()
> >> >     AS (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name, value)})});
> >> > store rows into 'directory/test';
> >> >
> >> > The output that I get when I run the script looks like this (I highlighted
> >> > the warnings):
> >> >
> >> >
> >>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> > *2012-03-29 11:10:58,063 [main] INFO  org.apache.pig.Main - Logging
> error
> >> > messages to: /directory/pig_1333044658058.log
> >> > 2012-03-29 11:10:58,105 [main] INFO
> >> > org.apache.pig.tools.parameters.PreprocessorContext - Executing
> command :
> >> > date "+%y%m%d%H%M%S"
> >> > 2012-03-29 11:10:58,268 [main] INFO
> >> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> >> Connecting
> >> > to hadoop file system at: file:///
> >> > 2012-03-29 11:10:59,018 [main] INFO
> >> > org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
> >> > script: UNKNOWN
> >> > 2012-03-29 11:10:59,182 [main] INFO
> >> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
> >> > File concatenation threshold: 100 optimistic? false
> >> > 2012-03-29 11:10:59,211 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> >> > - MR plan size before optimization: 1
> >> > 2012-03-29 11:10:59,211 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
> >> > - MR plan size after optimization: 1
> >> > 2012-03-29 11:10:59,251 [main] INFO
> >> > org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
> added
> >> > to the job
> >> > 2012-03-29 11:10:59,269 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> >> > - mapred.job.reduce.markreset.buffer.percent is not set, set to
> default
> >> 0.3
> >> > 2012-03-29 11:10:59,292 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
> >> > - Setting up single store job
> >> > 2012-03-29 11:10:59,334 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> > - 1 map-reduce job(s) waiting for submission.
> >> > 2012-03-29 11:10:59,361 [Thread-1] WARN
> >>  org.apache.hadoop.mapred.JobClient
> >> > - No job jar file set.  User classes may not be found. See
> JobConf(Class)
> >> > or JobConf#setJar(String).
> >> > 2012-03-29 11:10:59,437 [Thread-1] INFO
> >> > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> >> input
> >> > paths (combined) to process : 1
> >> > 2012-03-29 11:10:59,836 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> > - HadoopJobId: job_local_0001
> >> > 2012-03-29 11:10:59,836 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> > - 0% complete
> >> > 2012-03-29 11:11:01,185 [Thread-2] INFO
>  org.apache.hadoop.mapred.Task -
> >> > Task:attempt_local_0001_m_000000_0 is done. And is in the process of
> >> > commiting
> >> > 2012-03-29 11:11:01,189 [Thread-2] INFO
> >> > org.apache.hadoop.mapred.LocalJobRunner -
> >> > 2012-03-29 11:11:01,189 [Thread-2] INFO
>  org.apache.hadoop.mapred.Task -
> >> > Task attempt_local_0001_m_000000_0 is allowed to commit now
> >> > 2012-03-29 11:11:01,192 [Thread-2] INFO
> >> > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved
> output
> >> > of task 'attempt_local_0001_m_000000_0' to file:/root/root/test
> >> > 2012-03-29 11:11:02,714 [Thread-2] INFO
> >> > org.apache.hadoop.mapred.LocalJobRunner -
> >> > 2012-03-29 11:11:02,714 [Thread-2] INFO
>  org.apache.hadoop.mapred.Task -
> >> > Task 'attempt_local_0001_m_000000_0' done.
> >> > 2012-03-29 11:11:04,842 [main] WARN
> >> > org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob
> for
> >> > job job_local_0001
> >> > 2012-03-29 11:11:04,845 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> > - 100% complete
> >> > 2012-03-29 11:11:04,845 [main] INFO
> >> > org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode.
> Stats
> >> > reported below may be incomplete
> >> > 2012-03-29 11:11:04,847 [main] INFO
> >> > org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
> >> >
> >> > HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt
> >>  Features
> >> > 0.20.203.0    0.9.3-SNAPSHOT    root    2012-03-29 11:10:59
>  2012-03-29
> >> > 11:11:04    UNKNOWN
> >> >
> >> > Success!
> >> >
> >> > Job Stats (time in seconds):
> >> > JobId    Alias    Feature    Outputs
> >> > job_local_0001    rows    MAP_ONLY    file:///root/directory/test,
> >> >
> >> > Input(s):
> >> > Successfully read records from: "cassandra://Keyspace/ColumnFamily"
> >> >
> >> > Output(s):
> >> > Successfully stored records in: "file:///root/directory/test"
> >> >
> >> > Job DAG:
> >> > job_local_0001
> >> >
> >> >
> >> > 2012-03-29 11:11:04,849 [main] INFO
> >> >
> >>
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> >> > - Success!*
> >> >
> >> >
> >>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >> >
> >> >
> >> > Now, I don't know whether it's related or not to the problem, but I
> >> > recently noticed that ILLUSTRATE dumps the data to the terminal before
> >> > actually illustrating the schema. It outputs the same amount of data (about
> >> > 2-3% of the total) as it would if I just ran DUMP or STORE.
> >> >
> >> > I'm using Pig 0.9.3 (snapshot build) in local mode with Cassandra 1.0.8
> >> >
> >> >
> >> > P.S. I tried setting -Dpig.splitCombination=false as was suggested by Matt in
> >> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Pig-not-reading-all-cassandra-data-td5982283.html,
> >> > but it didn't help...
> >> >
> >> >
> >> > Thanks for your help!
> >> > Dan F.
> >>
>
