Re: Pig not storing/loading Cassandra data properly

Dan Feldman Fri, 06 Apr 2012 18:23:36 -0700

Still no success in figuring out how to make Pig load and/or store all of
the data in the Cassandra db. Maybe I just need to tweak some Pig config
settings somewhere? But I don't really know where (the files in conf/ don't
seem to be doing anything, and I can't really find other info online). I've
included the messages I get when running. Just to restate the problem: when
I load and then immediately dump/store data from a Cassandra row, I get about
1500 columns while there are about 7500 of them in the actual db.
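
In case it helps anyone reproduce this, here is a minimal sketch (not the
exact script behind the log below) of how the counts can be checked from Pig
itself. It assumes the placeholder keyspace/column family names and the super
column schema from my original message, and that the Cassandra and Thrift jars
are already on the classpath. The aliases per_row, all_rows and row_count are
just illustrative names.

```pig
-- Load the rows with the same schema as in the original script.
rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage()
       AS (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name, value)})});

-- One output tuple per row key, with the number of super columns Pig received.
-- COUNT skips tuples whose first field is null; COUNT_STAR would count those too.
per_row = FOREACH rows GENERATE key, COUNT(columns) AS n_super_columns;
DUMP per_row;

-- Total number of row keys Pig received.
all_rows = GROUP rows ALL;
row_count = FOREACH all_rows GENERATE COUNT(rows) AS n_rows;
DUMP row_count;
```

Comparing n_super_columns and n_rows with what is known to be in the db should
show whether whole rows are missing, columns within a row are missing, or both.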


Thanks!

PIG MESSAGES: (this is for a slightly different script, but the problem is
the same)

2012-04-06 17:27:14,689 [main] INFO
org.apache.pig.tools.parameters.PreprocessorContext - Executing command :
date "+%y%m%d%H%M%S"
2012-04-06 17:27:14,907 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to hadoop file system at: hdfs://localhost:9000
2012-04-06 17:27:15,252 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
to map-reduce job tracker at: localhost:9001
2012-04-06 17:27:16,061 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: FILTER
2012-04-06 17:27:16,287 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
File concatenation threshold: 100 optimistic? false
2012-04-06 17:27:16,321 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
2012-04-06 17:27:16,321 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
2012-04-06 17:27:16,412 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added
to the job
2012-04-06 17:27:16,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-04-06 17:27:16,430 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- creating jar file Job3074512946558271384.jar
2012-04-06 17:27:25,849 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- jar file Job3074512946558271384.jar created
2012-04-06 17:27:25,869 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2012-04-06 17:27:25,928 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map-reduce job(s) waiting for submission.
2012-04-06 17:27:26,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
2012-04-06 17:27:26,588 [Thread-4] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths (combined) to process : 1
2012-04-06 17:27:27,371 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_201204061725_0001
2012-04-06 17:27:27,371 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- More information at:
http://localhost:50030/jobdetails.jsp?jobid=job_201204061725_0001
2012-04-06 17:27:46,992 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 50% complete
2012-04-06 17:27:57,079 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete
2012-04-06 17:27:57,081 [main] INFO
org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
1.0.1    0.9.3-SNAPSHOT    root    2012-04-06 17:27:16    2012-04-06
17:27:57    FILTER

Success!

Job Stats (time in seconds):
JobId    Maps    Reduces    MaxMapTime    MinMapTIme    AvgMapTime
MaxReduceTime    MinReduceTime    AvgReduceTime    Alias    Feature
Outputs
job_201204061725_0001    1    0    9    9    9    0    0    0
cols,filtered,filtered_rows,rows,super_cols    MAP_ONLY
hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144,

Input(s):
Successfully read 4 records (397 bytes) from:
"cassandra://KeySpace/ColumnFamily"

Output(s):
Successfully stored 271 records (31076 bytes) in:
"hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144"

Counters:
Total records written : 271
Total bytes written : 31076
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201204061725_0001


2012-04-06 17:27:57,092 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Success!
****hdfs://localhost:9000/tmp/temp-1281322067/tmp-1081464144
2012-04-06 17:27:57,109 [main] INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
to process : 1
2012-04-06 17:27:57,109 [main] INFO
org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
paths to process : 1
...


On Mon, Apr 2, 2012 at 2:11 PM, Dan Feldman <[email protected]> wrote:

> Managed to get MR mode running by specifying HADOOP_CLASSPATH in
> $HADOOP_HOME/conf/hadoop-env.sh and restarting hadoop after that.
>
> In any case, it seems that Pig continues to misbehave in both MR and local
> modes: counting rows produces 2 results while we know there are 5 of them,
> ILLUSTRATE dumps recent data to grunt, and STORE only saves some subset
> of recent data.
>
>
> On Mon, Apr 2, 2012 at 9:54 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Looks like you don't have Thrift on your classpath, or the wrong
>> version of thrift.
>>
>> Pig may be doing something weird with splits in local mode. It would
>> be great if you could determine whether (assuming you fix the
>> classpath) the problem happens in local mode only, or both in local
>> and MR modes.
>>
>> D
>>
>> On Sun, Apr 1, 2012 at 9:49 PM, Dan Feldman <[email protected]> wrote:
>> > Hi Dmitriy,
>> >
>> > Apologies for the delay - our server was misbehaving so it took a while to
>> > get everything set up on a new one. In any case, we basically cloned
>> > Cassandra from the old one to the new one - running Pig in local mode still
>> > produces the wrong number of results. Now, we never ran the scripts in MR mode,
>> > so I don't know whether this is related to the original problem or not, but
>> > this is the error I get when running on top of hadoop:
>> >
>> >
>> ===============================================================================
>> > ....
>> > *2012-04-01 21:32:38,781 [main] INFO
>> >
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> > - job job_201203301228_0003 has failed! Stop running all dependent jobs
>> > 2012-04-01 21:32:38,782 [main] INFO
>> >
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> > - 100% complete
>> > 2012-04-01 21:32:38,790 [main] ERROR
>> > org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to
>> > recreate exception from backed error: Error:
>> > java.lang.ClassNotFoundException: org.apache.thrift.TException
>> > 2012-04-01 21:32:38,790 [main] ERROR
>> > org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
>> > 2012-04-01 21:32:38,791 [main] INFO
>> > org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics*:
>> > ....
>> >
>> ================================================================================
>> >
>> > Thanks,
>> > Dan F
>> >
>> >
>> > On Thu, Mar 29, 2012 at 6:20 PM, Dmitriy Ryaboy <[email protected]>
>> wrote:
>> >
>> >> What happens when you run in MR mode instead of local mode?
>> >>
>> >> On Thu, Mar 29, 2012 at 11:24 AM, Dan Feldman <[email protected]>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I'm loading a bunch of data into Pig using CassandraStorage. When I do a
>> >> > dump and/or store, the amount of data that is output is actually only
>> >> > 2-3% of the amount of data in the Cassandra database.
>> >> >
>> >> > My Cassandra data consists of (for now) 4-5 wide rows where each data entry
>> >> > is a super column ordered by TimeUUID.
>> >> >
>> >> > So, my script now looks like
>> >> >
>> >> > rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage() AS
>> >> > (key:chararray, columns: bag{(t:chararray, subcolumns: bag{(name,
>> >> > value)})});
>> >> > store rows into 'directory/test';
>> >> >
>> >> > The output that I get when I run the script looks like this (I highlighted
>> >> > the warnings):
>> >> >
>> >> >
>> >>
>> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> > *2012-03-29 11:10:58,063 [main] INFO  org.apache.pig.Main - Logging
>> error
>> >> > messages to: /directory/pig_1333044658058.log
>> >> > 2012-03-29 11:10:58,105 [main] INFO
>> >> > org.apache.pig.tools.parameters.PreprocessorContext - Executing
>> command :
>> >> > date "+%y%m%d%H%M%S"
>> >> > 2012-03-29 11:10:58,268 [main] INFO
>> >> > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> >> Connecting
>> >> > to hadoop file system at: file:///
>> >> > 2012-03-29 11:10:59,018 [main] INFO
>> >> > org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>> >> > script: UNKNOWN
>> >> > 2012-03-29 11:10:59,182 [main] INFO
>> >> >
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
>> >> > File concatenation threshold: 100 optimistic? false
>> >> > 2012-03-29 11:10:59,211 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>> >> > - MR plan size before optimization: 1
>> >> > 2012-03-29 11:10:59,211 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>> >> > - MR plan size after optimization: 1
>> >> > 2012-03-29 11:10:59,251 [main] INFO
>> >> > org.apache.pig.tools.pigstats.ScriptState - Pig script settings are
>> added
>> >> > to the job
>> >> > 2012-03-29 11:10:59,269 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>> >> > - mapred.job.reduce.markreset.buffer.percent is not set, set to
>> default
>> >> 0.3
>> >> > 2012-03-29 11:10:59,292 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
>> >> > - Setting up single store job
>> >> > 2012-03-29 11:10:59,334 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >> > - 1 map-reduce job(s) waiting for submission.
>> >> > 2012-03-29 11:10:59,361 [Thread-1] WARN
>> >>  org.apache.hadoop.mapred.JobClient
>> >> > - No job jar file set.  User classes may not be found. See
>> JobConf(Class)
>> >> > or JobConf#setJar(String).
>> >> > 2012-03-29 11:10:59,437 [Thread-1] INFO
>> >> > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> >> input
>> >> > paths (combined) to process : 1
>> >> > 2012-03-29 11:10:59,836 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >> > - HadoopJobId: job_local_0001
>> >> > 2012-03-29 11:10:59,836 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >> > - 0% complete
>> >> > 2012-03-29 11:11:01,185 [Thread-2] INFO
>>  org.apache.hadoop.mapred.Task -
>> >> > Task:attempt_local_0001_m_000000_0 is done. And is in the process of
>> >> > commiting
>> >> > 2012-03-29 11:11:01,189 [Thread-2] INFO
>> >> > org.apache.hadoop.mapred.LocalJobRunner -
>> >> > 2012-03-29 11:11:01,189 [Thread-2] INFO
>>  org.apache.hadoop.mapred.Task -
>> >> > Task attempt_local_0001_m_000000_0 is allowed to commit now
>> >> > 2012-03-29 11:11:01,192 [Thread-2] INFO
>> >> > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved
>> output
>> >> > of task 'attempt_local_0001_m_000000_0' to file:/root/root/test
>> >> > 2012-03-29 11:11:02,714 [Thread-2] INFO
>> >> > org.apache.hadoop.mapred.LocalJobRunner -
>> >> > 2012-03-29 11:11:02,714 [Thread-2] INFO
>>  org.apache.hadoop.mapred.Task -
>> >> > Task 'attempt_local_0001_m_000000_0' done.
>> >> > 2012-03-29 11:11:04,842 [main] WARN
>> >> > org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get
>> RunningJob for
>> >> > job job_local_0001
>> >> > 2012-03-29 11:11:04,845 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >> > - 100% complete
>> >> > 2012-03-29 11:11:04,845 [main] INFO
>> >> > org.apache.pig.tools.pigstats.SimplePigStats - Detected Local mode.
>> Stats
>> >> > reported below may be incomplete
>> >> > 2012-03-29 11:11:04,847 [main] INFO
>> >> > org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>> >> >
>> >> > HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt
>> >>  Features
>> >> > 0.20.203.0    0.9.3-SNAPSHOT    root    2012-03-29 11:10:59
>>  2012-03-29
>> >> > 11:11:04    UNKNOWN
>> >> >
>> >> > Success!
>> >> >
>> >> > Job Stats (time in seconds):
>> >> > JobId    Alias    Feature    Outputs
>> >> > job_local_0001    rows    MAP_ONLY    file:///root/directory/test,
>> >> >
>> >> > Input(s):
>> >> > Successfully read records from: "cassandra://Keyspace/ColumnFamily"
>> >> >
>> >> > Output(s):
>> >> > Successfully stored records in: "file:///root/directory/test"
>> >> >
>> >> > Job DAG:
>> >> > job_local_0001
>> >> >
>> >> >
>> >> > 2012-03-29 11:11:04,849 [main] INFO
>> >> >
>> >>
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>> >> > - Success!*
>> >> >
>> >> >
>> >>
>> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> >
>> >> >
>> >> > Now, I don't know whether it's related or not to the problem, but I
>> >> > recently noticed that ILLUSTRATE dumps the data to the terminal before
>> >> > actually illustrating the schema. It outputs the same amount of data (about
>> >> > 2-3% of the total) as it would if I just ran DUMP or STORE.
>> >> >
>> >> > I'm using Pig 0.9.3 in local mode with Cassandra 1.0.8
>> >> >
>> >> >
>> >> > P.S. I tried setting -Dpig.splitCombination=false as was suggested by Matt in
>> >> > http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Pig-not-reading-all-cassandra-data-td5982283.html,
>> >> > but it didn't help...
>> >> >
>> >> >
>> >> > Thanks for your help!
>> >> > Dan F.
>> >>
>>
>
>
