Hi Ralph,

Glad it is working!
Regards,
Ravi

On Tue, Feb 3, 2015 at 3:29 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:

> I have solved the problem. This was a mystery because the same data loaded
> into the same schema gave conflicting counts depending on the load
> technique. While the data itself had no duplicate keys, the behavior
> suggested something was up with the keys (MR input/output had the correct
> record count for both load techniques, for instance). I confirmed this by
> creating a Pig UDF that created a UUID for each row as the PK. The result
> of running this test was that each row appeared as expected and I got the
> correct count. But I couldn't figure out why the data itself would behave
> differently, because it was also unique. My Pig script could hardly be
> simpler, with no transformations; it is a simple load and store. This
> ended up being the issue!
>
> Solution:
> Assign the correct Pig data type to the PK values rather than letting Pig
> figure it out. I am not sure what the exact underlying issue is, but this
> fixed it (perhaps when Pig coerced the values to the data type it thought
> best, it munged them somehow).
>
> Changes to Pig script from below:
>
> Z = load '$data' USING PigStorage(',') as (
>     file_name:chararray,
>     rec_num:int,
>
> Thanks for the help
>
> Ralph
>
> From: <Ciureanu>, "Constantin (GfK)" <constantin.ciure...@gfk.com>
> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Date: Tuesday, February 3, 2015 at 1:52 AM
> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Subject: RE: Pig vs Bulk Load record count
>
> Hello Ralph,
>
> Check whether the Pig script produces keys that overlap (that would
> explain the reduced number of rows).
>
> Good luck,
> Constantin
>
> From: Ravi Kiran [mailto:maghamraviki...@gmail.com]
> Sent: Tuesday, February 03, 2015 2:42 AM
> To: user@phoenix.apache.org
> Subject: Re: Pig vs Bulk Load record count
>
> Thanks Ralph.
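[Editor's note: the fix above works because fields left untyped in a Pig
LOAD default to bytearray, so the PK columns were being coerced downstream.
A minimal sketch of the corrected script follows; only file_name:chararray
and rec_num:int are confirmed in the thread, and the column list is
truncated here for brevity. Untestable without a cluster.]

```pig
-- Sketch of the fix: give the primary-key columns explicit Pig types
-- instead of letting them default to bytearray. Remaining columns are
-- left untyped, as in the original script.
register $phoenix_jar;

Z = load '$data' USING PigStorage(',') as (
    file_name:chararray,
    rec_num:int,
    epoch_time,
    timet);

STORE Z into 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET'
    using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
```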
> I will try to reproduce this on my end with a sample data set and get
> back to you.
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
> Ravi,
>
> The create statement is attached. You will see some additional fields I
> excluded from the first email.
>
> Thanks!
> Ralph
>
> ------------------------------
> From: Ravi Kiran [maghamraviki...@gmail.com]
> Sent: Monday, February 02, 2015 5:03 PM
> To: user@phoenix.apache.org
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> Is it possible to share the CREATE TABLE command? I would like to
> reproduce the error on my side with a sample dataset using your specific
> data types.
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
> Ravi,
>
> Thanks for the help. I am sorry, but I am not finding the upsert
> statement. Attached are the logs and output. I specify the columns
> because I get errors if I do not.
>
> I ran a test on 10K records. Pig states it processed 10K records.
> select count(1) says 9030. I analyzed the 10K data in Excel and there
> are no duplicates.
>
> Thanks!
> Ralph
>
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory
> (509) 375-2272
> ralph.pe...@pnnl.gov
>
> From: Ravi Kiran <maghamraviki...@gmail.com>
> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Date: Monday, February 2, 2015 at 12:23 PM
> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> Regarding the upsert query in the logs, it should be "Phoenix Custom
> Upsert Statement:" since you have explicitly specified the fields in
> STORE. Is it possible to give it a try with a smaller set of records,
> say 8K, to see the behavior?
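[Editor's note: the 10K-row test above (Pig reports 10,000 stored,
count(1) returns 9,030) is the signature of key collisions: Phoenix
upserts silently overwrite rows that resolve to the same PK. A hedged
Pig sketch for spotting colliding keys after explicit typing; relation
and column names follow the thread, but this script is an illustration,
not something posted in it.]

```pig
-- Group by the candidate primary key and keep any key that occurs more
-- than once; an empty result means the typed keys really are unique.
Z = load '$data' USING PigStorage(',') as (file_name:chararray, rec_num:int);
keyed  = GROUP Z BY (file_name, rec_num);
counts = FOREACH keyed GENERATE group AS pk, COUNT(Z) AS n;
dups   = FILTER counts BY n > 1;
DUMP dups;
```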
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
> Thanks for the quick response. Here is what I have below:
>
> ========================================
> Pig script:
> -------------------------------
> register $phoenix_jar;
>
> Z = load '$data' USING PigStorage(',') as (
>     file_name,
>     rec_num,
>     epoch_time,
>     timet,
>     site,
>     proto,
>     saddr,
>     daddr,
>     sport,
>     dport,
>     mf,
>     cf,
>     dur,
>     sdata,
>     ddata,
>     sbyte,
>     dbyte,
>     spkt,
>     dpkt,
>     siopt,
>     diopt,
>     stopt,
>     dtopt,
>     sflags,
>     dflags,
>     flags,
>     sfseq,
>     dfseq,
>     slseq,
>     dlseq,
>     category);
>
> STORE Z into
> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper','-batchSize 5000');
>
> =========================
>
> I cannot find the upsert statement you are referring to in either the MR
> logs or the Pig output, but I do have this below. Pig thinks it output
> the correct number of records:
>
> Input(s):
> Successfully read 42871627 records (1479463169 bytes) from:
> "/data/incoming/201501124931/SAMPLE"
>
> Output(s):
> Successfully stored 42871627 records in:
> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>
> Count command:
> select count(1) from TEST;
>
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory
> (509) 375-2272
> ralph.pe...@pnnl.gov
>
> From: Ravi Kiran <maghamraviki...@gmail.com>
> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Date: Monday, February 2,
> 2015 at 11:01 AM
> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> That's definitely a cause for worry. Can you please share the UPSERT
> query being built by Phoenix? You should see it in the logs with an
> entry "Phoenix Generic Upsert Statement: ..."
>
> Also, what do the MapReduce counters say for the job? If possible, can
> you share the Pig script, as sometimes the order of columns in the STORE
> command matters.
>
> Regards
> Ravi
>
> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
> Hi, I've run into a peculiar issue between loading data using Pig vs.
> the CsvBulkLoadTool. I have 42M CSV records to load and I am comparing
> the performance.
>
> In both cases the MR jobs are successful, and there are no errors.
> In both cases the MR job counters state there are 42M map input and
> output records.
>
> However, when I run count on the table when the jobs are complete,
> something is terribly off.
> After the bulk load, select count shows all 42M recs in Phoenix, as
> expected.
> After the Pig load there are only 3M recs in Phoenix, not even close.
>
> I have no errors to send. I have run the same test multiple times and
> gotten the same results. The Pig script is not doing any
> transformations; it is a simple LOAD and STORE.
> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
> 4.2.3-SNAPSHOT is running on the region servers.
>
> Thanks,
> Ralph
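[Editor's note: for readers comparing the two load paths, the bulk-load
side of the thread is Phoenix's MapReduce CsvBulkLoadTool. A sketch of an
invocation is below; the jar name, input path, and ZooKeeper quorum are
placeholders, not values taken from the thread. Untestable without a
Hadoop cluster.]

```
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table TEST \
    --input /data/incoming/201501124931/SAMPLE \
    --zookeeper <zk-quorum>
```

Unlike PhoenixHBaseStorage, which issues upserts through the Phoenix
client, the bulk-load tool writes HFiles directly, which is why its row
count matched the input in this thread.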