Thanks Ralph. I will try to reproduce this on my end with a sample data set and get back to you.
Regards,
Ravi

On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:

> Ravi,
>
> The create statement is attached. You will see some additional fields I
> excluded from the first email.
>
> Thanks!
> Ralph
>
> ------------------------------
> *From:* Ravi Kiran [maghamraviki...@gmail.com]
> *Sent:* Monday, February 02, 2015 5:03 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> Is it possible to share the CREATE TABLE command? I would like to
> reproduce the error on my side with a sample dataset using your specific
> data types.
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
>> Ravi,
>>
>> Thanks for the help. I am sorry, I am not finding the upsert
>> statement. Attached are the logs and output. I specify the columns
>> because I get errors if I do not.
>>
>> I ran a test on 10K records. Pig states it processed 10K records.
>> Select count(1) says 9030. I analyzed the 10K data in Excel and there
>> are no duplicates.
>>
>> Thanks!
>> Ralph
>>
>> __________________________________________________
>> *Ralph Perko*
>> Pacific Northwest National Laboratory
>> (509) 375-2272
>> ralph.pe...@pnnl.gov
>>
>> From: Ravi Kiran <maghamraviki...@gmail.com>
>> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Date: Monday, February 2, 2015 at 12:23 PM
>> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>>
>> Regarding the upsert query in the logs: it should read *Phoenix Custom
>> Upsert Statement:* since you have explicitly specified the fields in
>> STORE. Is it possible to try with a smaller set of records, say 8K, to
>> see the behavior?
>>
>> Regards,
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>>
>>> Thanks for the quick response.
>>> Here is what I have below:
>>>
>>> ========================================
>>> Pig script:
>>> -------------------------------
>>> register $phoenix_jar;
>>>
>>> Z = load '$data' USING PigStorage(',') as (
>>>     file_name,
>>>     rec_num,
>>>     epoch_time,
>>>     timet,
>>>     site,
>>>     proto,
>>>     saddr,
>>>     daddr,
>>>     sport,
>>>     dport,
>>>     mf,
>>>     cf,
>>>     dur,
>>>     sdata,
>>>     ddata,
>>>     sbyte,
>>>     dbyte,
>>>     spkt,
>>>     dpkt,
>>>     siopt,
>>>     diopt,
>>>     stopt,
>>>     dtopt,
>>>     sflags,
>>>     dflags,
>>>     flags,
>>>     sfseq,
>>>     dfseq,
>>>     slseq,
>>>     dlseq,
>>>     category);
>>>
>>> STORE Z into
>>> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
>>> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
>>>
>>> =========================
>>>
>>> I cannot find the upsert statement you are referring to in either the
>>> MR logs or the Pig output, but I do have this below. Pig thinks it
>>> output the correct number of records:
>>>
>>> Input(s):
>>> Successfully read 42871627 records (1479463169 bytes) from:
>>> "/data/incoming/201501124931/SAMPLE"
>>>
>>> Output(s):
>>> Successfully stored 42871627 records in:
>>> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>>>
>>> Count command:
>>> select count(1) from TEST;
>>>
>>> __________________________________________________
>>> *Ralph Perko*
>>> Pacific Northwest National Laboratory
>>> (509) 375-2272
>>> ralph.pe...@pnnl.gov
>>>
>>> From: Ravi Kiran <maghamraviki...@gmail.com>
>>> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>>> Date: Monday, February 2, 2015 at 11:01 AM
>>> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>>> Subject: Re: Pig vs Bulk Load record count
>>> Hi Ralph, >>> >>> That's definitely a cause of worry. Can you please share the UPSERT >>> query being built by Phoenix . You should see it in the logs with an entry >>> "*Phoenix >>> Generic Upsert Statement: *.. >>> Also, what do the MapReduce counters say for the job. If possible can >>> you share the pig script as sometimes the order of columns in the STORE >>> command impacts. >>> >>> Regards >>> Ravi >>> >>> >>> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <ralph.pe...@pnnl.gov> >>> wrote: >>> >>>> Hi, I’ve run into a peculiar issue between loading data using Pig vs >>>> the CsvBulkLoadTool. I have 42M csv records to load and I am comparing the >>>> performance. >>>> >>>> In both cases the MR jobs are successful, and there are no errors. >>>> In both cases the MR job counters state there are 42M Map input and >>>> output records >>>> >>>> However, when I run count on the table when the jobs are complete >>>> something is terribly off. >>>> After the bulk load, select count shows all 42M recs in Phoenix as is >>>> expected. >>>> After the pig load there are only 3M recs in Phoenix – not even close. >>>> >>>> I have no errors to send. I have run the same test multiple times >>>> and gotten the same results. The pig script is not doing any >>>> transformations. It is a simple LOAD and STORE >>>> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT. >>>> 4.2.3-SNAPSHOT is running on the region servers. >>>> >>>> Thanks, >>>> Ralph >>>> >>>> >>> >> >
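One mechanism that can produce exactly this symptom, though it is not confirmed anywhere in the thread, is that Phoenix writes are UPSERTs: rows that share a primary key silently replace one another, so a job can report 42M successful writes while the table ends up with far fewer rows, with no errors raised. A minimal sketch of that semantics (`upsert_all` and its key function are hypothetical names, not Phoenix API):

```python
# Model of UPSERT semantics: a later record with the same primary key
# replaces the earlier one, so table row count <= input record count.
def upsert_all(records, key_fn):
    table = {}
    for rec in records:
        table[key_fn(rec)] = rec  # overwrite on key collision, no error
    return table

records = [("a.log", 1, "x"), ("a.log", 1, "y"), ("b.log", 2, "z")]
table = upsert_all(records, lambda r: (r[0], r[1]))
print(len(records), len(table))  # 3 input records, 2 table rows
```

This is why the MapReduce counters (which count records written) and `select count(1)` (which counts surviving rows) can legitimately disagree when keys collide.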
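The 10K sample was checked for duplicates in Excel; a scriptable alternative is to count distinct values of the composite key directly in the CSV. This is an illustrative sketch only (the helper name and the choice of key columns are assumptions; substitute the indices of your table's actual primary key):

```python
import csv
import io

# Count total rows vs. distinct composite keys in CSV text.
# key_indices selects the columns that form the (assumed) primary key.
def count_key_collisions(csv_text, key_indices):
    seen = {}
    for row in csv.reader(io.StringIO(csv_text)):
        key = tuple(row[i] for i in key_indices)
        seen[key] = seen.get(key, 0) + 1
    return sum(seen.values()), len(seen)

sample = "a.log,1,100\na.log,2,101\na.log,1,102\n"
total, distinct = count_key_collisions(sample, [0, 1])
print(total, distinct)  # 3 rows but only 2 distinct keys
```

If `total` and `distinct` differ on the real data, the missing rows are explained by key collisions rather than by a loader bug.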