Thanks Ralph. I will try to reproduce this on my end with a sample data set and get back to you.
Regards,
Ravi

On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:

> Ravi,
>
> The create statement is attached. You will see some additional fields I
> excluded from the first email.
>
> Thanks!
> Ralph
>
> ------------------------------
> *From:* Ravi Kiran [maghamraviki...@gmail.com]
> *Sent:* Monday, February 02, 2015 5:03 PM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> Is it possible to share the CREATE TABLE command? I would like to
> reproduce the error on my side with a sample dataset using your specific
> data types.
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
>> Ravi,
>>
>> Thanks for the help. I am sorry, I am not finding the upsert
>> statement. Attached are the logs and output. I specify the columns
>> because I get errors if I do not.
>>
>> I ran a test on 10K records. Pig states it processed 10K records.
>> Select count(1) says 9030. I analyzed the 10K data in Excel and there
>> are no duplicates.
>>
>> Thanks!
>> Ralph
>>
>> __________________________________________________
>> *Ralph Perko*
>> Pacific Northwest National Laboratory
>> (509) 375-2272
>> ralph.pe...@pnnl.gov
>>
>> From: Ravi Kiran <maghamraviki...@gmail.com>
>> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Date: Monday, February 2, 2015 at 12:23 PM
>> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Subject: Re: Pig vs Bulk Load record count
>>
>> Hi Ralph,
>>
>> Regarding the upsert query in the logs: it should read *Phoenix Custom
>> Upsert Statement:* since you have explicitly specified the fields in
>> STORE. Is it possible to try with a smaller set of records, say 8K, to
>> see the behavior?
>>
>> Regards,
>> Ravi
>>
>> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>>
>>> Thanks for the quick response.
>>> Here is what I have below:
>>>
>>> ========================================
>>> Pig script:
>>> -------------------------------
>>> register $phoenix_jar;
>>>
>>> Z = load '$data' USING PigStorage(',') as (
>>>     file_name,
>>>     rec_num,
>>>     epoch_time,
>>>     timet,
>>>     site,
>>>     proto,
>>>     saddr,
>>>     daddr,
>>>     sport,
>>>     dport,
>>>     mf,
>>>     cf,
>>>     dur,
>>>     sdata,
>>>     ddata,
>>>     sbyte,
>>>     dbyte,
>>>     spkt,
>>>     dpkt,
>>>     siopt,
>>>     diopt,
>>>     stopt,
>>>     dtopt,
>>>     sflags,
>>>     dflags,
>>>     flags,
>>>     sfseq,
>>>     dfseq,
>>>     slseq,
>>>     dlseq,
>>>     category);
>>>
>>> STORE Z into
>>> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
>>> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
>>>
>>> =========================
>>>
>>> I cannot find the upsert statement you are referring to in either the
>>> MR logs or the Pig output, but I do have this below. Pig thinks it
>>> output the correct number of records:
>>>
>>> Input(s):
>>> Successfully read 42871627 records (1479463169 bytes) from:
>>> "/data/incoming/201501124931/SAMPLE"
>>>
>>> Output(s):
>>> Successfully stored 42871627 records in:
>>> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>>>
>>> Count command:
>>> select count(1) from TEST;
>>>
>>> __________________________________________________
>>> *Ralph Perko*
>>> Pacific Northwest National Laboratory
>>> (509) 375-2272
>>> ralph.pe...@pnnl.gov
>>>
>>> From: Ravi Kiran <maghamraviki...@gmail.com>
>>> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>>> Date: Monday, February 2, 2015 at 11:01 AM
>>> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>>> Subject: Re: Pig vs Bulk Load record count
>>> Hi Ralph, >>> >>> That's definitely a cause of worry. Can you please share the UPSERT >>> query being built by Phoenix . You should see it in the logs with an entry >>> "*Phoenix >>> Generic Upsert Statement: *.. >>> Also, what do the MapReduce counters say for the job. If possible can >>> you share the pig script as sometimes the order of columns in the STORE >>> command impacts. >>> >>> Regards >>> Ravi >>> >>> >>> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <ralph.pe...@pnnl.gov> >>> wrote: >>> >>>> Hi, I’ve run into a peculiar issue between loading data using Pig vs >>>> the CsvBulkLoadTool. I have 42M csv records to load and I am comparing the >>>> performance. >>>> >>>> In both cases the MR jobs are successful, and there are no errors. >>>> In both cases the MR job counters state there are 42M Map input and >>>> output records >>>> >>>> However, when I run count on the table when the jobs are complete >>>> something is terribly off. >>>> After the bulk load, select count shows all 42M recs in Phoenix as is >>>> expected. >>>> After the pig load there are only 3M recs in Phoenix – not even close. >>>> >>>> I have no errors to send. I have run the same test multiple times >>>> and gotten the same results. The pig script is not doing any >>>> transformations. It is a simple LOAD and STORE >>>> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT. >>>> 4.2.3-SNAPSHOT is running on the region servers. >>>> >>>> Thanks, >>>> Ralph >>>> >>>> >>> >> >
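One mechanism that can produce exactly this symptom, though it is not confirmed anywhere in the thread, is that Phoenix writes are UPSERTs: rows that share a primary key silently replace one another, so a job can report 42M successful writes while the table ends up with far fewer rows, with no errors raised. A minimal sketch of that semantics (`upsert_all` and its key function are hypothetical names, not Phoenix API):

```python
# Model of UPSERT semantics: a later record with the same primary key
# replaces the earlier one, so table row count <= input record count.
def upsert_all(records, key_fn):
    table = {}
    for rec in records:
        table[key_fn(rec)] = rec  # overwrite on key collision, no error
    return table

records = [("a.log", 1, "x"), ("a.log", 1, "y"), ("b.log", 2, "z")]
table = upsert_all(records, lambda r: (r[0], r[1]))
print(len(records), len(table))  # 3 input records, 2 table rows
```

This is why the MapReduce counters (which count records written) and `select count(1)` (which counts surviving rows) can legitimately disagree when keys collide.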
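The 10K sample was checked for duplicates in Excel; a scriptable alternative is to count distinct values of the composite key directly in the CSV. This is an illustrative sketch only (the helper name and the choice of key columns are assumptions; substitute the indices of your table's actual primary key):

```python
import csv
import io

# Count total rows vs. distinct composite keys in CSV text.
# key_indices selects the columns that form the (assumed) primary key.
def count_key_collisions(csv_text, key_indices):
    seen = {}
    for row in csv.reader(io.StringIO(csv_text)):
        key = tuple(row[i] for i in key_indices)
        seen[key] = seen.get(key, 0) + 1
    return sum(seen.values()), len(seen)

sample = "a.log,1,100\na.log,2,101\na.log,1,102\n"
total, distinct = count_key_collisions(sample, [0, 1])
print(total, distinct)  # 3 rows but only 2 distinct keys
```

If `total` and `distinct` differ on the real data, the missing rows are explained by key collisions rather than by a loader bug.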