Hi Ralph,

Glad it is working!
Regards,
Ravi

On Tue, Feb 3, 2015 at 3:29 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:

> I have solved the problem. This was a mystery because the same data loaded
> into the same schema gave conflicting counts depending on the load
> technique. While the data itself had no duplicate keys, the behavior
> suggested something was up with the keys (MR input/output had the correct
> record count for both load techniques, for instance). I confirmed this by
> creating a Pig UDF that created a UUID for each row as the PK. The result
> of running this test was that each row appeared as expected and I got the
> correct count. But I couldn't figure out why the data itself would behave
> differently, because it was also unique. My Pig script could hardly be
> simpler, with no transformations; it is a simple load and store. This
> ended up being the issue!
>
> Solution:
> Assign the correct Pig data type to the PK values rather than letting Pig
> figure it out. I am not sure what the exact underlying issue is, but this
> fixed it (perhaps when Pig coerced the values to the data type it thought
> best, it munged them somehow).
>
> Changes to Pig script from below:
>
> Z = load '$data' USING PigStorage(',') as (
>     file_name:chararray,
>     rec_num:int,
>
> Thanks for the help
>
> Ralph
>
> From: <Ciureanu>, "Constantin (GfK)" <constantin.ciure...@gfk.com>
> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Date: Tuesday, February 3, 2015 at 1:52 AM
> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Subject: RE: Pig vs Bulk Load record count
>
> Hello Ralph,
>
> Check whether the Pig script produces keys that overlap (that would
> explain the reduced number of rows).
>
> Good luck,
> Constantin
>
> From: Ravi Kiran [mailto:maghamraviki...@gmail.com]
> Sent: Tuesday, February 03, 2015 2:42 AM
> To: user@phoenix.apache.org
> Subject: Re: Pig vs Bulk Load record count
>
> Thanks Ralph.
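[Editor's note: the fix above works because fields left untyped in a Pig
LOAD default to bytearray, so the PK columns were being coerced downstream.
A minimal sketch of the corrected script follows; only file_name:chararray
and rec_num:int are confirmed in the thread, and the column list is
truncated here for brevity. Untestable without a cluster.]

```pig
-- Sketch of the fix: give the primary-key columns explicit Pig types
-- instead of letting them default to bytearray. Remaining columns are
-- left untyped, as in the original script.
register $phoenix_jar;

Z = load '$data' USING PigStorage(',') as (
    file_name:chararray,
    rec_num:int,
    epoch_time,
    timet);

STORE Z into 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET'
    using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
```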
> I will try to reproduce this on my end with a sample data set and get
> back to you.
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
> Ravi,
>
> The create statement is attached. You will see some additional fields I
> excluded from the first email.
>
> Thanks!
> Ralph
>
> ------------------------------
> From: Ravi Kiran [maghamraviki...@gmail.com]
> Sent: Monday, February 02, 2015 5:03 PM
> To: user@phoenix.apache.org
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> Is it possible to share the CREATE TABLE command? I would like to
> reproduce the error on my side with a sample dataset using your specific
> data types.
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
> Ravi,
>
> Thanks for the help. I am sorry, but I am not finding the upsert
> statement. Attached are the logs and output. I specify the columns
> because I get errors if I do not.
>
> I ran a test on 10K records. Pig states it processed 10K records.
> select count(1) says 9030. I analyzed the 10K data in Excel and there
> are no duplicates.
>
> Thanks!
> Ralph
>
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory
> (509) 375-2272
> ralph.pe...@pnnl.gov
>
> From: Ravi Kiran <maghamraviki...@gmail.com>
> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Date: Monday, February 2, 2015 at 12:23 PM
> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> Regarding the upsert query in the logs, it should be "Phoenix Custom
> Upsert Statement:" since you have explicitly specified the fields in
> STORE. Is it possible to give it a try with a smaller set of records,
> say 8K, to see the behavior?
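[Editor's note: the 10K-row test above (Pig reports 10,000 stored,
count(1) returns 9,030) is the signature of key collisions: Phoenix
upserts silently overwrite rows that resolve to the same PK. A hedged
Pig sketch for spotting colliding keys after explicit typing; relation
and column names follow the thread, but this script is an illustration,
not something posted in it.]

```pig
-- Group by the candidate primary key and keep any key that occurs more
-- than once; an empty result means the typed keys really are unique.
Z = load '$data' USING PigStorage(',') as (file_name:chararray, rec_num:int);
keyed  = GROUP Z BY (file_name, rec_num);
counts = FOREACH keyed GENERATE group AS pk, COUNT(Z) AS n;
dups   = FILTER counts BY n > 1;
DUMP dups;
```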
>
> Regards,
> Ravi
>
> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
> Thanks for the quick response. Here is what I have below:
>
> ========================================
> Pig script:
> -------------------------------
> register $phoenix_jar;
>
> Z = load '$data' USING PigStorage(',') as (
>     file_name,
>     rec_num,
>     epoch_time,
>     timet,
>     site,
>     proto,
>     saddr,
>     daddr,
>     sport,
>     dport,
>     mf,
>     cf,
>     dur,
>     sdata,
>     ddata,
>     sbyte,
>     dbyte,
>     spkt,
>     dpkt,
>     siopt,
>     diopt,
>     stopt,
>     dtopt,
>     sflags,
>     dflags,
>     flags,
>     sfseq,
>     dfseq,
>     slseq,
>     dlseq,
>     category);
>
> STORE Z into
> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper','-batchSize 5000');
>
> =========================
>
> I cannot find the upsert statement you are referring to in either the MR
> logs or the Pig output, but I do have this below. Pig thinks it output
> the correct number of records:
>
> Input(s):
> Successfully read 42871627 records (1479463169 bytes) from:
> "/data/incoming/201501124931/SAMPLE"
>
> Output(s):
> Successfully stored 42871627 records in:
> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>
> Count command:
> select count(1) from TEST;
>
> __________________________________________________
> Ralph Perko
> Pacific Northwest National Laboratory
> (509) 375-2272
> ralph.pe...@pnnl.gov
>
> From: Ravi Kiran <maghamraviki...@gmail.com>
> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Date: Monday, February 2,
> 2015 at 11:01 AM
> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Subject: Re: Pig vs Bulk Load record count
>
> Hi Ralph,
>
> That's definitely a cause for worry. Can you please share the UPSERT
> query being built by Phoenix? You should see it in the logs with an
> entry "Phoenix Generic Upsert Statement: ..."
>
> Also, what do the MapReduce counters say for the job? If possible, can
> you share the Pig script, as sometimes the order of columns in the STORE
> command matters.
>
> Regards
> Ravi
>
> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <ralph.pe...@pnnl.gov> wrote:
>
> Hi, I've run into a peculiar issue between loading data using Pig vs.
> the CsvBulkLoadTool. I have 42M CSV records to load and I am comparing
> the performance.
>
> In both cases the MR jobs are successful, and there are no errors.
> In both cases the MR job counters state there are 42M map input and
> output records.
>
> However, when I run count on the table when the jobs are complete,
> something is terribly off.
> After the bulk load, select count shows all 42M recs in Phoenix, as
> expected.
> After the Pig load there are only 3M recs in Phoenix, not even close.
>
> I have no errors to send. I have run the same test multiple times and
> gotten the same results. The Pig script is not doing any
> transformations; it is a simple LOAD and STORE.
> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
> 4.2.3-SNAPSHOT is running on the region servers.
>
> Thanks,
> Ralph
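[Editor's note: for readers comparing the two load paths, the bulk-load
side of the thread is Phoenix's MapReduce CsvBulkLoadTool. A sketch of an
invocation is below; the jar name, input path, and ZooKeeper quorum are
placeholders, not values taken from the thread. Untestable without a
Hadoop cluster.]

```
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table TEST \
    --input /data/incoming/201501124931/SAMPLE \
    --zookeeper <zk-quorum>
```

Unlike PhoenixHBaseStorage, which issues upserts through the Phoenix
client, the bulk-load tool writes HFiles directly, which is why its row
count matched the input in this thread.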