Yes, I did filter using the same conditions you’ve mentioned. I tested it earlier with comma as the delimiter (previous email has logs) and now with ^A.
[csingh~]$ cat -v test.txt
1^A2^A76
1^A^A^A76
^A2^A^A76
1^A1^A2^A
1^A1^A1^A76
1^A2^A1^A76

grunt> D = LOAD 'test.txt' USING PigStorage('\\u001') AS (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
grunt> DUMP D;
(1,2,76,)
(1,,,76)
(,2,,76)
(1,1,2,)
(1,1,1,76)
(1,2,1,76)

grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
grunt> DUMP X;
(1,2,1,76)

So, the filter for NULLs is working, as you can see when I dump after filtering.

> On Feb 19, 2016, at 12:13 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>
> Did you put a Filter on the values to remove the null? I'm trying to filter
> the NULL values using the Pig Filter Keyword and then use the Phoenix Pig
> integration to store the data. I have '\\u001' as the delimiter for
> multiple files. It is supported by Pig BulkLoader too.
>
> Snippet:
>
> D = LOAD 'src_dest' using PigStorage('\\u001') AS
> (IS_REPORTED:INT,
> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>
> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>
> On Thu, Feb 18, 2016 at 3:06 PM, Chandeep Singh <c...@chandeep.com> wrote:
>
>> So, I added one record to your sample to match all the conditions you have
>> in your filter statement.
>>
>> New input:
>> [csingh]$ hadoop fs -cat test.txt
>> 1,,2,76
>> 1,,,76
>> ,2,,76
>> 1,1,2,
>> 1,1,1,76
>> 1,2,1,76
>>
>> I modified the load statement to use PigStorage delimited by comma.
>>
>> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>>
>> Output:
>> (1,2,1,76)
>>
>> So, the NOT NULL's seem to be working.
>>
>> Pig Log’s:
>>
>> grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
>> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>> grunt> DUMP X;
>> 2016-02-18 23:01:06,336 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
>> 2016-02-18 23:01:06,366 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
>> 2016-02-18 23:01:06,480 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>> 2016-02-18 23:01:10,798 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
>> 2016-02-18 23:01:11,345 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1454499131434_9884
>> 2016-02-18 23:01:11,542 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1454499131434_9884
>> 2016-02-18 23:01:11,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>> 2016-02-18 23:01:31,393 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
>> 2016-02-18 23:01:36,818 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>> 2016-02-18 23:01:36,875 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>> 2016-02-18 23:01:36,878 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>>
>> HadoopVersion PigVersion UserId StartedAt FinishedAt Features
>> 2.6.0-cdh5.4.8 0.12.0-cdh5.4.8 csingh 2016-02-18 23:01:06 2016-02-18 23:01:36 FILTER
>>
>> Success!
>>
>> Job Stats (time in seconds):
>> JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
>> job_1454499131434_9884 1 0 8 8 8 8 n/a n/a n/a n/a D,X MAP_ONLY
>>
>> Input(s):
>> Successfully read 6 records (418 bytes) from:
>>
>> Output(s):
>> Successfully stored 1 records (10 bytes) in:
>>
>> Counters:
>> Total records written : 1
>> Total bytes written : 10
>> Spillable Memory Manager spill count : 0
>> Total bags proactively spilled: 0
>> Total records proactively spilled: 0
>>
>> Job DAG:
>> job_1454499131434_9884
>>
>> 2016-02-18 23:01:36,976 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>> 2016-02-18 23:01:36,992 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>> 2016-02-18 23:01:36,993 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
>> (1,2,1,76)
>>
>>
>>
>>> On Feb 18, 2016, at 10:13 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>
>>> Attaching a sample input. Basically 5 rows with only 4 Integer values in
>>> each. Some are NULL values.
>>>
>>> Thanks.
>>>
>>> On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh <c...@chandeep.com> wrote:
>>> I’m just looking for one sample record (which has NULL's) and not the
>>> entire input so that its easier for me to debug.
>>>
>>>> On Feb 18, 2016, at 9:40 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>
>>>> The input is simply too large to relay to others. A simplified schema is
>>>> below. I only have INT columns with some null values in them. This is my
>>>> Pig code snippet:
>>>>
>>>> D= LOAD 'src_locatn' as
>>>> IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
>>>> AFFINITY_GROUP_ID:INT;
>>>>
>>>> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
>>>> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>>>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh <c...@chandeep.com> wrote:
>>>>
>>>>> Any chance you could share a sample record which has NULL’s in it? as well
>>>>> as your pig script?
>>>>>
>>>>>> On Feb 18, 2016, at 8:36 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>>>
>>>>>> I had anticipated it would throw a similar error with this suggestion as
>>>>>> the last one... and it did. My fields are declared as INT, just to
>>>>>> re-iterate. I don't think they can be compared to regexes. Here is the
>>>>>> error:
>>>>>>
>>>>>> ERROR 1037:
>>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
>>>>>> CharArray only :(Name: Regex Type: null Uid: null)
>>>>>>
>>>>>> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
>>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
>>>>>> CharArray only :(Name: Regex Type: null Uid: null)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh <c...@chandeep.com> wrote:
>>>>>>
>>>>>>> Since you have integers in this field, can you try matching to a regular
>>>>>>> expression?
>>>>>>>
>>>>>>> Something like: X matches '\\d+'
>>>>>>>
>>>>>>>> On Feb 18, 2016, at 12:55 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Chandeep. I tried that already but it gave me the following error:
>>>>>>>>
>>>>>>>> ERROR 1039:
>>>>>>>> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
>>>>>>>> types in NotEqual Operator left hand side:int right hand
>>>>>>>> side:chararray.
>>>>>>>>
>>>>>>>> The error makes sense cause the fields I have are INT type and hence
>>>>>>>> cannot be compared to a chararray.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the prompt response though.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Feb 17, 2016 16:32, "Chandeep Singh" <c...@chandeep.com> wrote:
>>>>>>>>
>>>>>>>> Try adding != '' along with IS NOT NULL.
>>>>>>>>>
>>>>>>>>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I'm trying to Filter some null fields in Pig using 'IS NOT NULL'. For some
>>>>>>>>>> reason the null data values persist.
>>>>>>>>>> For eg: the following filter on storing it's contents, contains null values
>>>>>>>>>> for ABC and PQR.
>>>>>>>>>>
>>>>>>>>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND (PQR IS
>>>>>>>>>> NOT NULL) ;
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Can someone help with this?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Parth S
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>>
>>> <Sample_in.txt>
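
For anyone reading this thread later: the DUMP output in the ^A test is consistent with Pig's usual null handling, where an empty field loaded as INT becomes null, a comparison against null yields null, and FILTER keeps only the rows whose condition evaluates to true. Below is a rough Python model of that behaviour, written as a sketch and not as Pig code. The helper names (load_rows, eq, and3, keep) are made up for illustration, and the model only treats empty fields as null; it does not try to reproduce how Pig handles malformed integers.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], ...]


def load_rows(lines: List[str], delimiter: str = "\x01", width: int = 4) -> List[Row]:
    """Hypothetical stand-in for LOAD ... AS (four INT fields)."""
    rows = []
    for line in lines:
        fields = line.split(delimiter)[:width]
        # Rows with fewer fields than the schema are padded with nulls.
        fields += [""] * (width - len(fields))
        # An empty field becomes null (None here).
        rows.append(tuple(int(f) if f else None for f in fields))
    return rows


def eq(value: Optional[int], constant: int) -> Optional[bool]:
    # Comparing a null gives null, not False.
    return None if value is None else value == constant


def and3(*terms: Optional[bool]) -> Optional[bool]:
    # Three-valued AND: False wins, otherwise any null makes the result null.
    if any(t is False for t in terms):
        return False
    if any(t is None for t in terms):
        return None
    return True


def keep(row: Row, explicit_null_checks: bool = True) -> bool:
    is_reported, status_id, program_id, group_id = row
    terms = [eq(is_reported, 1), eq(program_id, 1), eq(status_id, 2), eq(group_id, 76)]
    if explicit_null_checks:
        terms += [is_reported is not None, status_id is not None]
    # FILTER keeps a row only when the whole condition is true.
    return and3(*terms) is True


# The six ^A-delimited rows from the cat -v output in the session above.
SAMPLE = [
    "1\x012\x0176",
    "1\x01\x01\x0176",
    "\x012\x01\x0176",
    "1\x011\x012\x01",
    "1\x011\x011\x0176",
    "1\x012\x011\x0176",
]

rows = load_rows(SAMPLE)
kept = [r for r in rows if keep(r)]
print(kept)  # [(1, 2, 1, 76)]
```

In this model the same single row survives whether or not the explicit null checks are included, which suggests the IS NOT NULL terms are redundant next to the == comparisons; if nulls still show up in the stored output, the cause is probably somewhere other than the FILTER statement.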