Re: Using NOT NULL in a Pig FILTER statement.

Chandeep Singh Fri, 19 Feb 2016 13:00:13 -0800
Great :)

> On Feb 19, 2016, at 8:58 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
> 
> Hi Chandeep,
> Thanks for your help. I figured it out too.
> 
> On Fri, Feb 19, 2016 at 9:30 AM, Chandeep Singh <c...@chandeep.com> wrote:
> 
>> Yes, I did filter using the same conditions you’ve mentioned. I tested it
>> earlier with comma as the delimiter (previous email has logs) and now with
>> ^A.
>> 
>> [csingh~]$ cat -v test.txt
>> 1^A2^A76
>> 1^A^A^A76
>> ^A2^A^A76
>> 1^A1^A2^A
>> 1^A1^A1^A76
>> 1^A2^A1^A76
>> 
>> grunt> D = LOAD 'test.txt' USING PigStorage('\\u001') AS (IS_REPORTED:INT,
>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>> grunt> DUMP D;
>> (1,2,76,)
>> (1,,,76)
>> (,2,,76)
>> (1,1,2,)
>> (1,1,1,76)
>> (1,2,1,76)
>> 
>> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
>> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>> 
>> grunt> DUMP X;
>> (1,2,1,76)
>> 
>> 
>> So, the filter for NULLs is working, as you can see when I dump after
>> filtering.
>> 
>>> On Feb 19, 2016, at 12:13 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>> 
>>> Did you put a Filter on the values to remove the null? I'm trying to filter
>>> the NULL values using the Pig Filter Keyword and then use the Phoenix Pig
>>> integration to store the data. I have '\\u001' as the delimiter for
>>> multiple files. It is supported by Pig BulkLoader too.
>>> 
>>> Snippet:
>>> 
>>> D = LOAD 'src_dest' using PigStorage('\\u001') AS (IS_REPORTED:INT,
>>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>>> 
>>> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
>>> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>>> 
>>> On Thu, Feb 18, 2016 at 3:06 PM, Chandeep Singh <c...@chandeep.com> wrote:
>>> 
>>>> So, I added one record to your sample to match all the conditions you have
>>>> in your filter statement.
>>>> 
>>>> New input:
>>>> [csingh]$ hadoop fs -cat test.txt
>>>> 1,,2,76
>>>> 1,,,76
>>>> ,2,,76
>>>> 1,1,2,
>>>> 1,1,1,76
>>>> 1,2,1,76
>>>> 
>>>> I modified the load statement to use PigStorage delimited by comma.
>>>> 
>>>> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
>>>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>>>> 
>>>> Output:
>>>> (1,2,1,76)
>>>> 
>>>> So, the NOT NULL checks seem to be working.
>>>> 
>>>> Pig logs:
>>>> 
>>>> grunt> D = LOAD 'test.txt' USING PigStorage(',') AS (IS_REPORTED:INT,
>>>> PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT, AFFINITY_GROUP_ID:INT);
>>>> grunt> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID
>>>> is not null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>>>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>>>> grunt> DUMP X;
>>>> 2016-02-18 23:01:06,336 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: FILTER
>>>> 2016-02-18 23:01:06,366 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier, PartitionFilterOptimizer]}
>>>> 2016-02-18 23:01:06,480 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>>>> 2016-02-18 23:01:10,798 [JobControl] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
>>>> 2016-02-18 23:01:11,345 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1454499131434_9884
>>>> 2016-02-18 23:01:11,542 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1454499131434_9884
>>>> 2016-02-18 23:01:11,597 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
>>>> 2016-02-18 23:01:31,393 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
>>>> 2016-02-18 23:01:36,818 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>>>> 2016-02-18 23:01:36,875 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
>>>> 2016-02-18 23:01:36,878 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
>>>> 
>>>> HadoopVersion   PigVersion       UserId  StartedAt            FinishedAt           Features
>>>> 2.6.0-cdh5.4.8  0.12.0-cdh5.4.8  csingh  2016-02-18 23:01:06  2016-02-18 23:01:36  FILTER
>>>> 
>>>> Success!
>>>> 
>>>> Job Stats (time in seconds):
>>>> JobId                   Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias  Feature   Outputs
>>>> job_1454499131434_9884  1     0        8           8           8           8              n/a            n/a            n/a            n/a               D,X    MAP_ONLY
>>>> 
>>>> Input(s):
>>>> Successfully read 6 records (418 bytes) from:
>>>> 
>>>> Output(s):
>>>> Successfully stored 1 records (10 bytes) in:
>>>> 
>>>> Counters:
>>>> Total records written : 1
>>>> Total bytes written : 10
>>>> Spillable Memory Manager spill count : 0
>>>> Total bags proactively spilled: 0
>>>> Total records proactively spilled: 0
>>>> 
>>>> Job DAG:
>>>> job_1454499131434_9884
>>>> 
>>>> 2016-02-18 23:01:36,976 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
>>>> 2016-02-18 23:01:36,992 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>>>> 2016-02-18 23:01:36,993 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
>>>> (1,2,1,76)
>>>> 
>>>> 
>>>> 
>>>>> On Feb 18, 2016, at 10:13 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>> 
>>>>> Attaching a sample input. Basically 5 rows with only 4 Integer values in
>>>>> each. Some are NULL values.
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> On Thu, Feb 18, 2016 at 2:03 PM, Chandeep Singh <c...@chandeep.com> wrote:
>>>>> I’m just looking for one sample record (which has NULLs) and not the
>>>>> entire input, so that it's easier for me to debug.
>>>>> 
>>>>>> On Feb 18, 2016, at 9:40 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>>> 
>>>>>> The input is simply too large to relay to others. A simplified schema is
>>>>>> below. I only have INT columns with some null values in them. This is my
>>>>>> Pig code snippet:
>>>>>> 
>>>>>> D = LOAD 'src_locatn' AS
>>>>>> (IS_REPORTED:INT, PROCESSING_STATUS_ID:INT, PROGRAM_ID:INT,
>>>>>> AFFINITY_GROUP_ID:INT);
>>>>>> 
>>>>>> X = FILTER D BY (IS_REPORTED is not null) AND (PROCESSING_STATUS_ID is not
>>>>>> null) AND (IS_REPORTED==1) AND (PROGRAM_ID==1) AND
>>>>>> (PROCESSING_STATUS_ID==2) AND (AFFINITY_GROUP_ID==76);
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> On Thu, Feb 18, 2016 at 12:59 PM, Chandeep Singh <c...@chandeep.com> wrote:
>>>>>> 
>>>>>>> Any chance you could share a sample record which has NULLs in it, as well
>>>>>>> as your Pig script?
>>>>>>> 
>>>>>>>> On Feb 18, 2016, at 8:36 PM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> I had anticipated it would throw a similar error with this suggestion as
>>>>>>>> the last one... and it did. My fields are declared as INT, just to
>>>>>>>> re-iterate. I don't think they can be compared to regexes. Here is the
>>>>>>>> error:
>>>>>>>> 
>>>>>>>> ERROR 1037:
>>>>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
>>>>>>>> CharArray only :(Name: Regex Type: null Uid: null)
>>>>>>>> 
>>>>>>>> org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1037:
>>>>>>>> <file LeadSales.pig, line 19, column 29> Operands of Regex can be
>>>>>>>> CharArray only :(Name: Regex Type: null Uid: null)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Feb 18, 2016 at 5:24 AM, Chandeep Singh <c...@chandeep.com> wrote:
>>>>>>>> 
>>>>>>>>> Since you have integers in this field, can you try matching against a
>>>>>>>>> regular expression?
>>>>>>>>> 
>>>>>>>>> Something like: X matches '\\d+'
>>>>>>>>> 
>>>>>>>>>> On Feb 18, 2016, at 12:55 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Chandeep. I tried that already but it gave me the following error:
>>>>>>>>>> 
>>>>>>>>>> ERROR 1039:
>>>>>>>>>> <file LeadSales.pig, line 19, column 27> In alias X, incompatible
>>>>>>>>>> types in NotEqual Operator left hand side:int right hand
>>>>>>>>>> side:chararray.
>>>>>>>>>> 
>>>>>>>>>> The error makes sense because the fields I have are INT type and hence
>>>>>>>>>> cannot be compared to a chararray.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks for the prompt response though.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Feb 17, 2016 16:32, "Chandeep Singh" <c...@chandeep.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Try adding != '' along with IS NOT NULL.
>>>>>>>>>>> 
>>>>>>>>>>>> On Feb 18, 2016, at 12:26 AM, Parth Sawant <parth.sawan...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm trying to Filter some null fields in Pig using 'IS NOT NULL'. For
>>>>>>>>>>>> some reason the null data values persist.
>>>>>>>>>>>> For example, when the output of the following filter is stored, it
>>>>>>>>>>>> contains null values for ABC and PQR.
>>>>>>>>>>>> 
>>>>>>>>>>>> X = FILTER D BY (ABC IS NOT NULL) AND (ABC==1) AND (PQR==1) AND
>>>>>>>>>>>> (PQR IS NOT NULL);
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Can someone help with this?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> 
>>>>>>>>>>>> Parth S
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>>> <Sample_in.txt>
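
The filters in this thread depend on how Pig treats nulls. A comparison such as IS_REPORTED==1 evaluates to null when the field is null, and FILTER keeps only rows whose condition is true, so the explicit "is not null" checks are redundant next to the equality tests. The sketch below models that behaviour in Python (it is not Pig code) on the comma-delimited sample from the thread. The helper names parse_line, eq, and3 and keep are made up for illustration, and padding short rows with nulls is an assumption based on the DUMP output shown earlier in the thread.

```python
# Rough Python model of Pig's null handling in FILTER (illustrative only).

def parse_line(line, delimiter=",", width=4):
    """Split a delimited line; empty or missing fields become None (null)."""
    parts = line.rstrip("\n").split(delimiter)
    parts += [""] * (width - len(parts))  # assumption: short rows are null-padded
    return tuple(int(p) if p != "" else None for p in parts[:width])

def eq(value, constant):
    """Equality where a null operand makes the whole comparison null."""
    return None if value is None else value == constant

def and3(*terms):
    """Three-valued AND: any false gives false, otherwise any null gives null."""
    if any(t is False for t in terms):
        return False
    if any(t is None for t in terms):
        return None
    return True

def keep(row):
    """Mimic FILTER: a row survives only if the condition is strictly true."""
    is_reported, status_id, program_id, group_id = row
    condition = and3(
        eq(is_reported, 1),
        eq(status_id, 2),
        eq(program_id, 1),
        eq(group_id, 76),
    )
    return condition is True

# Comma-delimited sample rows quoted in the thread.
SAMPLE = ["1,,2,76", "1,,,76", ",2,,76", "1,1,2,", "1,1,1,76", "1,2,1,76"]

rows = [parse_line(line) for line in SAMPLE]
kept = [row for row in rows if keep(row)]
print(kept)  # [(1, 2, 1, 76)]
```

Only the last sample row survives, matching the DUMP X output in the thread, even though the model contains no explicit null test.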