You can use a UDF like the one below to deal with the NULLs. The Pig script comes first, followed by the Java source of the UDF.
register 'mypiggybank.jar';
define Nvl piggybank.Nvl();
input_lines = LOAD 'test_Nvl.in' AS (line:chararray);
describe input_lines;
dump input_lines;
new_list = FOREACH input_lines GENERATE Nvl(line, 'n/a');
describe new_list;
dump new_list;
-- eof
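package piggybank;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Returns the first argument, or the second argument when the first one
// is null or an empty string. Both arguments are expected to be chararray.
public class Nvl extends EvalFunc<String>
{
    @Override
    public String exec(Tuple inputA) throws IOException
    {
        // Nothing to evaluate unless both arguments are present.
        if (inputA == null || inputA.size() < 2)
        {
            return null;
        }
        try
        {
            String inputVal = (String) inputA.get(0);
            String defaultVal = (String) inputA.get(1);
            if (inputVal == null || inputVal.length() == 0)
            {
                return defaultVal;
            }
            return inputVal;
        }
        catch (Exception e)
        {
            // Throwing an exception will cause the task to fail.
            throw new IOException("Nvl: could not process input tuple", e);
        }
    }
}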
// eof
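
Applied to the statement from your mail, it could look roughly like the sketch below. This is untested: 'raw' is a made-up relation name standing in for the one you load, and I am assuming that site and name are declared as chararray (the cast in the UDF fails otherwise).

```
-- Sketch only: 'raw' is a hypothetical relation name, not from the original mail.
register 'mypiggybank.jar';
define Nvl piggybank.Nvl();

tmp = FOREACH raw GENERATE
    Nvl(site, 'other') AS site,
    REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}', 1) AS project,
    -- Wrap the whole bincond so a null or empty match also becomes 'other'.
    Nvl(((ami MATCHES '.*datatype.*')
            ? REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){5}', 1)
            : 'other'), 'other') AS datatype,
    ami, duid, nbfiles, length, rnbfiles, rlength, name;
```

If you only need to catch real nulls and not empty strings, a plain bincond such as (site IS NULL ? 'other' : site) should do the job without a UDF.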
>________________________________
> From: Mario Lassnig <[email protected]>
>To: [email protected]
>Sent: Thursday, June 7, 2012 7:37:36 AM
>Subject: Replace null with string
>
>Hello,
>
>I'm having a lot of null entries in my data. Due to later processing it would
>be very helpful if I could set a default value for null to be the string
>"other". I couldn't find a way to do this (version 0.8.1-cdh3u4)
>
>Also, I have some variables in my GENERATE statements that can potentially
>return null, and I would need something similar to the SQL DECODE function to
>get the "other" string instead of null.
>
>Example:
>
>tmp = FOREACH dump GENERATE site, REGEX_EXTRACT(name, '^(?:([^.]+)\\.?){1}',
>1) AS project, ((ami MATCHES '.*datatype.*') ? REGEX_EXTRACT(name,
>'^(?:([^.]+)\\.?){5}', 1) : 'other') AS datatype, ami, duid, nbfiles, length,
>rnbfiles, rlength, name;
>
>Here: 'site' and 'datatype' could return an empty string (which is valid) and
>is interpreted as null, but should be "other" instead.
>
>Thanks a lot,
>Mario