Hello,
I'm trying to write a Pig script to parse a CSV file, and I'm having problems
with the FLATTEN and EXTRACT (RegexExtract) functions. When I run the script
below, I get:
ERROR 1017: Schema mismatch. A basic type on flattening cannot have more than
one column. User defined schema: {startip: chararray,endip: chararray,country:
chararray,region: chararray,city: chararray,postal: chararray,lat:
chararray,lon: chararray,dma: chararray,areacode: chararray}
If I take FLATTEN out, I get this instead:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during
parsing. Encountered "" at line 27, column 6.
Was expecting one of:
Here is an example of my data:
startIpNum,endIpNum,country,region,city,postalCode,latitude,longitude,dmaCode,areaCode
1.0.0.0,1.0.0.255,"AU","","","",-27.0000,133.0000,,
1.0.1.0,1.0.1.255,"FR","B8","Avignon","",43.9500,4.8167,,
Here is my program:
--declare udf
REGISTER file:/usr/lib/pig/contrib/piggybank/java/piggybank.jar
--define aliases for any classes you want to use
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract();
--load in data
rawlogs = load 'geoshort.csv' using TextLoader as (line:chararray);
--print out a couple lines of data
illustrate rawlogs;
logbase = foreach rawlogs generate
    FLATTEN(
        EXTRACT(line, '^(\\S+) (\\S+) "(.+?)" "(.+?)" "(.+?)" "(.+?)" (\\S+) (\\S+) (\\S+) (\\S+)')
    )
    as (
        startip: chararray,
        endip: chararray,
        country: chararray,
        region: chararray,
        city: chararray,
        postal: chararray,
        lat: chararray,
        lon: chararray,
        dma: chararray,
        areacode: chararray
    );
illustrate logbase;
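In case it helps narrow things down, here is a small check outside Pig, written in Python rather than Pig Latin. It is only a sketch: it assumes Python's `re` module treats these simple patterns the same way Java's regex engine does. `SPACE_PATTERN` mirrors the pattern in the script above, while `COMMA_PATTERN` and the `extract` helper are hypothetical names made up for this test and are not part of the script.

```python
import re

# The two sample rows from the data shown above.
rows = [
    '1.0.0.0,1.0.0.255,"AU","","","",-27.0000,133.0000,,',
    '1.0.1.0,1.0.1.255,"FR","B8","Avignon","",43.9500,4.8167,,',
]

# Same shape as the pattern in the Pig script: fields separated by spaces.
SPACE_PATTERN = re.compile(
    r'^(\S+) (\S+) "(.+?)" "(.+?)" "(.+?)" "(.+?)" (\S+) (\S+) (\S+) (\S+)'
)

# Hypothetical alternative: fields separated by commas, empty fields allowed.
COMMA_PATTERN = re.compile(
    r'^([^,]*),([^,]*),"([^"]*)","([^"]*)","([^"]*)","([^"]*)",'
    r'([^,]*),([^,]*),([^,]*),([^,]*)$'
)


def extract(pattern, row):
    """Return the captured groups as a tuple, or None if the row does not match."""
    match = pattern.match(row)
    return match.groups() if match else None


space_results = [extract(SPACE_PATTERN, row) for row in rows]
comma_results = [extract(COMMA_PATTERN, row) for row in rows]

for row, by_space, by_comma in zip(rows, space_results, comma_results):
    print(row)
    print("  space pattern:", by_space)
    print("  comma pattern:", by_comma)
```

In this check the space-separated pattern matches neither sample row, while the comma-separated one yields ten fields per row. I don't know whether this is related to the two errors above, but it suggests the pattern may not fit comma-separated data.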
Thanks in advance.