Hello,

I'm trying to write a Pig script to examine a CSV file, and I'm having problems 
with the FLATTEN and EXTRACT functions.  When I run the Pig script below, I get:

ERROR 1017: Schema mismatch. A basic type on flattening cannot have more than 
one column. User defined schema: {startip: chararray,endip: chararray,country: 
chararray,region: chararray,city: chararray,postal: chararray,lat: 
chararray,lon: chararray,dma: chararray,areacode: chararray}

If I take FLATTEN out, I get this instead:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during 
parsing. Encountered "" at line 27, column 6.
Was expecting one of:



Here is an example of my data:
startIpNum,endIpNum,country,region,city,postalCode,latitude,longitude,dmaCode,areaCode
1.0.0.0,1.0.0.255,"AU","","","",-27.0000,133.0000,,
1.0.1.0,1.0.1.255,"FR","B8","Avignon","",43.9500,4.8167,,

Here is my program:

--declare udf
REGISTER file:/usr/lib/pig/contrib/piggybank/java/piggybank.jar

--define aliases for any classes you want to use
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract();

--load in data
rawlogs = load 'geoshort.csv' using TextLoader as (line:chararray);

--print out a couple lines of data 
illustrate rawlogs;

logbase = foreach rawlogs generate
  FLATTEN(  
    EXTRACT(line, '^(\\S+) (\\S+) "(.+?)" "(.+?)" "(.+?)" "(.+?)" (\\S+) (\\S+) (\\S+) (\\S+)')
  )
  as (
    startip:  chararray,
    endip:    chararray,
    country:  chararray,
    region:   chararray,
    city:     chararray,
    postal:   chararray,
    lat:      chararray,
    lon:      chararray,
    dma:      chararray,
    areacode: chararray
  );

illustrate logbase;
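
Separately from the two errors above, the pattern can be checked against a
sample row outside Pig. The sketch below is Python rather than Pig, and it
assumes Python's re module treats this particular pattern the same way Java's
regex engine would. The names PATTERN, SAMPLE, space_pattern_matches and
split_csv_fields are made up for illustration. It suggests the pattern expects
space-separated fields, while the sample rows are comma-separated, so it may be
worth revisiting the regex whatever the cause of the errors turns out to be.

```python
import csv
import re

# Pattern from the script, with Pig's doubled backslashes written as single ones.
PATTERN = r'^(\S+) (\S+) "(.+?)" "(.+?)" "(.+?)" "(.+?)" (\S+) (\S+) (\S+) (\S+)'

# Second data row from the sample data in this post.
SAMPLE = '1.0.1.0,1.0.1.255,"FR","B8","Avignon","",43.9500,4.8167,,'


def space_pattern_matches(line):
    """Return True if the space-separated pattern matches the line."""
    return re.search(PATTERN, line) is not None


def split_csv_fields(line):
    """Split one CSV line into fields, honouring double-quoted values."""
    return next(csv.reader([line]))


if __name__ == "__main__":
    print("pattern matches:", space_pattern_matches(SAMPLE))
    print("csv fields:", split_csv_fields(SAMPLE))
```

The CSV split yields ten fields for this row, which lines up with the ten
columns named in the schema.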



Thanks in advance.
