Re: How to access to the tuple items of REGEX_EXTRACT_ALL ?

brice lecomte Thu, 28 Feb 2013 02:27:51 -0800

Hi Johnny,
Bad news:

grunt> REGISTER json-simple-1.1.1.jar
grunt> REGISTER lib/jackson-core-asl-1.8.8.jar
grunt> REGISTER lib/jackson-mapper-asl-1.8.8.jar
grunt> REGISTER /usr/local/pig-0.10.1-src/build/ivy/lib/Pig/avro-1.5.3.jar
grunt> REGISTER /usr/local/pig-0.10.1-src/contrib/piggybank/java/piggybank.jar
grunt> logs = LOAD 'auth.log' as (f1:chararray);
grunt> c = foreach logs generate REGEX_EXTRACT_ALL(f1, '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
grunt> df = GROUP c by ($1, $4);
2013-02-28 10:57:32,630 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000:
<line 3, column 17> Out of bound access. Trying to access non-existent
column: 1. Schema org.apache.pig.builtin.regex_extract_all_f1_4:tuple()
*has 1 column(s)*.
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
grunt> dump c;
[...]


*((Feb,28,10:50:13,hadoop-master,sshd,debug1: session_input_channel_req:
session 0 req window-change))*

=> looks like a tuple nested inside a tuple?

grunt> df = GROUP c by ($1, $4);
2013-02-28 10:57:59,274 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000:
<line 3, column 17> Out of bound access. Trying to access non-existent
column: 1. Schema org.apache.pig.builtin.regex_extract_all_f1_10:tuple()
has 1 column(s).
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
grunt> df = GROUP c by (c.$1, c.$4);
2013-02-28 10:58:06,873 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: Pig script failed to parse:
<line 3, column 17> Invalid scalar projection: c
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log
grunt> df = GROUP c by (c.$0.$1, c.$0.$4);
grunt> dump df;
[...]

2013-02-28 10:58:46,781 [Thread-16] WARN 
org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar
has more than one row in the output. 1st :
((Feb,24,07:39:01,hadoop-master,CRON,pam_unix(cron:session): session
opened for user root by (uid=0))), 2nd
:((Feb,24,07:39:01,hadoop-master,CRON,pam_unix(cron:session): session
closed for user root))
        at
org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
[...]

grunt> df = GROUP c by (FLATTEN(c.$1), FLATTEN(c.$4));
2013-02-28 10:59:31,187 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: Pig script failed to parse:
<line 4, column 25> Invalid scalar projection: c
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log

grunt> df = GROUP c by (FLATTEN(c).$1, FLATTEN(c).$4);
2013-02-28 10:59:51,062 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: Pig script failed to parse:
<line 4, column 25> Invalid scalar projection: c : A column needs to be
projected from a relation for it to be used as a scalar
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log

grunt> df = GROUP c by (FLATTEN(c.$0).$1, FLATTEN(c.$0).$4);
2013-02-28 11:17:46,744 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1070: Could not resolve FLATTEN using imports: [,
org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/hduser/pigtmp/pig_1362045005909.log

I even tried the Perl way:
grunt> (m:chararray, d:int, time:chararray, hostname:chararray,
service:chararray, info:chararray) = foreach logs  generate
REGEX_EXTRACT_ALL(f1, '([a-zA-Z]{3,3}) ([0-9]{1,2})
([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+)
([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
2013-02-28 11:23:47,995 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Lexical error at line 1, column 1. 
Encountered: "(" (40), after : ""

:(
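
For reference, a minimal sketch of one approach that may avoid the store/reload round trip: apply FLATTEN to the REGEX_EXTRACT_ALL result inside the FOREACH (rather than inside the GROUP) and name the fields with an AS clause. The field names are borrowed from the LOAD ... AS schema in the original question; the relation names parsed, by_day and counts are hypothetical, every field is kept as chararray on the assumption that this sidesteps cast problems, and this has not been run against auth.log. Lines that do not match the pattern come back as null from REGEX_EXTRACT_ALL, so filtering them out first may be needed.

```pig
-- Sketch only: FLATTEN unnests the single tuple returned by REGEX_EXTRACT_ALL,
-- and the AS clause names the resulting top-level fields.
logs = LOAD 'auth.log' AS (f1:chararray);

parsed = FOREACH logs GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(f1, '([a-zA-Z]{3,3}) ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}) ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)'))
    AS (m:chararray, d:chararray, time:chararray, hostname:chararray, service:chararray, info:chararray);

-- The fields are now ordinary columns of 'parsed', so GROUP can use them by name.
by_day = GROUP parsed BY (d, service);

-- Hypothetical follow-up: count log lines per (day, service) pair.
counts = FOREACH by_day GENERATE FLATTEN(group) AS (d, service), COUNT(parsed) AS n;
DUMP counts;
```

The design choice here is to flatten once, at extraction time, so that every later statement works with named top-level columns instead of reaching into a nested tuple.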

On 27/02/2013 20:26, Johnny Zhang wrote:
> Hi, Brice:
> Instead of saving and reloading it, can you try 'dump c;' first and then use c.$0?
>
> Johnny
>
>
> On Wed, Feb 27, 2013 at 8:49 AM, brice lecomte <[email protected]> wrote:
>
>> Hello,
>> --Pig 0.10.0--
>> I'd like to access directly the result of:
>> grunt> c = foreach logs  generate REGEX_EXTRACT_ALL(f1, '([a-zA-Z]{3,3})
>> ([0-9]{1,2}) ([0-2]{1}[0-9]{1}:[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1})
>> ([a-zA-Z0-9-_]+) ([a-zA-Z]+)\\[[0-9]+\\]: (.*)');
>> grunt> illustrate c;
>>
>> -------------------------------------------------------------------------------------------------------------
>> | logs     | f1:chararray                                                                                   |
>> -------------------------------------------------------------------------------------------------------------
>> |          | Feb 24 20:09:01 hadoop-master CRON[3574]: pam_unix(cron:session): session closed for user root |
>> -------------------------------------------------------------------------------------------------------------
>>
>> ----------------------------------------------------------------------------
>> | c     | org.apache.pig.builtin.regex_extract_all_f1_178:tuple()          |
>> ----------------------------------------------------------------------------
>> |       | (Feb, ..., pam_unix(cron:session): session closed for user root) |
>> ----------------------------------------------------------------------------
>>
>> but the only way I found is to store and reload it:
>>
>> grunt> store c into 'pig/AUTH.result';
>> grunt> auth = LOAD 'pig/AUTH.result/part-m-00000' USING PigStorage(',')
>> AS (m:chararray, d:int, time:chararray, hostname:chararray,
>> service:chararray, info:chararray);
>> grunt> day_frequency = GROUP auth by (d,service);
>> ...
>>
>> Is there a way to name the tuple items, or to access them with something
>> like c.$0 or FLATTEN(c).$0?
>>
>> Thanks,
>> Brice
>>
>>
