hi:

If you do not  find an udf in piggybank or in another    resources that
works fine with your requeriments you can create your own udf to filter,
evaluate, storage, etc  or extend someone.

For example  to storage in multiple files you can use
http://pig.apache.org/docs/r0.11.1/api/org/apache/pig/piggybank/storage/MultiStorage.html



Cheers


2013/7/30 Pablo Nebrera <[email protected]>

> Hello
>
> I have this pig script:
>
> register '/path_to_jars/elephant-bird-pig-3.0.7.jar';
> register '/path_to_jars/json-simple-1.1.1.jar';
> register '/path_to_jars/redBorder-pig.jar';
>
> data = load '/data/events/2013/07/29/16h03/part-00001.gz' using
> com.twitter.elephantbird.pig.load.JsonLoader() as (json: map[]);
> cleaned = foreach data generate json#'timestamp'/3600*3600 as timestamp,
> (chararray) json#'sensor_name' as sensor_name, (int) json#'sig_generator'
> as sig_generator, (int) json#'sig_id' as sig_id, json as data;
> grouped = GROUP cleaned BY (timestamp, sensor_name, sig_generator, sig_id);
>
>
>
>
>
> The input file is json file something like:
>
> {"timestamp":1374820560, "sensor_id":2, "sensor_name":"sensor-produccion",
> "sig_generator":1, "sig_id":402, "rev":11, "priority":3,
> "classification":"Misc activity", "msg":"Snort Alert [1:402:11]",
> "payload":"XXXXXXXXX", "proto":"icmp", "proto_id":1, "src":3232287141,
> "src_str":"192.168.201.165", "src_name":"192.168.201.165", "src_net":"
> 0.0.0.0/0", "src_net_name":"0.0.0.0/0", "dst_name":"192.168.201.254",
> "dst_str":"192.168.201.254", "dst_net":"0.0.0.0/0", "dst_net_name":"
> 0.0.0.0/0", "src_country":"N/A", "dst_country":"N/A",
> "src_country_code":"N/A", "dst_country_code":"N/A", "srcport":0,
> "dst":3232287230, "dstport":0, "ethsrc":"0:25:90:56:91:2d",
> "ethdst":"6c:62:6d:42:46:c3", "ethlength":594, "vlan":201,
> "vlan_name":"interna", "vlan_priority":0, "vlan_drop":0, "ttl":64,
> "tos":192, "id":53186, "dgmlen":576, "iplen":65544, "icmptype":3,
> "icmpcode":3, "icmpid":0, "icmpseq":0}
> {"timestamp":1374820618, "sensor_id":2, "sensor_id_snort":0,
> "sensor_name":"sensor-produccion", "sig_generator":1, "sig_id":402,
> "rev":11, "priority":3, "classification":"Misc activity", "msg":"Snort
> Alert [1:402:11]", "payload":"XXXXX2", "proto":"icmp", "proto_id":1,
> "src":3232261121, "src_str":"192.168.100.1", "src_name":"192.168.100.1",
> "src_net":"0.0.0.0/0", "src_net_name":"0.0.0.0/0",
> "dst_name":"192.168.100.125", "dst_str":"192.168.100.125", "dst_net":"
> 0.0.0.0/0", "dst_net_name":"0.0.0.0/0", "src_country":"N/A",
> "dst_country":"N/A", "src_country_code":"N/A", "dst_country_code":"N/A",
> "srcport":0, "dst":3232261245, "dstport":0, "ethsrc":"6c:62:6d:42:46:c3",
> "ethdst":"0:1e:c9:ef:85:fd", "ethlength":105, "vlan":100,
> "vlan_name":"100", "vlan_priority":0, "vlan_drop":0, "ttl":64, "tos":192,
> "id":30974, "dgmlen":87, "iplen":89088, "icmptype":3, "icmpcode":3,
> "icmpid":0, "icmpseq":0}
>
>
> The describe of grouped variable is:
>
> grunt> describe grouped
> 2013-07-30 10:11:58,834 [main] WARN  org.apache.pig.PigServer - Encountered
> Warning IMPLICIT_CAST_TO_INT 1 time(s).
> grouped: {group: (timestamp: int,sensor_name: chararray,sig_generator:
> int,sig_id: int),cleaned: {(timestamp: int,sensor_name:
> chararray,sig_generator: int,sig_id: int,data: map[])}}
>
>
>
> And a dump example is:
>
>
> ((1374818400,sensor-produccion,1,402),{(1374818400,sensor-produccion,1,402,[dst_country_code#N/A,rev#11,sig_id#402,proto_id#1,src_net_name#
>
> 0.0.0.0/0,ethlength#105,payload#45003bd9d5400401117dc0a8647dc0a864141403502745aa4b2e1001000000d726564426f7264657244454c4c00101,dst#3232261245,dstport#0,timestamp#1374820435,sensor_id_snort#0,id#30968,vlan_name#100,tos#192,src_net#0.0.0.0/0,priority#3,src_name#192.168.100.1,dgmlen#87,ethsrc#6c:62:6d:42:46:c3,src#3232261121,icmpcode#3,src_str#192.168.100.1,srcport#0,sensor_id#2,dst_net#0.0.0.0/0,ttl#64,msg#SnortAlert
>
> [1:402:11],proto#icmp,vlan_priority#0,dst_country#N/A,dst_name#192.168.100.125,dst_net_name#
>
> 0.0.0.0/0,ethdst#0:1e:c9:ef:85:fd,iplen#89088,src_country_code#N/A,sensor_name#sensor-produccion,sig_generator#1,dst_str#192.168.100.125,class......
> })
>
>
> I have some questions:
>
> 1.- When I do the the group by I would like to get an entry with the first
> entry only. Somethink like:
>
>                   ((1374818400,sensor-produccion,1,402), { only the first
> tuple match the groupby })
>
> 2.- I would like to store this information in multiple files. Something
> like:
>
>                    /data/(1374818400,sensor-produccion,1,402)/data.gz
>
>
> How could I do this ?
>
> Thanks
>
>
>
>
> Pablo Nebrera
>

Reply via email to