I'm having problems using Pig's STRSPLIT (on Amazon's cloud computing environment). I also noticed that STRSPLIT isn't documented in the Pig Latin Reference Manual, so I learned about it from other sources.
My problem is that in certain cases STRSPLIT returns null, and I have no idea why. Here is an actual session I ran to demonstrate the problem:

grunt> CAT s3://otg-nlandys/pig-tut/bin-proto-4;
Meta 1234567890 foo 34
Movement 1234567890 Rambetter 1/1 2/3
Movement 1234567890 Freddyman 10/1 10/2
grunt> A = LOAD 's3://otg-nlandys/pig-tut/bin-proto-4';
grunt> DUMP A;
(Meta,1234567890,foo,34)
(Movement,1234567890,Rambetter,1/1,2/3)
(Movement,1234567890,Freddyman,10/1,10/2)
grunt> MOVEMENT = FILTER A BY (chararray) $0 == 'Movement';
grunt> DUMP MOVEMENT;
(Movement,1234567890,Rambetter,1/1,2/3)
(Movement,1234567890,Freddyman,10/1,10/2)
grunt> TEST = FOREACH MOVEMENT GENERATE $3 AS startpos:chararray;
grunt> DUMP TEST;
(1/1)
(10/1)
grunt> POSA = FOREACH TEST GENERATE STRSPLIT(startpos,'/');
grunt> DUMP POSA;
()
()
_________________________________________________________________

grunt> CAT s3://otg-nlandys/pig-tut/bin-proto-5;
1/1
10/1
grunt> B = LOAD 's3://otg-nlandys/pig-tut/bin-proto-5' AS startpos:chararray;
grunt> DUMP B;
(1/1)
(10/1)
grunt> POSB = FOREACH B GENERATE STRSPLIT(startpos,'/');
grunt> DUMP POSB;
((1,1))
((10,1))
_________________________________________________________________

My question is why POSA contains empty rows and POSB doesn't, when it seems that they should be identical. I'm kind of new to Pig and realize that the problem might be a shortcoming of UDFs and how Pig works with data of varying column count, but I would like an explanation. Thanks.

I also noticed one other minor bug with STRSPLIT: if the first argument is a bytearray instead of a chararray, it returns null. So you have to explicitly cast the bytearray to chararray for it to work. It seems that this could be automated in the language, no?

- Nerius
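
P.S. To make the explicit cast mentioned above concrete, here is a minimal sketch of what I mean. It reuses the file and relation names from the session; POSA2 is just a name made up for this example. It assumes the (chararray) cast can be applied directly to the STRSPLIT argument, the same way it is used in the FILTER above. I am not claiming this explains the POSA case, only showing the form of the cast.

```pig
-- Sketch only: same input file and relation names as in the session above.
A = LOAD 's3://otg-nlandys/pig-tut/bin-proto-4';
MOVEMENT = FILTER A BY (chararray) $0 == 'Movement';

-- Cast the untyped (bytearray) field to chararray explicitly at the
-- point where it is passed to STRSPLIT, instead of relying on the
-- AS startpos:chararray declaration in an earlier FOREACH.
-- POSA2 is a hypothetical relation name introduced for this example.
POSA2 = FOREACH MOVEMENT GENERATE STRSPLIT((chararray) $3, '/');
DUMP POSA2;
```

The difference from the session is only where the type is stated: here the cast sits on the value itself, not on the alias.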
