[jira] Updated: (PIG-947) Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.
[ https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-947:

    Fix Version/s: (was: 0.8.0)

I don't think anybody is signed up for this issue. Please relink it to the release if you are interested in working on it, and assign it to yourself.

Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.

    Key: PIG-947
    URL: https://issues.apache.org/jira/browse/PIG-947
    Project: Pig
    Issue Type: Bug
    Components: data
    Environment: Pig on Hadoop 18
    Reporter: Gandul Azul

The PigStorage parser for bags does not work correctly when a tuple in a bag is preceded by a space. For example, the following is parsed correctly:

    {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

while this is not (note the space before the second tuple):

    {(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

It seems that when the parser encounters the space, it treats the rest of the line as a String. With a schema, this results in a typecast of string to databag, which results in an exception.

    |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException Encountered STRING at
    |line 1, column 43.
    |Was expecting:
    |( ...
    | field discarded

Below is the parser debug output for the failing sequence "2.071200,0), (" from the input above:

    ** FOUND A DOUBLENUMBER MATCH (2.071200) **
    Call: AtomDatum
    Consumed token: DOUBLENUMBER: 2.071200 at line 1 column 31
    Return: AtomDatum
    Return: Datum
    Matched the empty string as STRING token.
    Current character : , (44) at line 1 column 39
    No more string literal token matches are possible.
    Currently matched the first 1 characters as a , token.
    ** FOUND A , MATCH (,) **
    Consumed token: , at line 1 column 39
    Call: Datum
    Matched the empty string as STRING token.
    Current character : 0 (48) at line 1 column 40
    No string literal matches possible.
    Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER }
    Current character : 0 (48) at line 1 column 40
    Currently matched the first 1 characters as a SIGNEDINTEGER token.
    Possible kinds of longer matches : { STRING, SIGNEDINTEGER, DOUBLENUMBER, LONGINTEGER, FLOATNUMBER }
    Current character : ) (41) at line 1 column 41
    Currently matched the first 1 characters as a SIGNEDINTEGER token.
    Putting back 1 characters into the input stream.
    ** FOUND A SIGNEDINTEGER MATCH (0) **
    Call: AtomDatum
    Consumed token: SIGNEDINTEGER: 0 at line 1 column 40
    Return: AtomDatum
    Return: Datum
    Matched the empty string as STRING token.
    Current character : ) (41) at line 1 column 41
    No more string literal token matches are possible.
    Currently matched the first 1 characters as a ) token.
    ** FOUND A ) MATCH ()) **
    Return: Tuple
    Consumed token: ) at line 1 column 41
    Matched the empty string as STRING token.
    Current character : , (44) at line 1 column 42
    No more string literal token matches are possible.
    Currently matched the first 1 characters as a , token.
    ** FOUND A , MATCH (,) **
    Consumed token: , at line 1 column 42
    Matched the empty string as STRING token.
    Current character : (32) at line 1 column 43
    No string literal matches possible.
    Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER }
    Current character : (32) at line 1 column 43
    Currently matched the first 1 characters as a STRING token.
    Possible kinds of longer matches : { STRING, SIGNEDINTEGER, DOUBLENUMBER }
    Current character : ( (40) at line 1 column 44
    Currently matched the first 1 characters as a STRING token.
    Putting back 1 characters into the input stream.
    ** FOUND A STRING MATCH ( ) **
    Return: Bag
    Return: Datum
    Return: Parse

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
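The failure described in the PIG-947 report can be reproduced in miniature. The following is only a sketch under stated assumptions: it is a hand-written recursive-descent parser in Python, not Pig's actual JavaCC grammar, and the names parse_bag, BagParseError, and skip_whitespace are hypothetical. It shows that a parser which does not skip whitespace between tokens rejects the second input at column 43 (the column quoted in the warning), while skipping spaces and tabs before each structural token accepts both inputs.

```python
class BagParseError(ValueError):
    """Raised when the bag text does not match the expected structure."""

    def __init__(self, column, found, expected):
        super().__init__(
            f"Encountered {found!r} at line 1, column {column}; "
            f"was expecting {expected!r}"
        )
        self.column = column  # 1-based, like the columns in the report


def parse_bag(text, skip_whitespace=False):
    """Parse '{(a,b),(c,d)}' into a list of tuples of strings.

    With skip_whitespace=False the parser is strict, mimicking the
    reported behaviour; with True it tolerates spaces and tabs before
    '{', '(', ',', ')' and '}'.
    """
    pos = 0

    def skip():
        nonlocal pos
        if skip_whitespace:
            while pos < len(text) and text[pos] in " \t":
                pos += 1

    def expect(ch):
        nonlocal pos
        skip()
        if pos >= len(text) or text[pos] != ch:
            found = text[pos] if pos < len(text) else "<EOF>"
            raise BagParseError(pos + 1, found, ch)
        pos += 1

    def parse_tuple():
        nonlocal pos
        expect("(")
        fields = []
        while True:
            start = pos
            # A field runs up to the next ',' or ')'.
            while pos < len(text) and text[pos] not in ",)":
                pos += 1
            fields.append(text[start:pos])
            if pos < len(text) and text[pos] == ",":
                pos += 1
                continue
            expect(")")
            return tuple(fields)

    expect("{")
    tuples = []
    skip()
    if pos < len(text) and text[pos] != "}":
        tuples.append(parse_tuple())
        while True:
            skip()
            if pos < len(text) and text[pos] == ",":
                pos += 1
                tuples.append(parse_tuple())
            else:
                break
    expect("}")
    return tuples


# The two inputs from the report; BAD has a space before the second tuple.
GOOD = ("{(-5.243084,3.142401,0.000138,2.071200,0),"
        "(-6.021349,0.992683,0.44,0.992683,0),"
        "(-10.426160,20.251774,0.000892,5.691086,0)}")
BAD = ("{(-5.243084,3.142401,0.000138,2.071200,0), "
       "(-6.021349,0.992683,0.44,0.992683,0),"
       "(-10.426160,20.251774,0.000892,5.691086,0)}")
```

Calling parse_bag(BAD) raises BagParseError with column 43, and parse_bag(BAD, skip_whitespace=True) returns the same three tuples as parse_bag(GOOD). Whether the real fix belongs in the grammar's skip rules or elsewhere is not stated in the report.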
[jira] Updated: (PIG-947) Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.
[ https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-947:

    Fix Version/s: 0.8.0

Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.

    Key: PIG-947
    URL: https://issues.apache.org/jira/browse/PIG-947
    Project: Pig
    Issue Type: Bug
    Components: data
    Environment: Pig on Hadoop 18
    Reporter: Gandul Azul
    Fix For: 0.8.0
[jira] Updated: (PIG-947) Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.
[ https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gandul Azul updated PIG-947:

Description:

The PigStorage parser for bags does not work correctly when a tuple in a bag is preceded by a space. For example, the following is parsed correctly:

    {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

while this is not (note the space before the second tuple):

    {(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

It seems that when the parser encounters the space, it treats the rest of the line as a String. With a schema, this results in a typecast of string to databag, which results in an exception.

    |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException Encountered STRING at
    |line 1, column 43.
    |Was expecting:
    |( ...
    | field discarded

Below is the parser debug output for the failing sequence "2.071200,0), (" from the input above:

    ** FOUND A DOUBLENUMBER MATCH (2.071200) **
    Call: AtomDatum
    Consumed token: DOUBLENUMBER: 2.071200 at line 1 column 31
    Return: AtomDatum
    Return: Datum
    Matched the empty string as STRING token.
    Current character : , (44) at line 1 column 39
    No more string literal token matches are possible.
    Currently matched the first 1 characters as a , token.
    ** FOUND A , MATCH (,) **
    Consumed token: , at line 1 column 39
    Call: Datum
    Matched the empty string as STRING token.
    Current character : 0 (48) at line 1 column 40
    No string literal matches possible.
    Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER }
    Current character : 0 (48) at line 1 column 40
    Currently matched the first 1 characters as a SIGNEDINTEGER token.
    Possible kinds of longer matches : { STRING, SIGNEDINTEGER, DOUBLENUMBER, LONGINTEGER, FLOATNUMBER }
    Current character : ) (41) at line 1 column 41
    Currently matched the first 1 characters as a SIGNEDINTEGER token.
    Putting back 1 characters into the input stream.
    ** FOUND A SIGNEDINTEGER MATCH (0) **
    Call: AtomDatum
    Consumed token: SIGNEDINTEGER: 0 at line 1 column 40
    Return: AtomDatum
    Return: Datum
    Matched the empty string as STRING token.
    Current character : ) (41) at line 1 column 41
    No more string literal token matches are possible.
    Currently matched the first 1 characters as a ) token.
    ** FOUND A ) MATCH ()) **
    Return: Tuple
    Consumed token: ) at line 1 column 41
    Matched the empty string as STRING token.
    Current character : , (44) at line 1 column 42
    No more string literal token matches are possible.
    Currently matched the first 1 characters as a , token.
    ** FOUND A , MATCH (,) **
    Consumed token: , at line 1 column 42
    Matched the empty string as STRING token.
    Current character : (32) at line 1 column 43
    No string literal matches possible.
    Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER }
    Current character : (32) at line 1 column 43
    Currently matched the first 1 characters as a STRING token.
    Possible kinds of longer matches : { STRING, SIGNEDINTEGER, DOUBLENUMBER }
    Current character : ( (40) at line 1 column 44
    Currently matched the first 1 characters as a STRING token.
    Putting back 1 characters into the input stream.
    ** FOUND A STRING MATCH ( ) **
    Return: Bag
    Return: Datum
    Return: Parse

was:

The PigStorage parser for bags does not work correctly when a tuple in a bag is preceded by a space. For example, the following is parsed correctly:

    {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

while this is not (note the space before the second tuple):

    {(-5.243084,3.142401,0.000138,2.071200,0), (-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}

It seems that when the parser encounters the space, it treats the rest of the line as a String. With a schema, this results in a typecast of string to databag, which results in an exception. Consequently, a bag written out with PigStorage cannot be loaded back with PigStorage because of this inconsistency.

    |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field being converted to type bag, caught ParseException Encountered STRING at
    |line 1, column 43.
    |Was expecting:
    |( ...
    | field discarded

Below is the parser debug output for the failing sequence "2.071200,0), (" from the input above:

    ** FOUND A DOUBLENUMBER MATCH (2.071200) **
    Call: AtomDatum
    Consumed token: DOUBLENUMBER: 2.071200 at