[jira] Commented: (PIG-1101) Pig parser does not recognize its own data type in LIMIT statement
[ https://issues.apache.org/jira/browse/PIG-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780993#action_12780993 ]

Ashutosh Chauhan commented on PIG-1101:
----------------------------------------

The test failure seems to be a Hudson quirk; the same test passes on my local machine. This patch is ready for review.

Pig parser does not recognize its own data type in LIMIT statement
-------------------------------------------------------------------

                Key: PIG-1101
                URL: https://issues.apache.org/jira/browse/PIG-1101
            Project: Pig
         Issue Type: Bug
         Components: impl
   Affects Versions: 0.6.0
           Reporter: Viraj Bhat
           Assignee: Ashutosh Chauhan
           Priority: Minor
            Fix For: 0.7.0
        Attachments: pig-1101.patch

I have a Pig script in which I specify the number of records to limit as a long type.

{code}
A = LOAD '/user/viraj/echo.txt' AS (txt:chararray);
B = LIMIT A 10L;
DUMP B;
{code}

I get a parser error:

2009-11-21 02:25:51,100 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered <LONGINTEGER> "10L" at line 3, column 13. Was expecting: <INTEGER> ...
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:8963)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:8839)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.LimitClause(QueryParser.java:1656)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1280)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:682)
        at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)

In fact, 10L works in the FOREACH ... GENERATE construct.

Viraj

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
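The error shows that the grammar's LimitClause only accepts an <INTEGER> token. Until the patch lands, a workaround (assuming the limit fits in an int, which any practical LIMIT value does) is to drop the L suffix:

{code}
A = LOAD '/user/viraj/echo.txt' AS (txt:chararray);
B = LIMIT A 10;  -- plain integer literal parses; 10L trips the LONGINTEGER token
DUMP B;
{code}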
[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780998#action_12780998 ]

Alan Gates commented on PIG-1077:
----------------------------------

Patch checked into the 0.6 branch.

[Zebra] to support record(row)-based file split in Zebra's TableInputFormat
-----------------------------------------------------------------------------

                Key: PIG-1077
                URL: https://issues.apache.org/jira/browse/PIG-1077
            Project: Pig
         Issue Type: New Feature
   Affects Versions: 0.4.0
           Reporter: Chao Wang
           Assignee: Chao Wang
            Fix For: 0.6.0, 0.7.0
        Attachments: patch_Pig1077

TFile currently supports splitting by record sequence number (see HADOOP-6218). We want to use this to provide record(row)-based input split support in Zebra. One prominent benefit: for very large data files, we can create much more fine-grained input splits than before, when we could only create one big split per big file.

In more detail, the new row-based getSplits() works by default (when the user does not specify the number of splits to generate) as follows:

1) Select the biggest column group in terms of data size, split all of its TFiles according to the HDFS block size (64 MB or 128 MB), and get a list of physical byte offsets per TFile. For example, assume for the 1st TFile we get offset1, offset2, ..., offset10.

2) Invoke TFile.getRecordNumNear(long offset) to get the record number of a key-value pair near each byte offset. For the example above, say we get recordNum1, recordNum2, ..., recordNum10.

3) Stitch the ranges [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, recordNum10], [recordNum10+1, lastRecordNum] across all column groups, respectively, to form 11 record-based input splits for the 1st TFile (a sketch of this stitching follows this message).

4) For each input split, create a TFile scanner through TFile.createScannerByRecordNum(long beginRecNum, long endRecNum).

Note: the conversion from byte offset to record number is done by each mapper, rather than at the job initialization phase. This is due to a performance concern, since the conversion incurs some TFile reading overhead.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
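A minimal Java sketch of the stitching in steps 1)-3), under stated assumptions: RowRange and TFileIndex are hypothetical stand-ins, not Zebra's real classes; only the getRecordNumNear call mirrors the TFile API named above; and the conversion is done eagerly here even though, per the note above, the actual design defers it to each mapper.

{code}
import java.util.ArrayList;
import java.util.List;

public class RowSplitSketch {

    /** A record-number range [beginRecNum, endRecNum], inclusive, within one TFile. */
    static class RowRange {
        final long beginRecNum, endRecNum;
        RowRange(long begin, long end) { beginRecNum = begin; endRecNum = end; }
    }

    /** Hypothetical view of a TFile's index; mirrors the API named in the description. */
    interface TFileIndex {
        long getRecordNumNear(long byteOffset); // record number of a key-value pair near the offset
        long getLastRecordNum();
    }

    /**
     * Step 3: turn N block-boundary byte offsets into N+1 record-number ranges.
     * Each range later becomes one input split, read via
     * TFile.createScannerByRecordNum(beginRecNum, endRecNum).
     */
    static List<RowRange> stitch(TFileIndex tfile, long[] blockOffsets) {
        List<RowRange> ranges = new ArrayList<RowRange>();
        long begin = 0;
        // Real code must also guard against two offsets mapping to the same record number.
        for (long offset : blockOffsets) {
            long recNum = tfile.getRecordNumNear(offset); // step 2
            ranges.add(new RowRange(begin, recNum));
            begin = recNum + 1;
        }
        ranges.add(new RowRange(begin, tfile.getLastRecordNum())); // tail range
        return ranges; // 10 offsets yield the 11 splits in the example above
    }
}
{code}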
[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1077:
-----------------------------

    Fix Version/s: 0.6.0

[Zebra] to support record(row)-based file split in Zebra's TableInputFormat
-----------------------------------------------------------------------------

                Key: PIG-1077
                URL: https://issues.apache.org/jira/browse/PIG-1077
            Project: Pig
         Issue Type: New Feature
   Affects Versions: 0.4.0
           Reporter: Chao Wang
           Assignee: Chao Wang
            Fix For: 0.6.0, 0.7.0
        Attachments: patch_Pig1077

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: Is Pig dropping records?
Sam,

Can you post your changes to a Jira?

-D

On Fri, Nov 20, 2009 at 1:28 PM, Sam Rash s...@ning.com wrote:

Hi,

This reminds me of something else, though: I took the latest patch for PIG-911 (the sequence file reader) and found it skipped records.
https://issues.apache.org/jira/browse/PIG-911

What I found is that the condition in getNext() would miss records:

if (reader != null && (reader.getPosition() < end || !reader.syncSeen()) && reader.next(key, value)) { ... }

I had to change it to:

if (reader != null && reader.next(key, value) && (reader.getPosition() < end || !reader.syncSeen())) { ... }

(I also ended up breaking out read(key) and the code below it to support reading types other than Writable.)

This only happened when the files Pig read were more than one block; i.e., the records dropped were around block boundaries. Has anyone noticed this?

thx,
-sr

Sam Rash
s...@ning.com

On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:

Zaki,

Glad to hear it wasn't Pig's fault! Can you post a description of what was going on with S3, or at least how you fixed it?

-D

On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman zaki.raha...@gmail.com wrote:

Okay, fixed a problem with corrupted file transfers from S3... now wc -l produces the same 143710 records, so yeah, it's not a Pig problem, and now I am getting the correct result from both methods. Not sure what went wrong... thanks for the help though, guys.

On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair te...@yahoo-inc.com wrote:

Another thing to verify is that clickurl's position in the schema is correct.

-Thejas

On 11/19/09 11:43 AM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote:

Hmm... Are you sure that your records are separated by \n (newline) and fields by \t (tab)? If so, would it be possible for you to upload your dataset (possibly a smaller one) somewhere so that someone can take a look at it?

Ashutosh

On Thu, Nov 19, 2009 at 14:35, zaki rahaman zaki.raha...@gmail.com wrote:

> On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote:
>
> Hi Zaki,
>
> Just to narrow down the problem, can you do:
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
> dump A;

This produced 143710 records;

> and
>
> A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (timestamp:chararray, ip:chararray, userid:chararray, dist:chararray, clickid:chararray, usra:chararray, campaign:chararray, clickurl:chararray, plugin:chararray, tab:chararray, feature:chararray);
> dump A;

This produced 143710 records (so no problem there);

> and
>
> cut -f8 *week.46*clickLog.2009* | wc -l

This produced... 175572. Clearly, something is wrong...

> Thanks,
> Ashutosh

On Thu, Nov 19, 2009 at 14:03, zaki rahaman zaki.raha...@gmail.com wrote:

Hi All,

I have the following mini-script running as part of a larger set of scripts in a workflow. However, it seems like Pig is dropping records: when I run the same thing as a simple grep | wc -l, I get a completely different result (2500 with Pig vs. 3300).
The Pig script is as follows:

A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (timestamp:chararray, ip:chararray, userid:chararray, dist:chararray, clickid:chararray, usra:chararray, campaign:chararray, clickurl:chararray, plugin:chararray, tab:chararray, feature:chararray);
B = FILTER A BY clickurl matches '.*http://www.amazon.*';

dump B produces the following output:

2009-11-19 18:50:46,013 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: s3://kikin-pig-test/amazonoutput2
2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
2009-11-19 18:50:46,058 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

The bash command is simply:

cut -f8 *week.46*clickLog.2009* | fgrep http://www.amazon | wc -l

Both sets of inputs are the same files, and I'm not sure where the discrepancy is coming from. Any help would be greatly appreciated.

--
Zaki Rahaman
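One way to cross-check the "Records written : 2502" figure from inside Pig, rather than trusting the launcher log, is a GROUP ALL plus COUNT over the filtered relation. A minimal sketch continuing the script above:

{code}
C = GROUP B ALL;
D = FOREACH C GENERATE COUNT(B);  -- number of tuples the FILTER kept
DUMP D;
{code}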
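On Sam's PIG-911 fix earlier in the thread, here is a minimal self-contained sketch of the corrected read loop. The field names (reader, key, value, end) are borrowed from his snippet, not from the actual patch; the point is only the ordering: advance the reader first and then test the position, so the record straddling a block boundary is still consumed.

{code}
import java.io.IOException;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;

// Sketch only: reading one input split that ends at byte offset 'end'.
public class SplitReadSketch {
    private SequenceFile.Reader reader; // seeked to the first record of this split
    private Writable key;
    private Writable value;
    private long end; // byte offset where this split ends

    /** Reads the next record into key/value; returns false when the split is exhausted. */
    public boolean nextKeyValue() throws IOException {
        // Testing getPosition()/syncSeen() *before* next() can reject the
        // record that straddles a block boundary, silently dropping it.
        // Reading first and then deciding keeps that boundary record.
        return reader != null
                && reader.next(key, value)
                && (reader.getPosition() < end || !reader.syncSeen());
    }
}
{code}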