[jira] Commented: (PIG-1101) Pig parser does not recognize its own data type in LIMIT statement

2009-11-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780993#action_12780993
 ] 

Ashutosh Chauhan commented on PIG-1101:
---

The test failure seems to be a Hudson quirk; the same test passes on my local machine.
This patch is ready for review.

 Pig parser does not recognize its own data type in LIMIT statement
 --

 Key: PIG-1101
 URL: https://issues.apache.org/jira/browse/PIG-1101
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig-1101.patch


 I have a Pig script in which I specify the number of records for LIMIT as a 
 long literal. 
 {code}
 A = LOAD '/user/viraj/echo.txt' AS (txt:chararray);
 B = LIMIT A 10L;
 DUMP B;
 {code}
 I get a parser error:
 2009-11-21 02:25:51,100 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered <LONGINTEGER> 10L at line 3, column 13.
 Was expecting:
     <INTEGER> ...
     at org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:8963)
     at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:8839)
     at org.apache.pig.impl.logicalLayer.parser.QueryParser.LimitClause(QueryParser.java:1656)
     at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1280)
     at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893)
     at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:682)
     at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)
 In fact, 10L seems to work in the FOREACH ... GENERATE construct.
 Viraj
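 For reference, here is a minimal hypothetical sketch of the behavior the fix should
 allow, using PigServer in local mode (the input path and class name are
 illustrative, not part of the attached patch):
 {code}
 // Hypothetical local-mode check: LIMIT with a long literal should parse and run
 // once the parser accepts LONGINTEGER where it currently expects only INTEGER.
 import java.util.Iterator;

 import org.apache.pig.ExecType;
 import org.apache.pig.PigServer;
 import org.apache.pig.data.Tuple;

 public class LimitLongSketch {
     public static void main(String[] args) throws Exception {
         PigServer pig = new PigServer(ExecType.LOCAL);
         pig.registerQuery("A = LOAD 'echo.txt' AS (txt:chararray);");
         pig.registerQuery("B = LIMIT A 10L;");  // previously failed with ERROR 1000
         Iterator<Tuple> it = pig.openIterator("B");
         while (it.hasNext()) {
             System.out.println(it.next());
         }
     }
 }
 {code}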

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

2009-11-21 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780998#action_12780998
 ] 

Alan Gates commented on PIG-1077:
-

Patch checked into 0.6 branch.

 [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
 ---

 Key: PIG-1077
 URL: https://issues.apache.org/jira/browse/PIG-1077
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0, 0.7.0

 Attachments: patch_Pig1077


 TFile currently supports split by record sequence number (see Jira 
 HADOOP-6218). We want to utilize this to provide record(row)-based input 
 split support in Zebra.
 One prominent benefit: for very large data files we can create much more 
 fine-grained input splits than before, when we could only create one big 
 split per file.
 In more detail, the new row-based getSplits() works as follows by default 
 (i.e., when the user does not specify the number of splits to generate): 
 1) Select the biggest column group in terms of data size, split each of its 
 TFiles according to the HDFS block size (64 MB or 128 MB), and obtain a list 
 of physical byte offsets per TFile. For example, assume that for the 1st 
 TFile we get offset1, offset2, ..., offset10. 
 2) Invoke TFile.getRecordNumNear(long offset) to get the record number of a 
 key-value pair near each byte offset. For the example above, say we get 
 recordNum1, recordNum2, ..., recordNum10. 
 3) Stitch together the record ranges [0, recordNum1], [recordNum1+1, 
 recordNum2], ..., [recordNum9+1, recordNum10], [recordNum10+1, lastRecordNum] 
 across all column groups to form 11 record-based input splits for the 1st 
 TFile. 
 4) For each input split, create a TFile scanner via 
 TFile.createScannerByRecordNum(long beginRecNum, long endRecNum). A sketch of 
 this flow appears after the note below. 
 Note: the conversion from byte offset to record number is done in each 
 mapper rather than at job initialization, because the conversion incurs some 
 TFile reading overhead.
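 To make the four steps above concrete, here is a minimal hypothetical sketch of the 
 flow for one TFile of the biggest column group. Only getRecordNumNear and 
 createScannerByRecordNum come from this issue (and HADOOP-6218); the TFileReader 
 interface, RowSplit class, and block-size constant are illustrative stand-ins, not 
 the actual Zebra code. For brevity the sketch does the offset-to-record-number 
 conversion inline, whereas, per the note above, the real design defers it to each 
 mapper.
 {code}
 // Hypothetical sketch of steps 1-3; only the two TFile calls named in this
 // issue are real, everything else is an illustrative stand-in.
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;

 public class RowSplitSketch {
     static final long BLOCK_SIZE = 64L * 1024 * 1024;  // assumed HDFS block size

     /** Minimal stand-in for a TFile reader exposing the new record-number calls. */
     interface TFileReader {
         long length() throws IOException;                       // bytes in this TFile
         long getRecordNumNear(long offset) throws IOException;  // from HADOOP-6218
         long getLastRecordNum() throws IOException;             // illustrative helper
     }

     /** A record-number range [beginRecNum, endRecNum] forming one input split. */
     static class RowSplit {
         final long beginRecNum, endRecNum;
         RowSplit(long b, long e) { beginRecNum = b; endRecNum = e; }
     }

     static List<RowSplit> getSplits(TFileReader tfile) throws IOException {
         // Steps 1 + 2: for each block-sized byte offset, find the nearest record number.
         List<Long> recordNums = new ArrayList<Long>();
         for (long off = BLOCK_SIZE; off < tfile.length(); off += BLOCK_SIZE) {
             recordNums.add(tfile.getRecordNumNear(off));
         }
         // Step 3: stitch [0, r1], [r1+1, r2], ..., [rN+1, lastRecordNum].
         List<RowSplit> splits = new ArrayList<RowSplit>();
         long begin = 0;
         for (long r : recordNums) {
             splits.add(new RowSplit(begin, r));
             begin = r + 1;
         }
         splits.add(new RowSplit(begin, tfile.getLastRecordNum()));
         return splits;
         // Step 4 (in each mapper): createScannerByRecordNum(split.beginRecNum,
         // split.endRecNum) scans exactly that record range in every column group.
     }
 }
 {code}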

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat

2009-11-21 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1077:


Fix Version/s: 0.6.0

 [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
 ---

 Key: PIG-1077
 URL: https://issues.apache.org/jira/browse/PIG-1077
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.6.0, 0.7.0

 Attachments: patch_Pig1077



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Is Pig dropping records?

2009-11-21 Thread Dmitriy Ryaboy
Sam,
Can you post your changes to a Jira?
-D

On Fri, Nov 20, 2009 at 1:28 PM, Sam Rash s...@ning.com wrote:
 Hi,

 This reminds me of something else, though: I took the latest patch for
 PIG-911 (the sequence file reader) and found that it skipped records.

 https://issues.apache.org/jira/browse/PIG-911

 What I found is that the condition in getNext() would miss records:

 if (reader != null && (reader.getPosition() < end || !reader.syncSeen())
     && reader.next(key, value)) {
 ...
 }

 I had to change it to:

 if (reader != null && reader.next(key, value)
     && (reader.getPosition() < end || !reader.syncSeen())) {
 ...
 }
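
 For context, here is a minimal hypothetical sketch (not the actual PIG-911
 patch) of how that corrected ordering sits inside a SequenceFile-backed
 getNext(); the surrounding class and fields are illustrative:

 // Hypothetical sketch: read a record first, then apply the end-of-split /
 // sync check, so a record read near a block boundary is not silently dropped.
 import java.io.IOException;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.Writable;
 import org.apache.pig.data.Tuple;
 import org.apache.pig.data.TupleFactory;

 public class SequenceFileLoaderSketch {
     private SequenceFile.Reader reader;  // opened elsewhere for this split
     private long end;                    // byte offset where this split ends
     private Writable key;                // reusable key/value instances
     private Writable value;

     public Tuple getNext() throws IOException {
         if (reader != null
                 && reader.next(key, value)                               // read first
                 && (reader.getPosition() < end || !reader.syncSeen())) { // then check
             Tuple t = TupleFactory.getInstance().newTuple(2);
             t.set(0, key.toString());
             t.set(1, value.toString());
             return t;
         }
         return null;  // no reader, or we have read past the end of this split
     }
 }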

 (I also ended up breaking this out into reading the key and then fetching the
 value separately, to support reading types other than Writable.)

 This only happened when the files Pig read were more than one block; i.e.,
 the records dropped were around block boundaries.

 Has anyone noticed this?

 thx,
 -sr

 Sam Rash
 s...@ning.com



 On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:

 Zaki,
 Glad to hear it wasn't Pig's fault!
 Can you post a description of what was going on with S3, or at least
 how you fixed it?

 -D

 On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman zaki.raha...@gmail.com
 wrote:
   Okay, I fixed a problem with corrupted file transfers from S3... now wc -l
   produces the same 143710 records, so yeah, it's not a Pig problem, and now
   I am getting the correct result from both methods. Not sure what went
   wrong... thanks for the help though, guys.
 
  On Thu, Nov 19, 2009 at 2:48 PM, Thejas Nair te...@yahoo-inc.com
  wrote:
 
  Another thing to verify is that clickurl's position in the schema is
  correct.
  -Thejas
 
 
 
  On 11/19/09 11:43 AM, Ashutosh Chauhan ashutosh.chau...@gmail.com
  wrote:
 
    Hmm... Are you sure that your records are separated by \n (newline)
    and fields by \t (tab)? If so, would it be possible for you to upload your
    dataset (possibly a smaller one) somewhere so that someone can take a look
    at it?
  
   Ashutosh
  
   On Thu, Nov 19, 2009 at 14:35, zaki rahaman zaki.raha...@gmail.com
  wrote:
   On Thu, Nov 19, 2009 at 2:24 PM, Ashutosh Chauhan 
   ashutosh.chau...@gmail.com wrote:
  
   Hi Zaki,
  
   Just to narrow down the problem, can you do:
  
   A = LOAD 's3n://bucket/*week.46*clickLog.2009*';
   dump A;
  
  
   This produced 143710 records;
  
  
   and
  
   A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS (
   timestamp:chararray,
   ip:chararray,
   userid:chararray,
   dist:chararray,
   clickid:chararray,
   usra:chararray,
   campaign:chararray,
   clickurl:chararray,
   plugin:chararray,
   tab:chararray,
   feature:chararray);
   dump A;
  
  
  
   This produced 143710 records (so no problem there);
  
  
   and
  
   cut -f8 *week.46*clickLog.2009* | wc -l
  
  
  
   This produced...
   175572
  
   Clearly, something is wrong...
  
  
   Thanks,
   Ashutosh
  
   On Thu, Nov 19, 2009 at 14:03, zaki rahaman
   zaki.raha...@gmail.com
   wrote:
   Hi All,
  
    I have the following mini-script running as part of a larger set of
    scripts/workflow... however, it seems like Pig is dropping records:
    when I tried running the same thing as a simple grep | wc -l I get a
    completely different result (2500 with Pig vs. 3300). The Pig script is as
    follows:
  
   A = LOAD 's3n://bucket/*week.46*clickLog.2009*' AS
   (timestamp:chararray,
   ip:chararray,
   userid:chararray,
   dist:chararray,
   clickid:chararray,
   usra:chararray,
   campaign:chararray,
   clickurl:chararray,
   plugin:chararray,
   tab:chararray,
   feature:chararray);
  
    B = FILTER A BY clickurl matches '.*http://www.amazon.*';
  
    dump B produces the following output:
    2009-11-19 18:50:46,013 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: s3://kikin-pig-test/amazonoutput2
    2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 2502
    2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 0
    2009-11-19 18:50:46,058 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
  
   The bash command is simply cut -f8 *week.46*clickLog.2009* | fgrep
   http://www.amazon | wc -l
  
    Both sets of inputs are the same files... and I'm not sure where the
    discrepancy is coming from. Any help would be greatly appreciated.
  
   --
   Zaki Rahaman
  
  
  
  
  
  --
  Zaki Rahaman