[jira] [Updated] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders

2013-09-25 Thread Jerry Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Chen updated PIG-3404:


Attachment: PIG-3404.patch

Patch for reference

 Improve Pig to ignore bad files or inaccessible files or folders
 

 Key: PIG-3404
 URL: https://issues.apache.org/jira/browse/PIG-3404
 Project: Pig
  Issue Type: New Feature
  Components: data
Affects Versions: 0.11.2
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: PIG-3404.patch


 There are use cases in Pig:
 * A directory is used as the input of a load operation. One or more files in
 that directory may be bad (for example, corrupted, or containing bad data
 caused by compression).
 * A directory is used as the input of a load operation. The current user may
 not have permission to access some subdirectories or files of that directory.
 The current Pig implementation aborts the whole job in such cases. It would be
 useful to have an option that lets the job continue, ignoring the bad files or
 inaccessible files/folders instead of aborting, and ideally logging or printing
 a warning for each such error or violation. This requirement is not trivial:
 for the big data sets of large analytics applications it is not always possible
 to sort out the good data beforehand, and ignoring a few bad files may be the
 better choice in such situations.
 We propose an “ignore bad files” flag to address this problem. AvroStorage and
 the related file formats in Pig already have such a flag, but it does not cover
 all the cases mentioned above. We would improve PigStorage and the related text
 formats to support this new flag, and improve AvroStorage and related
 facilities to support the concept completely. The flag is per storage (for
 example, PigStorage or AvroStorage) and can be set separately for each load
 operation. The value of this flag is false if it is not explicitly set.
 Ideally, we can also provide a global Pig parameter that forces the default
 value to true for all load functions even if it is not explicitly set in the
 LOAD statement.
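 For illustration only, a minimal Java sketch of how such a flag might be used
 through PigServer. The property name pig.load.ignore.bad.files and the
 PigStorage option '-ignoreBadFiles' below are placeholders for the proposed
 behavior, not an existing API:
{code}
import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class IgnoreBadFilesSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical global parameter: make every load function default to
        // ignoring bad or unreadable input instead of aborting the job.
        props.setProperty("pig.load.ignore.bad.files", "true");

        PigServer pig = new PigServer(ExecType.LOCAL, props);

        // Hypothetical per-load flag, passed as an extra PigStorage option so
        // only this LOAD statement skips bad files and inaccessible folders.
        pig.registerQuery("A = LOAD 'input' USING PigStorage('\\t', '-ignoreBadFiles') "
                + "AS (id:chararray, value:long);");
        pig.registerQuery("B = GROUP A ALL;");
        pig.dumpSchema("B");
    }
}
{code}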

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders

2013-09-25 Thread Jerry Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777186#comment-13777186
 ] 

Jerry Chen commented on PIG-3404:
-

Hi Park, sorry for the late response; I am glad that we can discuss this topic
further in this JIRA.

As mentioned in the JIRA description, we are taking the approach of an “ignore
bad files” flag for each storage, so different storages can be controlled
separately rather than through one global flag. In our use cases we also need to
handle the situation where the current user does not have permission to access
some subdirectories of the input directory, which can conceptually be viewed as
a “bad directory”.

Another point is the ignore ratio. We currently take an even simpler approach of
“ignore all” or “ignore nothing”, controlled by a flag. As you mentioned,
PIG-3059 uses a threshold to control how many bad input splits can be ignored,
which is a good thing. The open question is: in how many real cases do we
actually need a ratio other than 0 or 1?

I went through the patch in PIG-3059 and was trying to understand how the ratio
is controlled globally in a distributed MapReduce task environment. It seems
that InputErrorTracker.java uses a local variable (numErrors) for error
tracking. I may be missing something, so it would be very helpful if you could
explain.
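
To make the per-task question concrete, here is a rough Java sketch of purely
local error tracking (an illustration only, not the PIG-3059 patch): each task
keeps its own counters, so any threshold is enforced within a single task rather
than across the whole job.
{code}
// Illustrative per-task error tracker; names and thresholds are made up.
public class PerTaskErrorTracker {
    private final long minErrors;      // tolerate at least this many errors before failing
    private final double maxErrorRate; // allowed fraction of bad records in this task
    private long numRecords;
    private long numErrors;

    public PerTaskErrorTracker(long minErrors, double maxErrorRate) {
        this.minErrors = minErrors;
        this.maxErrorRate = maxErrorRate;
    }

    public void incRecords() {
        numRecords++;
    }

    // Record one bad record/split; fail only when this task's local budget is exhausted.
    public void incErrors(Throwable cause) {
        numErrors++;
        double errRate = (double) numErrors / (double) Math.max(numRecords, 1);
        if (numErrors >= minErrors && errRate > maxErrorRate) {
            throw new RuntimeException("too many local errors: " + numErrors
                    + " out of " + numRecords + " records", cause);
        }
        // Otherwise log a warning and keep going.
    }
}
{code}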

Thank you for providing the helpful information, and let's continue the
discussion.


 Improve Pig to ignore bad files or inaccessible files or folders
 

 Key: PIG-3404
 URL: https://issues.apache.org/jira/browse/PIG-3404
 Project: Pig
  Issue Type: New Feature
  Components: data
Affects Versions: 0.11.2
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: PIG-3404.patch



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-3481) Unable to check name message

2013-09-25 Thread zhenyingshan (JIRA)
zhenyingshan created PIG-3481:
-

 Summary: Unable to check name message
 Key: PIG-3481
 URL: https://issues.apache.org/jira/browse/PIG-3481
 Project: Pig
  Issue Type: Bug
Reporter: zhenyingshan


I am trying to run a Pig script from a Java program, and I get the following
error sometimes, but not all the time.  Here is a snippet of the program and the
exception I got.  I have the /user/root directory created in HDFS.

--

URL path = getClass().getClassLoader().getResource("cfg/concatall.py");

LOG.info("CDNResolve2Hbase: reading concatall.py file from " + path.toString());

pigServer.getPigContext().getProperties().setProperty(PigContext.JOB_NAME,
        "CDNResolve2Hbase");
pigServer.registerQuery("A = load '" + inputPath + "' using "
        + "PigStorage('\t') as (ip:chararray, do:chararray, cn:chararray, cdn:chararray, "
        + "firsttime:chararray, updatetime:chararray);");
pigServer.registerCode(path.toString(), "jython", "myfunc");
pigServer.registerQuery("B = foreach A generate "
        + "myfunc.concatall('" + extractTimestamp(inputPath) + "',ip,do,cn), cdn, "
        + "SUBSTRING(firsttime,0,8);");
outputTable = "hbase://" + outputTable;
ExecJob job = pigServer.store("B", outputTable,
        "org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:cdn d:dtime')");


-
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
parsing. Unable to check name hdfs://DC-001:9000/user/root
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
at org.apache.pig.PigServer.registerQuery(PigServer.java:529)
at com.hugedata.cdnserver.datanalysis.CDNResolve2Hbase.execute(Unknown 
Source)
at com.hugedata.cdnserver.DatAnalysis.cdnResolve2Hbase(Unknown Source)
at com.hugedata.cdnserver.task.HandleDomainNameLogTask.execute(Unknown 
Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:273)
at 
org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:264)
at 
org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
at org.quartz.core.JobRunShell.run(JobRunShell.java:203)
at 
org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:520)
Caused by: Failed to parse: Pig script failed to parse: 
line 6, column 4 pig script failed to validate: 
org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to 
check name hdfs://DC-001:9000/user/root
at 
org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
... 15 more
Caused by: 
line 6, column 4 pig script failed to validate: 
org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to 
check name hdfs://DC-001:9000/user/root
at 
org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:835)
at 
org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3236)
at 
org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1315)
at 
org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:799)
at 
org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:517)
at 
org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:392)
at 
org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
... 16 more
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: 
Unable to check name hdfs://DC-001:9000/user/root
at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:207)
at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:128)
at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:138)
at 
org.apache.pig.parser.QueryParserUtils.getCurrentDir(QueryParserUtils.java:91)
at 
org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:827)
... 22 more
Caused by: java.io.IOException: Filesystem closed
 

[jira] [Updated] (PIG-3481) Unable to check name message

2013-09-25 Thread zhenyingshan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenyingshan updated PIG-3481:
--

Affects Version/s: 0.11.1

 Unable to check name message
 --

 Key: PIG-3481
 URL: https://issues.apache.org/jira/browse/PIG-3481
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: zhenyingshan


[jira] [Updated] (PIG-3481) Unable to check name message

2013-09-25 Thread zhenyingshan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenyingshan updated PIG-3481:
--

Environment: hadoop 1.0.3, hbase 0.94.1

 Unable to check name message
 --

 Key: PIG-3481
 URL: https://issues.apache.org/jira/browse/PIG-3481
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
 Environment: hadoop 1.0.3, hbase 0.94.1
Reporter: zhenyingshan


[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-09-25 Thread Brian ONeill (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777343#comment-13777343
 ] 

Brian ONeill commented on PIG-3453:
---

First question: for DISTINCT within Storm, do you believe we should have a
sliding time window within which we perform the distinct?  There is mention of
the fact that it will be stateful (since we need to keep a set in memory with
which to de-dupe).  Do we intend to leverage the concept of Trident State for
this? (That may make sense: implement State, then perform the de-duping on each
commit/flush.)

Thoughts?

 Implement a Storm backend to Pig
 

 Key: PIG-3453
 URL: https://issues.apache.org/jira/browse/PIG-3453
 Project: Pig
  Issue Type: New Feature
Reporter: Pradeep Gollakota
  Labels: storm

 There is a lot of interest around implementing a Storm backend to Pig for 
 streaming processing. The proposal and initial discussions can be found at 
 https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-09-25 Thread Brian ONeill (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777355#comment-13777355
 ] 

Brian ONeill commented on PIG-3453:
---

Also, we could perform DISTINCT using a backend storage mechanism (like
Cassandra): we first check storage to see whether the tuple exists, and emit it
only if it does not.  If we route all identical tuples to a single bolt and do
the check there, that may work (eliminating the potential for two bolts to check
for existence at the same time).  Using backend storage would allow someone to
perform a true DISTINCT operation.
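
As a plain-Java illustration of that idea (not Storm or Trident API; the
StateStore interface below is a stand-in for Cassandra or any other backend):
identical tuples are routed to one place, which emits a key only if the store
has not seen it before.
{code}
import java.util.HashSet;
import java.util.Set;

public class DistinctFilter {
    // Minimal stand-in for a pluggable backend store (Cassandra, memcached, ...).
    interface StateStore {
        boolean putIfAbsent(String key); // true if the key was not stored before
    }

    static class InMemoryStore implements StateStore {
        private final Set<String> seen = new HashSet<String>();
        public boolean putIfAbsent(String key) {
            return seen.add(key);
        }
    }

    private final StateStore store;

    DistinctFilter(StateStore store) {
        this.store = store;
    }

    // Called once per incoming tuple key; returns true if the tuple should be emitted.
    boolean shouldEmit(String tupleKey) {
        return store.putIfAbsent(tupleKey);
    }

    public static void main(String[] args) {
        DistinctFilter distinct = new DistinctFilter(new InMemoryStore());
        for (String key : new String[] {"a", "b", "a", "c", "b"}) {
            if (distinct.shouldEmit(key)) {
                System.out.println("emit " + key);
            }
        }
    }
}
{code}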

 Implement a Storm backend to Pig
 

 Key: PIG-3453
 URL: https://issues.apache.org/jira/browse/PIG-3453
 Project: Pig
  Issue Type: New Feature
Reporter: Pradeep Gollakota
  Labels: storm

 There is a lot of interest around implementing a Storm backend to Pig for 
 streaming processing. The proposal and initial discussions can be found at 
 https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (PIG-3481) Unable to check name message

2013-09-25 Thread zhenyingshan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhenyingshan resolved PIG-3481.
---

   Resolution: Fixed
Fix Version/s: 0.11.1

It turns out that my static PigServer is not thread-safe; after switching to a
ThreadLocal so that each thread gets its own PigServer instance, the problem no
longer appears.

--
private static ThreadLocal<PigServer> pigServer = new ThreadLocal<PigServer>();

public static PigServer getServer() {
    if (pigServer.get() == null) {
        try {
            printClassPath();
            Properties prop = SystemUtils.getCfg();
            pigServer.set(new PigServer(ExecType.MAPREDUCE, prop));
            return pigServer.get();
        } catch (Exception e) {
            LOG.error("error in starting PigServer:", e);
            return null;
        }
    }
    return pigServer.get();
}


 Unable to check name message
 --

 Key: PIG-3481
 URL: https://issues.apache.org/jira/browse/PIG-3481
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
 Environment: hadoop 1.0.3, hbase 0.94.1
Reporter: zhenyingshan
 Fix For: 0.11.1



[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-09-25 Thread Jacob Perkins (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777408#comment-13777408
 ] 

Jacob Perkins commented on PIG-3453:


[~boneill], I haven't thought too hard about distinct yet myself. Since I'm
really only thinking about Trident and not Storm in general, doing a distinct
strictly within a batch is one straightforward option. Unfortunately, from a
user standpoint, I think this would be (a) minimally useful and (b) confusing.
Instead we could implement something like an approximate distinct using an LRU
cache? Maybe even go so far as to implement an SQF (which I haven't read in its
entirety yet): http://www.vldb.org/pvldb/vol6/p589-dutta.pdf?
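
As a rough plain-Java sketch of the LRU idea (illustration only, not a patch;
the capacity and key type are arbitrary): duplicates are suppressed only while
the key is still in the bounded cache, so the distinct is approximate.
{code}
import java.util.LinkedHashMap;
import java.util.Map;

public class LruApproxDistinct {
    private final Map<String, Boolean> cache;

    public LruApproxDistinct(final int capacity) {
        // Access-ordered LinkedHashMap that evicts the least recently used key.
        this.cache = new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time a key is seen while it is not (or no longer) cached.
    public boolean firstSighting(String key) {
        return cache.put(key, Boolean.TRUE) == null;
    }
}
{code}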

Also, what about order by? In what sense is an unbounded stream ordered?

I absolutely do not want to tie the storm/trident execution engine to an 
external data store such as cassandra. Pig is supposed to be backend agnostic. 
Maybe the -default- tap and sink can be Kafka (tap) and Cassandra (sink). 
Finally, it should be possible to run a pig script in storm local mode.

And [~pradeepg26], I'm actually well on the way to having nested foreach
working. The way I'm working it now, each LogicalExpressionPlan becomes its own
Trident BaseFunction. It actually works quite nicely for now. I haven't gotten
to aggregates yet. What I probably won't implement for the POC is the tap and
sink.

 Implement a Storm backend to Pig
 

 Key: PIG-3453
 URL: https://issues.apache.org/jira/browse/PIG-3453
 Project: Pig
  Issue Type: New Feature
Reporter: Pradeep Gollakota
  Labels: storm

 There is a lot of interest around implementing a Storm backend to Pig for 
 streaming processing. The proposal and initial discussions can be found at 
 https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig

2013-09-25 Thread Brian ONeill (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777421#comment-13777421
 ] 

Brian ONeill commented on PIG-3453:
---

[~thedatachef] Good points/suggestions.  I'll have a look at both LRU and SQF.

RE: Cassandra
Sorry, I didn't mean to imply we would create a hard dependency.  I meant we 
could leverage the Trident State abstraction.  (My team happens to own the 
storm-cassandra Trident State implementation 
(https://github.com/hmsonline/storm-cassandra))  We would query the State to 
see if the tuple was processed.  You could just as easily plug in any 
persistence mechanism.  (e.g. https://github.com/nathanmarz/trident-memcached)  

 Implement a Storm backend to Pig
 

 Key: PIG-3453
 URL: https://issues.apache.org/jira/browse/PIG-3453
 Project: Pig
  Issue Type: New Feature
Reporter: Pradeep Gollakota
  Labels: storm

 There is a lot of interest around implementing a Storm backend to Pig for 
 streaming processing. The proposal and initial discussions can be found at 
 https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3458) ScalarExpression lost with multiquery optimization

2013-09-25 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3458:
--

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to branch-0.12 and trunk.  
Thanks Mark and Daniel for your feedback!

 ScalarExpression lost with multiquery optimization
 --

 Key: PIG-3458
 URL: https://issues.apache.org/jira/browse/PIG-3458
 Project: Pig
  Issue Type: Bug
Reporter: Koji Noguchi
Assignee: Koji Noguchi
 Fix For: 0.12.0

 Attachments: pig-3458-v01.patch, pig-3458-v02.patch


 Our user reported an issue where their scalar result goes missing when there
 are two store statements.
 {noformat}
 A = load 'test1.txt' using PigStorage('\t') as (a:chararray, count:long);
 B = group A all;
 C = foreach B generate SUM(A.count) as total ;
 store C into 'deleteme6_C' using PigStorage(',');
 Z = load 'test2.txt' using PigStorage('\t') as (a:chararray, id:chararray );
 Y = group Z by id;
 X = foreach Y generate group, C.total;
 store X into 'deleteme6_X' using PigStorage(',');
 Inputs
  pig cat test1.txt
 a   1
 b   2
 c   8
 d   9
  pig cat test2.txt
 a   z
 b   y
 c   x
  pig
 {noformat}
 Result X should contain the total count of '20' but instead it's empty.
 {noformat}
  pig cat deleteme6_C/part-r-0
 20
  pig cat deleteme6_X/part-r-0
 x,
 y,
 z,
  pig
 {noformat}
 This works if we take out the first "store C" statement.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3458) ScalarExpression lost with multiquery optimization

2013-09-25 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3458:
--

Component/s: parser

 ScalarExpression lost with multiquery optimization
 --

 Key: PIG-3458
 URL: https://issues.apache.org/jira/browse/PIG-3458
 Project: Pig
  Issue Type: Bug
  Components: parser
Reporter: Koji Noguchi
Assignee: Koji Noguchi
 Fix For: 0.12.0

 Attachments: pig-3458-v01.patch, pig-3458-v02.patch



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-19) A=load causes parse error

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-19:
--

Fix Version/s: (was: 0.12.0)
   0.13.0

 A=load causes parse error
 -

 Key: PIG-19
 URL: https://issues.apache.org/jira/browse/PIG-19
 Project: Pig
  Issue Type: Bug
  Components: grunt
Reporter: Olga Natkovich
Assignee: Gianmarco De Francisci Morales
Priority: Minor
 Fix For: 0.13.0


 Parser expects spaces around =. This should be a minor change in 
 src/org/apache/pig/tools/grunt/GruntParser.jj

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-1151) Date Conversion + Arithmetic UDFs

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1151:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Date Conversion + Arithmetic UDFs
 -

 Key: PIG-1151
 URL: https://issues.apache.org/jira/browse/PIG-1151
 Project: Pig
  Issue Type: New Feature
Reporter: sam rash
Priority: Minor
  Labels: patch
 Fix For: 0.13.0

 Attachments: patch_dateudf.tar.gz


 I would like to offer up some very simple date UDFs I have (for piggybank) that
 wrap Joda-Time (Apache 2.0 license, http://joda-time.sourceforge.net/license.html)
 and operate on ISO8601 date strings. Please advise if these are appropriate.
 1. Date arithmetic
 Takes an input string:
 2009-01-01T13:43:33.000Z
 (and partial ones such as 2009-01-02)
 and a timespan (as millis or as string shorthand), and returns an ISO8601
 string that adjusts the input date by the specified timespan.
 DatePlus(long timeMs); // + or - number works, is the # of millis
 DatePlus(String timespan); // 10m = 10 minutes, 1h = 1 hour, 1172 ms, etc.
 DateMinus(String timespan); // propose an explicit minus when using string
 shorthand for time periods
 2. Date comparison (when you don't have full strings that you can compare as
 strings):
 DateIsBefore(String dateString); // true if lhs is before rhs
 DateIsAfter(String dateString); // true if lhs is after rhs
 3. Date truncation functions:
 Take partial ISO8601 strings and truncate to:
 toMinute(String dateString);
 toHour(String dateString);
 toDay(String dateString);
 toWeek(String dateString);
 toMonth(String dateString);
 toYear(String dateString);
 If any/all are helpful, I'm happy to contribute them to Pig.
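 As a sketch of what one of these could look like (not the attached patch; the
 two-argument call signature is an assumption), DatePlus written as a Pig
 EvalFunc wrapping Joda-Time:
{code}
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.joda.time.DateTime;

// Sketch: DatePlus(dateString, deltaMillis) returns the shifted date as ISO8601.
public class DatePlus extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2
                || input.get(0) == null || input.get(1) == null) {
            return null;
        }
        String isoDate = (String) input.get(0);              // e.g. "2009-01-01T13:43:33.000Z"
        long deltaMs = ((Number) input.get(1)).longValue();  // may be negative
        return new DateTime(isoDate).plus(deltaMs).toString(); // ISO8601 output
    }
}
{code}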

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-1967) deprecate current syntax for casting relation as scalar, to use explicit cast to tuple

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1967:


Fix Version/s: (was: 0.12.0)
   0.13.0

 deprecate current syntax for casting relation as scalar, to use explicit cast 
 to tuple
 --

 Key: PIG-1967
 URL: https://issues.apache.org/jira/browse/PIG-1967
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0, 0.9.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.13.0

 Attachments: PIG-1967.0.patch


 When the feature was added in PIG-1434, there was a proposal to require a cast
 to tuple in order to use a relation as a scalar. But for some reason, the
 implementation did not require this cast.
 See -
 https://issues.apache.org/jira/browse/PIG-1434?focusedCommentId=12888449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12888449
 The current syntax, which does not need this cast, seems to lead to a lot of
 confusion among users, who end up using this feature unintentionally. This
 usually happens because the user refers to the bag column(s) in the output of
 a (co)group using a wrong name, which happens to be another relation. Often,
 users realize the mistake only at runtime. New users will have trouble figuring
 out what went wrong.
 I think we should support the use of the cast as originally proposed, and
 deprecate the current syntax. The warning issued when the deprecated syntax
 is used is likely to help users realize that they have unintentionally used
 this feature.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-1919) order-by on bag gives error only at runtime

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1919:


Fix Version/s: (was: 0.12.0)
   0.13.0

 order-by on bag gives error only at runtime
 ---

 Key: PIG-1919
 URL: https://issues.apache.org/jira/browse/PIG-1919
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0, 0.9.0
Reporter: Thejas M Nair
Assignee: Jonathan Coveney
 Fix For: 0.13.0

 Attachments: PIG-1919-0.patch, PIG-1919-1.patch, PIG-1919-1.patch


 Order-by on a bag or tuple should give an error at query compile time, instead
 of at runtime.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2409) Pig show wrong tracking URL for hadoop 2

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2409:


Summary: Pig show wrong tracking URL for hadoop 2  (was: Tracking URL for 
hadoop 23 does not show up)

 Pig show wrong tracking URL for hadoop 2
 

 Key: PIG-2409
 URL: https://issues.apache.org/jira/browse/PIG-2409
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.2, 0.10.0, 0.11
Reporter: Daniel Dai
Assignee: Daniel Dai
Priority: Minor
  Labels: hadoop023
 Fix For: 0.12.0


 Pig used to show a tracking url for hadoop job:
 More information at: 
 http://localhost:50030/jobdetails.jsp?jobid=job_201112071119_0001
 This information does not show up in hadoop 23.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2122) Parameter Substitution doesn't work in the Grunt shell

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2122:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Parameter Substitution doesn't work in the Grunt shell
 --

 Key: PIG-2122
 URL: https://issues.apache.org/jira/browse/PIG-2122
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0, 0.8.1, 0.12.0
Reporter: Grant Ingersoll
Assignee: Daniel Dai
Priority: Minor
 Fix For: 0.13.0


 Simple param substitution and things like %declare (as copied out of the
 docs) don't work in the grunt shell.
 # Start Pig with: bin/pig -x local -p time=FOO
 {quote}
 foo = LOAD '/user/grant/foo.txt' AS (a:chararray, b:chararray, c:chararray);
 Y = foreach foo generate *, '$time';
 dump Y;
 {quote}
 Output:
 {quote}
 2011-06-13 20:22:24,197 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input 
 paths to process : 1
 (1 2 3,,,$time)
 (4 5 6,,,$time)
 {quote}
 Same script, stored in junk.pig, run as: bin/pig -x local -p time=FOO junk.pig
 {quote}
 2011-06-13 20:23:38,864 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input 
 paths to process : 1
 (1 2 3,,,FOO)
 (4 5 6,,,FOO)
 {quote}
 Also, things like %default don't work (nor does %declare):
 {quote}
 grunt> %default DATE '20090101';
 2011-06-13 20:18:19,943 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Encountered  PATH %default  at line 1, 
 column 1.
 Was expecting one of:
 EOF 
 cat ...
 fs ...
 sh ...
 cd ...
 cp ...
 copyFromLocal ...
 copyToLocal ...
 dump ...
 describe ...
 aliases ...
 explain ...
 help ...
 kill ...
 ls ...
 mv ...
 mkdir ...
 pwd ...
 quit ...
 register ...
 rm ...
 rmf ...
 set ...
 illustrate ...
 run ...
 exec ...
 scriptDone ...
  ...
 EOL ...
 ; ...
 
 Details at logfile: 
 /Users/grant.ingersoll/projects/apache/pig/release-0.8.1/pig_1308002917912.log
 {quote}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2247) Pig parser does not detect multiple arguments with the same name passed to macro

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2247:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Pig parser does not detect multiple arguments with the same name passed to 
 macro
 

 Key: PIG-2247
 URL: https://issues.apache.org/jira/browse/PIG-2247
 Project: Pig
  Issue Type: Bug
  Components: parser
Affects Versions: 0.9.0
Reporter: Alan Gates
Assignee: Johnny Zhang
Priority: Minor
 Fix For: 0.13.0

 Attachments: PIG-2247.patch.txt


 Pig accepts a macro like
 {code}
 define simple_macro(in_relation, min_gpa, min_gpa) returns c {
   b = filter $in_relation by gpa >= $min_gpa;
   $c = foreach b generate age, name;
 };
 {code}
 This should produce an error.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2409) Pig show wrong tracking URL for hadoop 2

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2409:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Pig show wrong tracking URL for hadoop 2
 

 Key: PIG-2409
 URL: https://issues.apache.org/jira/browse/PIG-2409
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.2, 0.10.0, 0.11
Reporter: Daniel Dai
Assignee: Daniel Dai
Priority: Minor
  Labels: hadoop023
 Fix For: 0.13.0


 Pig used to show a tracking url for hadoop job:
 More information at: 
 http://localhost:50030/jobdetails.jsp?jobid=job_201112071119_0001
 This information does not show up in hadoop 23.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2409) Pig show wrong tracking URL for hadoop 2

2013-09-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777891#comment-13777891
 ] 

Daniel Dai commented on PIG-2409:
-

Hadoop 2 shows the right tracking URL now. However, Pig still prints a redundant
message that contains a wrong URL. We need to remove it in Pig on Hadoop 2.

 Pig show wrong tracking URL for hadoop 2
 

 Key: PIG-2409
 URL: https://issues.apache.org/jira/browse/PIG-2409
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.2, 0.10.0, 0.11
Reporter: Daniel Dai
Assignee: Daniel Dai
Priority: Minor
  Labels: hadoop023
 Fix For: 0.12.0


 Pig used to show a tracking url for hadoop job:
 More information at: 
 http://localhost:50030/jobdetails.jsp?jobid=job_201112071119_0001
 This information does not show up in hadoop 23.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2495) Using merge JOIN from a HBaseStorage produces an error

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2495:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Using merge JOIN from a HBaseStorage produces an error
 --

 Key: PIG-2495
 URL: https://issues.apache.org/jira/browse/PIG-2495
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.1, 0.9.2
 Environment: HBase 0.90.3, Hadoop 0.20-append
Reporter: Kevin Lion
Assignee: Kevin Lion
 Fix For: 0.13.0

 Attachments: PIG-2495.patch


 To increase the performance of my computation, I would like to use a merge
 join between the two tables, but it produces an error.
 Here is the script:
 {noformat}
 start_sessions = LOAD 'hbase://startSession.bea00.dev.ubithere.com' USING 
 org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:infoid meta:imei 
 meta:timestamp', '-loadKey') AS (sid:chararray, infoid:chararray, 
 imei:chararray, start:long);
 end_sessions = LOAD 'hbase://endSession.bea00.dev.ubithere.com' USING 
 org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:timestamp meta:locid', 
 '-loadKey') AS (sid:chararray, end:long, locid:chararray);
 sessions = JOIN start_sessions BY sid, end_sessions BY sid USING 'merge';
 STORE sessions INTO 'sessionsTest' USING PigStorage ('*');
 {noformat} 
 Here is the result of this script :
 {noformat}
 2012-01-30 16:12:43,920 [main] INFO  org.apache.pig.Main - Logging error 
 messages to: /root/pig_1327939963919.log
 2012-01-30 16:12:44,025 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://lxc233:9000
 2012-01-30 16:12:44,102 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: lxc233:9001
 2012-01-30 16:12:44,760 [main] INFO  
 org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: 
 MERGE_JION
 2012-01-30 16:12:44,923 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - 
 File concatenation threshold: 100 optimistic? false
 2012-01-30 16:12:44,982 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size before optimization: 2
 2012-01-30 16:12:44,982 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
  - MR plan size after optimization: 2
 2012-01-30 16:12:45,001 [main] INFO  
 org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to 
 the job
 2012-01-30 16:12:45,006 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
  - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
 2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client 
 environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
 2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client 
 environment:host.name=lxc233.machine.com
 2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client 
 environment:java.version=1.6.0_22
 2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client 
 environment:java.vendor=Sun Microsystems Inc.
 2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client 
 environment:java.home=/usr/lib/jvm/java-6-sun-1.6.0.22/jre
 2012-01-30 16:12:45,039 [main] INFO  org.apache.zookeeper.ZooKeeper - Client 
 

[jira] [Updated] (PIG-3021) Split results missing records when there is null values in the column comparison

2013-09-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3021:
---

Fix Version/s: (was: 0.12.0)
   0.13.0

Moving it to 0.13 since the 0.12 branch is already cut.

 Split results missing records when there is null values in the column 
 comparison
 

 Key: PIG-3021
 URL: https://issues.apache.org/jira/browse/PIG-3021
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.0
Reporter: Chang Luo
Assignee: Cheolsoo Park
 Fix For: 0.13.0

 Attachments: PIG-3021-2.patch, PIG-3021-3.patch, PIG-3021.patch


 Suppose a(x, y)
 split a into b if x==y, c otherwise;
 One would expect the union of b and c to be a.  However, if x or y is null,
 the record won't appear in either b or c.
 To work around this, I have to change it to the following:
 split a into b if x is not null and y is not null and x==y, c otherwise;

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2461) Simplify schema syntax for cast

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2461:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Simplify schema syntax for cast
 ---

 Key: PIG-2461
 URL: https://issues.apache.org/jira/browse/PIG-2461
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.10.0
Reporter: Daniel Dai
 Fix For: 0.13.0


 Cast into a bag/tuple syntax is confusing:
 {code}
 b = foreach a generate (bag{tuple(int,double)})bag0;
 {code}
 It's pretty hard for users to get right. We should make the keywords
 bag/tuple optional.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs

2013-09-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3370:
---

Fix Version/s: (was: 0.12.0)
   0.13.0

Moving it to 0.13 for now.

 Add New Reserved Keywords To The Pig Docs
 -

 Key: PIG-3370
 URL: https://issues.apache.org/jira/browse/PIG-3370
 Project: Pig
  Issue Type: Task
  Components: documentation, parser
Reporter: Sergey Goder
Assignee: Cheolsoo Park
Priority: Trivial
 Fix For: 0.13.0


 The following are reserved keywords in Pig that are not included in the 0.11.1
 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords):
 cube, dense, rank, returns, rollup, void
 Please add any that I may have overlooked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2446) Fix map input bytes for hadoop 20.203+

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2446:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Fix map input bytes for hadoop 20.203+
 --

 Key: PIG-2446
 URL: https://issues.apache.org/jira/browse/PIG-2446
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.2, 0.10.0, 0.11
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.13.0


 From hadoop 20.203+, HDFS_BYTES_READ changed its meaning. It no longer means
 the size of the input files; it is the total HDFS bytes read by the job. Pig
 needs a way to get the map input bytes to retain the old behavior.
 TestPigRunner.testGetHadoopCounters tests this and is temporarily disabled for
 hadoop 203+.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2584) Command line arguments for Pig script

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2584:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Command line arguments for Pig script
 -

 Key: PIG-2584
 URL: https://issues.apache.org/jira/browse/PIG-2584
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Priority: Minor
 Fix For: 0.13.0


 We did that for embedded Jython scripts. It is also useful in Pig scripts themselves:
 command line: pig a.pig student.txt output
 a.pig:
 a = load '$1' as (a0, a1);
 store a into '$2';

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2521) explicit reference to namenode path with streaming results in an error

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2521:


Fix Version/s: (was: 0.12.0)
   0.13.0

 explicit reference to namenode path with streaming results in an error
 --

 Key: PIG-2521
 URL: https://issues.apache.org/jira/browse/PIG-2521
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.2
Reporter: Araceli Henley
Priority: Minor
 Fix For: 0.13.0


 I set this to minor because this test works with client side tables and with 
 old style references.
 ::
 /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327441396/dotNext_baseline_15.pig
 ::
 THIS TEST FAILS. It uses an explicit reference to namenode1 
 (hdfs://namenode1.domain.com:8020)
 define CMD `perl PigStreamingDepend.pl` input(stdin) 
 ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl',
  
 '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm');
 A = load 'hdfs://namdenode1.domain.com:8020/user/hadoopqa/pig/tests/data';
 B = stream A through `perl PigStreaming.pl`;
 C = stream B through CMD as (name, age, gpa);
 D = foreach C generate name, age;
 store D into 
 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out';
 fs -cp 
 hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out
  /user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out
 ::
 /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327441396/dotNext_baseline_1.pig
 ::
 This test PASSES. It uses an explicit reference to 
 NN1(hdfs://namenode1.domain.com:8020) for load and store
 a = load 
 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/tests/data/singlefile/studenttab10k'
  as (name, age, gpa);
 store a into 
 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out'
  ;
 fs -cp 
 hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out
  /user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out
 THE REMAINING TESTS ARE IDENTICAL EXCEPT FOR THE FILE REFERENCE: explicit vs 
 mount point
 ::
  
 /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327433551/dotNext_baseline_15.pig
 ::
 This test PASSES. It's the baseline for the test; it uses old-style references.
 define CMD `perl PigStreamingDepend.pl` input(stdin) 
 ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl',
  
 '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm');
 A = load '/user/hadoopqa/pig/tests/data';
 B = stream A through `perl PigStreaming.pl`;
 C = stream B through CMD as (name, age, gpa);
 D = foreach C generate name, age;
 store D into 
 '/user/hadoopqa/pig/out/hadoopqa.1327433551/dotNext_baseline_15.out';
 ::
 grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327431567/dotNext_baseline_15.pig
 ::
 This test PASSES. It uses a mount point to namenode1 (/data1 is a mount 
 point for hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/tests/data).
 define CMD `perl PigStreamingDepend.pl` input(stdin) 
 ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl',
  
 '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm');
 A = load '/data1';
 B = stream A through `perl PigStreaming.pl`;
 C = stream B through CMD as (name, age, gpa);
 D = foreach C generate name, age;
 store D into 
 '/out1/user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out';
 fs -cp 
 /out1/user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out 
 /user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2537) Output from flatten with a null tuple input generating data inconsistent with the schema

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2537:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Output from flatten with a null tuple input generating data inconsistent with 
 the schema
 

 Key: PIG-2537
 URL: https://issues.apache.org/jira/browse/PIG-2537
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0, 0.9.0
Reporter: Xuefu Zhang
Assignee: Daniel Dai
 Fix For: 0.13.0

 Attachments: PIG-2537-1.patch, PIG-2537-2.patch, PIG-2537-3.patch


 For the following pig script,
 grunt A = load 'file' as ( a : tuple( x, y, z ), b, c );
 grunt B = foreach A generate flatten( $0 ), b, c;
 grunt describe B;
 B: {a::x: bytearray,a::y: bytearray,a::z: bytearray,b: bytearray,c: bytearray}
 Alias B has a clear schema.
 However, on the backend, if $0 happens to be null for a row, the output 
 tuple becomes something like 
 (null, b_value, c_value), which is obviously inconsistent with the schema. 
 The behaviour is confirmed by inspection of the Pig code. 
 This inconsistency corrupts data because of position shifts. The expected 
 output row should be something like 
 (null, null, null, b_value, c_value).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2599) Mavenize Pig

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2599:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Mavenize Pig
 

 Key: PIG-2599
 URL: https://issues.apache.org/jira/browse/PIG-2599
 Project: Pig
  Issue Type: New Feature
  Components: build
Reporter: Daniel Dai
  Labels: gsoc2013
 Fix For: 0.13.0

 Attachments: maven-pig.1.zip


 Switch Pig build system from ant to maven.
 This is a candidate project for Google summer of code 2013. More information 
 about the program can be found at 
 https://cwiki.apache.org/confluence/display/PIG/GSoc2013

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2625) Allow use of JRuby for control flow

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2625:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Allow use of JRuby for control flow
 ---

 Key: PIG-2625
 URL: https://issues.apache.org/jira/browse/PIG-2625
 Project: Pig
  Issue Type: New Feature
Reporter: Jonathan Coveney
 Fix For: 0.13.0


 Much like people can use Jython for iterative computation, it would be great 
 to be able to use JRuby for the same purpose.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-09-25 Thread Jeremy Karn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Karn updated PIG-2417:
-

Attachment: PIG-2417-unicode.patch

 Streaming UDFs -  allow users to easily write UDFs in scripting languages 
 with no JVM implementation.
 -

 Key: PIG-2417
 URL: https://issues.apache.org/jira/browse/PIG-2417
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.12.0
Reporter: Jeremy Karn
Assignee: Jeremy Karn
 Fix For: 0.12.0

 Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, 
 PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9-1.patch, PIG-2417-9-2.patch, 
 PIG-2417-9.patch, PIG-2417-e2e.patch, PIG-2417-unicode.patch, 
 streaming2.patch, streaming3.patch, streaming.patch


 The goal of Streaming UDFs is to allow users to easily write UDFs in 
 scripting languages with no JVM implementation or a limited JVM 
 implementation.  The initial proposal is outlined here: 
 https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
 In order to implement this we need new syntax to distinguish a streaming UDF 
 from an embedded JVM UDF.  I'd propose something like the following (although 
 I'm not sure 'language' is the best term to be using):
 {code}define my_streaming_udfs language('python') 
 ship('my_streaming_udfs.py'){code}
 We'll also need a language-specific controller script that gets shipped to 
 the cluster which is responsible for reading the input stream, deserializing 
 the input data, passing it to the user written script, serializing that 
 script output, and writing that to the output stream.
 Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This 
 class will likely share some of the existing code in POStream and 
 ExecutableManager (where it makes sense to pull out shared code) to stream 
 data to/from the controller script.
 One alternative approach to creating the StreamingUDF EvalFunc is to use the 
 POStream operator directly.  This would involve inserting the POStream 
 operator instead of the POUserFunc operator whenever we encountered a 
 streaming UDF while building the physical plan.  This approach seemed 
 problematic because there would need to be a lot of changes in order to 
 support POStream in all of the places we want to be able to use UDFs (for 
 example, to operate on a single field inside of a foreach statement).
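 Putting the proposed pieces together, a usage sketch under the proposed (not 
 yet final) syntax might look like the following; the UDF namespace, file name, 
 function name, and field names are illustrative only:
 {code}
 -- my_streaming_udfs.py would hold ordinary Python functions with no JVM
 -- implementation; name_length() below is a hypothetical example function
 define my_streaming_udfs language('python') ship('my_streaming_udfs.py');
 a = load 'students' as (name:chararray, age:int, gpa:double);
 b = foreach a generate name, my_streaming_udfs.name_length(name);
 {code}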

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2591:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Unit tests should not write to /tmp but respect java.io.tmpdir
 --

 Key: PIG-2591
 URL: https://issues.apache.org/jira/browse/PIG-2591
 Project: Pig
  Issue Type: Bug
  Components: tools
Reporter: Thomas Weise
Assignee: Jarek Jarcec Cecho
 Fix For: 0.13.0

 Attachments: bugPIG-2591.patch, PIG-2495.patch


 Several tests use /tmp but should derive temporary file location from 
 java.io.tmpdir to avoid side effects (java.io.tmpdir is already set to a test 
 run specific location in build.xml)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2595) BinCond only works inside parentheses

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2595:


Fix Version/s: (was: 0.12.0)
   0.13.0

 BinCond only works inside parentheses
 -

 Key: PIG-2595
 URL: https://issues.apache.org/jira/browse/PIG-2595
 Project: Pig
  Issue Type: Bug
Reporter: Daniel Dai
 Fix For: 0.13.0


 Not sure if we already have a JIRA for this. This script does not work:
 {code}
 a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() 
 as (name, age:int, gpa:double, instate:chararray);
 b = foreach a generate name, instate=='true'?gpa:gpa+1;
 dump b;
 {code}
 If we put the bincond inside parentheses, it works:
 {code}
 a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() 
 as (name, age:int, gpa:double, instate:chararray);
 b = foreach a generate name, (instate=='true'?gpa:gpa+1);
 dump b;
 {code}
 Exception:
 ERROR 1200: file 40.pig, line 2, column 36  mismatched input '==' expecting 
 SEMI_COLON
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
 parsing. file 40.pig, line 2, column 36  mismatched input '==' expecting 
 SEMI_COLON
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1541)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:541)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:945)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:392)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
 at org.apache.pig.Main.run(Main.java:599)
 at org.apache.pig.Main.main(Main.java:153)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 Caused by: Failed to parse: file 40.pig, line 2, column 36  mismatched 
 input '==' expecting SEMI_COLON
 at 
 org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:226)
 at 
 org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:168)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1590)
 ... 14 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2624) Handle recursive inclusion of scripts in JRuby UDFs

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2624:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Handle recursive inclusion of scripts in JRuby UDFs
 ---

 Key: PIG-2624
 URL: https://issues.apache.org/jira/browse/PIG-2624
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.10.0, 0.11
Reporter: Jonathan Coveney
  Labels: JRuby
 Fix For: 0.13.0


 Currently, if you have a script which requires another script, the 
 dependency won't be properly handled.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-09-25 Thread Jeremy Karn (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777997#comment-13777997
 ] 

Jeremy Karn commented on PIG-2417:
--

I can't reproduce the problem but I think PIG-2417-unicode.patch should fix the 
encoding issue.

 Streaming UDFs -  allow users to easily write UDFs in scripting languages 
 with no JVM implementation.
 -

 Key: PIG-2417
 URL: https://issues.apache.org/jira/browse/PIG-2417
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.12.0
Reporter: Jeremy Karn
Assignee: Jeremy Karn
 Fix For: 0.12.0

 Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, 
 PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9-1.patch, PIG-2417-9-2.patch, 
 PIG-2417-9.patch, PIG-2417-e2e.patch, PIG-2417-unicode.patch, 
 streaming2.patch, streaming3.patch, streaming.patch


 The goal of Streaming UDFs is to allow users to easily write UDFs in 
 scripting languages with no JVM implementation or a limited JVM 
 implementation.  The initial proposal is outlined here: 
 https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
 In order to implement this we need new syntax to distinguish a streaming UDF 
 from an embedded JVM UDF.  I'd propose something like the following (although 
 I'm not sure 'language' is the best term to be using):
 {code}define my_streaming_udfs language('python') 
 ship('my_streaming_udfs.py'){code}
 We'll also need a language-specific controller script that gets shipped to 
 the cluster which is responsible for reading the input stream, deserializing 
 the input data, passing it to the user written script, serializing that 
 script output, and writing that to the output stream.
 Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This 
 class will likely share some of the existing code in POStream and 
 ExecutableManager (where it makes sense to pull out shared code) to stream 
 data to/from the controller script.
 One alternative approach to creating the StreamingUDF EvalFunc is to use the 
 POStream operator directly.  This would involve inserting the POStream 
 operator instead of the POUserFunc operator whenever we encountered a 
 streaming UDF while building the physical plan.  This approach seemed 
 problematic because there would need to be a lot of changes in order to 
 support POStream in all of the places we want to be able to use UDFs (for 
 example, to operate on a single field inside of a foreach statement).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2628) Allow in line scripting UDF definitions

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2628:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Allow in line scripting UDF definitions
 ---

 Key: PIG-2628
 URL: https://issues.apache.org/jira/browse/PIG-2628
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
 Fix For: 0.13.0


 For small UDFs in scripting languages, it may be cumbersome to force users to 
 make a script, put it on the classpath, ship it, etc. It would be great to 
 support a syntax that allows people to declare UDFs inline (essentially, to 
 define a snippet of code that will be interpreted as a scriptlet).
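 One hypothetical shape such an inline definition could take (purely 
 illustrative; no such 'inline' clause or square() function exists in Pig today):
 {code}
 -- hypothetical inline scripting UDF; the inline(...) clause is invented here
 define myfuncs language('python') inline('
 def square(x):
     return x * x
 ');
 a = load 'nums' as (x:int);
 b = foreach a generate myfuncs.square(x);
 {code}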

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2630) Issue with setting b = a;

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2630:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Issue with setting b = a;
 ---

 Key: PIG-2630
 URL: https://issues.apache.org/jira/browse/PIG-2630
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.0, 0.11
Reporter: Jonathan Coveney
 Fix For: 0.13.0


 The following gives an error:
 {code}
 a = load 'thing' as (x:int);
 b = a; c = join a by x, b by x;
 {code}
 Error:
 {code}
 2012-04-03 14:02:47,434 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1200: Pig script failed to parse: 
 line 14, column 4 pig script failed to validate: 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection 
 with nothing to reference!
 {code}
 There is no issue with the following, however:
 {code}
 a = load 'thing' as (x:int);
 b = foreach a generate *;
 c = join a by x, b by x;
 {code}
 Oh, and here is the log:
 {code}
 $ cat pig_1333487146863.log
 Pig Stack Trace
 ---
 ERROR 1200: Pig script failed to parse: 
 line 3, column 4 pig script failed to validate: 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection 
 with nothing to reference!
 Failed to parse: Pig script failed to parse: 
 line 3, column 4 pig script failed to validate: 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection 
 with nothing to reference!
   at 
 org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:182)
   at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1566)
   at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1539)
   at org.apache.pig.PigServer.registerQuery(PigServer.java:541)
   at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:945)
   at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:392)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
   at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
   at org.apache.pig.Main.run(Main.java:535)
   at org.apache.pig.Main.main(Main.java:153)
 Caused by: 
 line 3, column 4 pig script failed to validate: 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection 
 with nothing to reference!
   at 
 org.apache.pig.parser.LogicalPlanBuilder.buildJoinOp(LogicalPlanBuilder.java:363)
   at 
 org.apache.pig.parser.LogicalPlanGenerator.join_clause(LogicalPlanGenerator.java:11441)
   at 
 org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1491)
   at 
 org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:791)
   at 
 org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:509)
   at 
 org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:384)
   at 
 org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:175)
   ... 10 more
 
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Review Request 14274: PIG-2672 Optimize the use of DistributedCache

2013-09-25 Thread Aniket Mokashi


 On Sept. 24, 2013, 6:30 a.m., Cheolsoo Park wrote:
  trunk/test/org/apache/pig/test/TestJobControlCompiler.java, line 161
  https://reviews.apache.org/r/14274/diff/1/?file=355177#file355177line161
 
   The following line is missing in the RB diff but it's in the attached 
   patch:
  
  properties.setProperty(PigConstants.PIG_SHARED_CACHE_ENABLED_KEY, 
  true);
  
  Just pointing it out.

 I realized that we do not need to have the PIG_SHARED_CACHE_ENABLED_KEY property 
for this. So, I removed this unnecessary property from the RB. I will attach 
the patch with the changes soon.


- Aniket


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14274/#review26342
---


On Sept. 21, 2013, 1:21 a.m., Aniket Mokashi wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/14274/
 ---
 
 (Updated Sept. 21, 2013, 1:21 a.m.)
 
 
 Review request for pig, Cheolsoo Park, DanielWX DanielWX, Dmitriy Ryaboy, 
 Julien Le Dem, and Rohini Palaniswamy.
 
 
 Bugs: PIG-2672
 https://issues.apache.org/jira/browse/PIG-2672
 
 
 Repository: pig
 
 
 Description
 ---
 
 added jar.cache.location option
 
 
 Diffs
 -
 
   trunk/src/org/apache/pig/PigConstants.java 1525188 
   
 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
  1525188 
   trunk/src/org/apache/pig/impl/PigContext.java 1525188 
   trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1525188 
   trunk/test/org/apache/pig/test/TestJobControlCompiler.java 1525188 
 
 Diff: https://reviews.apache.org/r/14274/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Aniket Mokashi
 




Re: Review Request 14274: PIG-2672 Optimize the use of DistributedCache

2013-09-25 Thread Aniket Mokashi


 On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote:
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java,
   line 1492
  https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1492
 
  If it is an HDFS path, use it as is and do not ship it to the jar cache. 
   That will also save time and hash checks.

 Currently, PigServer#registerJar localizes all the jars. So, this would need 
 some more refactoring before we can do this. I will try to solve this in a 
 separate JIRA.


 On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote:
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java,
   line 1495
  https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1495
 
  Since the name of the file on hdfs is different from that of the actual 
  file, create a symlink with the actual filename. Some users might depend on 
  the actual file name.
 
 Rohini Palaniswamy wrote:
 One case I see is Python scripts (Jython UDFs), which do imports based on 
 the file name. It would be the same for other scripting languages that we 
 support. It would be good to run the full unit and e2e tests with your patch 
 before going for a commit.

May be I should avoid renaming the files and just put them under 
/a/b/c/abcdefsha1/udf.jar.


 On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote:
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java,
   line 1509
  https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1509
 
  First do a file size comparison before calculating the checksum, for better 
   efficiency.

A size check would require stat calls to the namenode; this being local, it 
should be quicker than that.


- Aniket


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14274/#review26364
---


On Sept. 21, 2013, 1:21 a.m., Aniket Mokashi wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/14274/
 ---
 
 (Updated Sept. 21, 2013, 1:21 a.m.)
 
 
 Review request for pig, Cheolsoo Park, DanielWX DanielWX, Dmitriy Ryaboy, 
 Julien Le Dem, and Rohini Palaniswamy.
 
 
 Bugs: PIG-2672
 https://issues.apache.org/jira/browse/PIG-2672
 
 
 Repository: pig
 
 
 Description
 ---
 
 added jar.cache.location option
 
 
 Diffs
 -
 
   trunk/src/org/apache/pig/PigConstants.java 1525188 
   
 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
  1525188 
   trunk/src/org/apache/pig/impl/PigContext.java 1525188 
   trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1525188 
   trunk/test/org/apache/pig/test/TestJobControlCompiler.java 1525188 
 
 Diff: https://reviews.apache.org/r/14274/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Aniket Mokashi
 




[jira] [Updated] (PIG-2631) Pig should allow self joins

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2631:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Pig should allow self joins
 ---

 Key: PIG-2631
 URL: https://issues.apache.org/jira/browse/PIG-2631
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
 Fix For: 0.13.0


 This doesn't even have to be optimized, and can still involve a double scan 
 of the data, but there is no reason why the following should work:
 {code}
 a = load 'thing' as (x:int);
 b = join a by x, (foreach a generate *) by x;
 {code}
 while this does not:
 {code}
 a = load 'thing' as (x:int);
 b = join a by x, a by x;
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2633) Create a SchemaBag which generates a Bag with a known Schema via code gen

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2633:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Create a SchemaBag which generates a Bag with a known Schema via code gen
 -

 Key: PIG-2633
 URL: https://issues.apache.org/jira/browse/PIG-2633
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
Assignee: Jonathan Coveney
 Fix For: 0.13.0


 This is related to PIG-2632. The idea is to also extend that and create a 
 known version based on a given Schema.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-09-25 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778025#comment-13778025
 ] 

Rohini Palaniswamy commented on PIG-2417:
-

+1 for PIG-2417-unicode.patch. That worked. Committed to 0.12 and trunk. Thanks 
Jeremy.

 Streaming UDFs -  allow users to easily write UDFs in scripting languages 
 with no JVM implementation.
 -

 Key: PIG-2417
 URL: https://issues.apache.org/jira/browse/PIG-2417
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.12.0
Reporter: Jeremy Karn
Assignee: Jeremy Karn
 Fix For: 0.12.0

 Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, 
 PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9-1.patch, PIG-2417-9-2.patch, 
 PIG-2417-9.patch, PIG-2417-e2e.patch, PIG-2417-unicode.patch, 
 streaming2.patch, streaming3.patch, streaming.patch


 The goal of Streaming UDFs is to allow users to easily write UDFs in 
 scripting languages with no JVM implementation or a limited JVM 
 implementation.  The initial proposal is outlined here: 
 https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
 In order to implement this we need new syntax to distinguish a streaming UDF 
 from an embedded JVM UDF.  I'd propose something like the following (although 
 I'm not sure 'language' is the best term to be using):
 {code}define my_streaming_udfs language('python') 
 ship('my_streaming_udfs.py'){code}
 We'll also need a language-specific controller script that gets shipped to 
 the cluster which is responsible for reading the input stream, deserializing 
 the input data, passing it to the user written script, serializing that 
 script output, and writing that to the output stream.
 Finally, we'll need to add a StreamingUDF class that extends EvalFunc.  This 
 class will likely share some of the existing code in POStream and 
 ExecutableManager (where it makes sense to pull out shared code) to stream 
 data to/from the controller script.
 One alternative approach to creating the StreamingUDF EvalFunc is to use the 
 POStream operator directly.  This would involve inserting the POStream 
 operator instead of the POUserFunc operator whenever we encountered a 
 streaming UDF while building the physical plan.  This approach seemed 
 problematic because there would need to be a lot of changes in order to 
 support POStream in all of the places we want to be able to use UDFs (for 
 example, to operate on a single field inside of a foreach statement).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2834) MultiStorage requires unused constructor argument

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2834:


Fix Version/s: (was: 0.12.0)
   0.13.0

 MultiStorage requires unused constructor argument
 -

 Key: PIG-2834
 URL: https://issues.apache.org/jira/browse/PIG-2834
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.10.0, 0.11
 Environment: Linux
Reporter: Danny Antonetti
Priority: Trivial
  Labels: newbie
 Fix For: 0.13.0

 Attachments: MultiStorage.patch


 Each constructor in 
 org.apache.pig.piggybank.storage.MultiStorage 
 requires a constructor argument 'parentPathStr' that has no meaningful usage.
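 For context, a typical invocation looks roughly like the sketch below, if I 
 recall the piggybank usage correctly (the paths and the split-field index '0' 
 are illustrative); the first argument is the parentPathStr in question, which 
 duplicates the location already given in the STORE clause:
 {code}
 a = load 'input' as (id:chararray, val:chararray);
 -- the first constructor argument ('/out') repeats the STORE location;
 -- '0' is the index of the field to split the output on
 store a into '/out' using org.apache.pig.piggybank.storage.MultiStorage('/out', '0');
 {code}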

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2880) Pig current releases lack a UDF charAt.This UDF returns the char value at the specified index.

2013-09-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778036#comment-13778036
 ] 

Daniel Dai commented on PIG-2880:
-

[~sunitha muralidharan], are you still working on it?

 Pig current releases lack a UDF charAt.This UDF returns the char value at the 
 specified index.
 --

 Key: PIG-2880
 URL: https://issues.apache.org/jira/browse/PIG-2880
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Reporter: Sabir Ayappalli
  Labels: patch
 Fix For: 0.12.0

 Attachments: CharAt.java.patch


 Current Pig releases lack a charAt UDF. This UDF returns the char value at the 
 specified index. An index ranges from 0 to length() - 1: the first char value 
 of the sequence is at index 0, the next at index 1, and so on.
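 A usage sketch, assuming the UDF were added to piggybank under a hypothetical 
 class name such as CharAt (the jar path and field name are also illustrative):
 {code}
 register piggybank.jar;  -- path to piggybank.jar is illustrative
 a = load 'students' as (name:chararray);
 -- hypothetical UDF class: returns the character of name at index 0
 b = foreach a generate org.apache.pig.piggybank.evaluation.string.CharAt(name, 0);
 {code}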

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2687) Add relation/operator scoping to Pig

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2687:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Add relation/operator scoping to Pig
 

 Key: PIG-2687
 URL: https://issues.apache.org/jira/browse/PIG-2687
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
Priority: Minor
 Fix For: 0.13.0


 The idea is to add a real notion of scope that can be used to manage 
 namespaces. This would mean the addition of blocks to Pig, probably with some 
 sort of syntax like this...
 {code}
 a = load 'thing' as (x:int, y:int);
 b = foreach a generate x, y, x*y as z;
 {
   a = group b by z;
   b = foreach a generate COUNT(b);
   global b;
 }
 {code}
 which would replace the alias b with the nested b value in the scope. This 
 could also be used in nested foreach blocks, and macros could just become 
 blocks as well.
 I am 95% sure about how to implement this... I have a failed patch attempt, 
 and need to study a bit more about how Pig uses its logical operators.
 Any thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps

2013-09-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778037#comment-13778037
 ] 

Daniel Dai commented on PIG-2641:
-

[~russell.jurney], are you still working on it?

 Create toJSON function for all complex types: tuples, bags and maps
 ---

 Key: PIG-2641
 URL: https://issues.apache.org/jira/browse/PIG-2641
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.12.0
 Environment: Foggy. Damn foggy.
Reporter: Russell Jurney
Assignee: Russell Jurney
  Labels: chararray, fun, happy, input, json, output, pants, pig, 
 piggybank, string, wonderdog
 Fix For: 0.12.0

 Attachments: PIG-2641-2.patch, PIG-2641-3.patch, PIG-2641-4.patch, 
 PIG-2641-5.patch, PIG-2641-6.patch, PIG-2641.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 It is a travesty that there are no UDFs in Piggybanks that, given an 
 arbitrary Pig datatype, return a JSON string of same. I intend to fix this 
 problem.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2681) TestDriverPig.countStores() does not correctly count the number of stores for pig scripts using variables for the alias

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2681:


Fix Version/s: (was: 0.12.0)
   0.13.0

 TestDriverPig.countStores() does not correctly count the number of stores for 
 pig scripts using variables for the alias
 ---

 Key: PIG-2681
 URL: https://issues.apache.org/jira/browse/PIG-2681
 Project: Pig
  Issue Type: Test
  Components: e2e harness
Affects Versions: 0.9.0, 0.9.1, 0.9.2, 0.10.0
Reporter: Araceli Henley
 Fix For: 0.13.0

 Attachments: PIG-2681.patch


 For Pig macros where the out parameter is referenced in a store statement, 
 TestDriverPig.countStores() does not correctly count the number of stores. 
 For example, the store will not be counted in:
 define myMacro(in1,in2) returns A {
  A  = load '$in1' using PigStorage('$delimeter') as (intnum1000: int,id: 
 int,intnum5: int,intnum100: int,intnum: int,longnum: long,floatnum: 
 float,doublenum: double);
store $A into '$out';
 }
  countStores() matches with:
  $count += $q[$i] =~ /store\s+[a-zA-Z][a-zA-Z0-9_]*\s+into/i;
 Since the alias contains the special character $, it doesn't count the store 
 and the test fails.
 Need to change this to:
$count += $q[$i] =~ /store\s+(\$)?[a-zA-Z][a-zA-Z0-9_]*\s+into/i;
 I'll submit a patch shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2927) SHIP and use JRuby gems in JRuby UDFs

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2927:


Fix Version/s: (was: 0.12.0)
   0.13.0

 SHIP and use JRuby gems in JRuby UDFs
 -

 Key: PIG-2927
 URL: https://issues.apache.org/jira/browse/PIG-2927
 Project: Pig
  Issue Type: New Feature
  Components: parser
Affects Versions: 0.11
 Environment: JRuby UDFs
Reporter: Russell Jurney
Assignee: Jonathan Coveney
Priority: Minor
 Fix For: 0.13.0

 Attachments: PIG-2927-0.patch, PIG-2927-1.patch, PIG-2927-2.patch, 
 PIG-2927-3.patch, PIG-2927-4.patch


 It would be great to use JRuby gems in JRuby UDFs without installing them on 
 every machine in the cluster. Some way to SHIP them automatically with the job 
 would help.
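 For context, JRuby UDFs are registered roughly as sketched below (the file and 
 function names are illustrative); the gap described here is that any gem 
 required inside the Ruby file must already be installed on every node:
 {code}
 -- myudfs.rb defines Ruby UDFs; a require of a gem inside it currently only
 -- works if that gem is installed on every cluster node
 register 'myudfs.rb' using jruby as myfuncs;
 a = load 'students' as (name:chararray);
 b = foreach a generate myfuncs.some_function(name);
 {code}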

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3008) Fix whitespace in Pig code

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3008:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Fix whitespace in Pig code
 --

 Key: PIG-3008
 URL: https://issues.apache.org/jira/browse/PIG-3008
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
 Fix For: 0.13.0

 Attachments: checkstyle.xml


 This JIRA exists mainly to get a conversation started. We've talked about it 
 before, and it's a tricky issue. That said, some of the Pig code is super, 
 super gnarly. We need some sort of path that will let it eventually be 
 fixable.
 I posit: any file that hasn't been touched for over 6 months is eligible for 
 a whitespace patch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3055) Make it possible to register new script engines

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3055:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Make it possible to register new script engines
 ---

 Key: PIG-3055
 URL: https://issues.apache.org/jira/browse/PIG-3055
 Project: Pig
  Issue Type: Improvement
Reporter: Greg Bowyer
Assignee: Greg Bowyer
 Fix For: 0.13.0

 Attachments: 
 PIG-3055-Make-it-possible-to-register-a-script-engine.patch


 Hi shiny Pig people,
 I have recently been playing around with getting Renjin to work as a script 
 engine in Pig, in the same manner as Jython, Ruby, etc.
 Renjin is a re-implementation of R in Java.
 For now the Renjin project is in its infancy and is probably not best suited 
 to being bundled with Pig, so I need to be able to extend the 
 ScriptEngine interface and register Renjin as a suitable engine for Pig to 
 use.
 At present the parts of Pig that know about script engines are not easily 
 changed; attached is a patch that should make this possible.
 Thoughts? Ideas?
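 With such a change, a script could point REGISTER at an engine class outside 
 the built-in set; a sketch, where the Renjin engine class name and script are 
 hypothetical:
 {code}
 -- the engine class below is hypothetical; the built-in Jython equivalent is
 -- org.apache.pig.scripting.jython.JythonScriptEngine
 register 'analysis.R' using com.example.renjin.RenjinScriptEngine as rfuncs;
 a = load 'data' as (x:double);
 b = foreach a generate rfuncs.transform(x);
 {code}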

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3010) Allow UDF's to flatten themselves

2013-09-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778045#comment-13778045
 ] 

Daniel Dai commented on PIG-3010:
-

[~jcoveney], are you still working on it?

 Allow UDF's to flatten themselves
 -

 Key: PIG-3010
 URL: https://issues.apache.org/jira/browse/PIG-3010
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
Assignee: Jonathan Coveney
 Fix For: 0.12.0

 Attachments: PIG-3010-0.patch, PIG-3010-1.patch, 
 PIG-3010-2_nowhitespace.patch, PIG-3010-2.patch, PIG-3010-3_nows.patch, 
 PIG-3010-3.patch, PIG-3010-4_nows.patch, PIG-3010-4.patch, 
 PIG-3010-5_nows.patch, PIG-3010-5.patch


 This is something I thought would be cool for a while, so I sat down and did 
 it because I think there are some useful debugging tools it'd help with.
 The idea is that if you attach an annotation to a UDF, the Tuple or DataBag 
 you output will be flattened. This is quite powerful. A very common pattern 
 is:
 a = foreach data generate Flatten(MyUdf(thing)) as (a,b,c);
 This would let you just do:
 a = foreach data generate MyUdf(thing);
 With the exact same result!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3038) Support for Credentials for UDF,Loader and Storer

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3038:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Support for Credentials for UDF,Loader and Storer
 -

 Key: PIG-3038
 URL: https://issues.apache.org/jira/browse/PIG-3038
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.10.0
Reporter: Rohini Palaniswamy
 Fix For: 0.13.0


   Pig does not have a clean way (APIs) to support adding Credentials (HBase 
 token, HCat/Hive metastore token) to a Job and retrieving them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps

2013-09-25 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778061#comment-13778061
 ] 

Russell Jurney commented on PIG-2641:
-

How long do I have to get this into 0.12? Is that still possible?

 Create toJSON function for all complex types: tuples, bags and maps
 ---

 Key: PIG-2641
 URL: https://issues.apache.org/jira/browse/PIG-2641
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.12.0
 Environment: Foggy. Damn foggy.
Reporter: Russell Jurney
Assignee: Russell Jurney
  Labels: chararray, fun, happy, input, json, output, pants, pig, 
 piggybank, string, wonderdog
 Fix For: 0.12.0

 Attachments: PIG-2641-2.patch, PIG-2641-3.patch, PIG-2641-4.patch, 
 PIG-2641-5.patch, PIG-2641-6.patch, PIG-2641.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 It is a travesty that there are no UDFs in Piggybanks that, given an 
 arbitrary Pig datatype, return a JSON string of same. I intend to fix this 
 problem.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps

2013-09-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778069#comment-13778069
 ] 

Daniel Dai commented on PIG-2641:
-

Since 0.12 is already branched, we are not supposed to commit new features. Can 
we defer this to 0.13.0?

 Create toJSON function for all complex types: tuples, bags and maps
 ---

 Key: PIG-2641
 URL: https://issues.apache.org/jira/browse/PIG-2641
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.12.0
 Environment: Foggy. Damn foggy.
Reporter: Russell Jurney
Assignee: Russell Jurney
  Labels: chararray, fun, happy, input, json, output, pants, pig, 
 piggybank, string, wonderdog
 Fix For: 0.12.0

 Attachments: PIG-2641-2.patch, PIG-2641-3.patch, PIG-2641-4.patch, 
 PIG-2641-5.patch, PIG-2641-6.patch, PIG-2641.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 It is a travesty that there are no UDFs in Piggybanks that, given an 
 arbitrary Pig datatype, return a JSON string of same. I intend to fix this 
 problem.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders

2013-09-25 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3404:
--

Description: 
`There are use cases in Pig:
* A directory is used as the input of a load operation. It is possible that one 
or more files in that directory are bad files (for example, corrupted or bad 
data caused by compression).
* A directory is used as the input of a load operation. The current user may 
not have permission to access any subdirectories or files of that directory.

The current Pig implementation will abort the whole Pig job for such cases. It 
would be useful to have option to allow the job to continue and ignore the bad 
files or inaccessible files/folders without abort the job, ideally, log or 
print a warning for such error or violations. This requirement is not trivial 
because for big data set for large analytics applications, this is not always 
possible to sort out the  good data for processing; Ignore a few of bad files 
may be a better choice for such situations.

We propose to use “Ignore bad files” flag to address this problem. AvroStorage 
and related file format in Pig already has this flag but it is not complete to 
cover all the cases mentioned above. We would improve the PigStorage and 
related text format to support this new flag as well as improve AvroStorage and 
related facilities to completely support the concept.

The flag is “Storage” (For example, PigStorage or AvroStorage) based and can be 
set for each load operation respectively. The value of this flag will be false 
if it is not explicitly set. Ideally, we can provide a global pig parameter 
which forces the default value to true for all load functions even if it is not 
explicitly set in the LOAD statement.


  was:
There are use cases in Pig:
* A directory is used as the input of a load operation. It is possible that one 
or more files in that directory are bad files (for example, corrupted or bad 
data caused by compression).
* A directory is used as the input of a load operation. The current user may 
not have permission to access any subdirectories or files of that directory.

The current Pig implementation will abort the whole Pig job for such cases. It 
would be useful to have option to allow the job to continue and ignore the bad 
files or inaccessible files/folders without abort the job, ideally, log or 
print a warning for such error or violations. This requirement is not trivial 
because for big data set for large analytics applications, this is not always 
possible to sort out the  good data for processing; Ignore a few of bad files 
may be a better choice for such situations.

We propose to use “Ignore bad files” flag to address this problem. AvroStorage 
and related file format in Pig already has this flag but it is not complete to 
cover all the cases mentioned above. We would improve the PigStorage and 
related text format to support this new flag as well as improve AvroStorage and 
related facilities to completely support the concept.

The flag is “Storage” (For example, PigStorage or AvroStorage) based and can be 
set for each load operation respectively. The value of this flag will be false 
if it is not explicitly set. Ideally, we can provide a global pig parameter 
which forces the default value to true for all load functions even if it is not 
explicitly set in the LOAD statement.



 Improve Pig to ignore bad files or inaccessible files or folders
 

 Key: PIG-3404
 URL: https://issues.apache.org/jira/browse/PIG-3404
 Project: Pig
  Issue Type: New Feature
  Components: data
Affects Versions: 0.11.2
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: PIG-3404.patch


 `There are use cases in Pig:
 * A directory is used as the input of a load operation. It is possible that 
 one or more files in that directory are bad files (for example, corrupted or 
 bad data caused by compression).
 * A directory is used as the input of a load operation. The current user may 
 not have permission to access any subdirectories or files of that directory.
 The current Pig implementation will abort the whole Pig job for such cases. 
 It would be useful to have option to allow the job to continue and ignore the 
 bad files or inaccessible files/folders without abort the job, ideally, log 
 or print a warning for such error or violations. This requirement is not 
 trivial because for big data set for large analytics applications, this is 
 not always possible to sort out the  good data for processing; Ignore a few 
 of bad files may be a better choice for such situations.
 We propose to use “Ignore bad files” flag to address this problem. 
 AvroStorage and related file format in Pig already has this flag but it is 
 not complete to cover all the cases 

[jira] [Updated] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders

2013-09-25 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3404:
--

Description: 
There are use cases in Pig:
* A directory is used as the input of a load operation. It is possible that one 
or more files in that directory are bad files (for example, corrupted or bad 
data caused by compression).
* A directory is used as the input of a load operation. The current user may 
not have permission to access any subdirectories or files of that directory.

The current Pig implementation will abort the whole Pig job for such cases. It 
would be useful to have option to allow the job to continue and ignore the bad 
files or inaccessible files/folders without abort the job, ideally, log or 
print a warning for such error or violations. This requirement is not trivial 
because for big data set for large analytics applications, this is not always 
possible to sort out the  good data for processing; Ignore a few of bad files 
may be a better choice for such situations.

We propose to use “Ignore bad files” flag to address this problem. AvroStorage 
and related file format in Pig already has this flag but it is not complete to 
cover all the cases mentioned above. We would improve the PigStorage and 
related text format to support this new flag as well as improve AvroStorage and 
related facilities to completely support the concept.

The flag is “Storage” (For example, PigStorage or AvroStorage) based and can be 
set for each load operation respectively. The value of this flag will be false 
if it is not explicitly set. Ideally, we can provide a global pig parameter 
which forces the default value to true for all load functions even if it is not 
explicitly set in the LOAD statement.


  was:
`There are use cases in Pig:
* A directory is used as the input of a load operation. It is possible that one 
or more files in that directory are bad files (for example, corrupted or bad 
data caused by compression).
* A directory is used as the input of a load operation. The current user may 
not have permission to access any subdirectories or files of that directory.

The current Pig implementation will abort the whole Pig job for such cases. It 
would be useful to have option to allow the job to continue and ignore the bad 
files or inaccessible files/folders without abort the job, ideally, log or 
print a warning for such error or violations. This requirement is not trivial 
because for big data set for large analytics applications, this is not always 
possible to sort out the  good data for processing; Ignore a few of bad files 
may be a better choice for such situations.

We propose to use “Ignore bad files” flag to address this problem. AvroStorage 
and related file format in Pig already has this flag but it is not complete to 
cover all the cases mentioned above. We would improve the PigStorage and 
related text format to support this new flag as well as improve AvroStorage and 
related facilities to completely support the concept.

The flag is “Storage” (For example, PigStorage or AvroStorage) based and can be 
set for each load operation respectively. The value of this flag will be false 
if it is not explicitly set. Ideally, we can provide a global pig parameter 
which forces the default value to true for all load functions even if it is not 
explicitly set in the LOAD statement.



 Improve Pig to ignore bad files or inaccessible files or folders
 

 Key: PIG-3404
 URL: https://issues.apache.org/jira/browse/PIG-3404
 Project: Pig
  Issue Type: New Feature
  Components: data
Affects Versions: 0.11.2
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: PIG-3404.patch


 There are use cases in Pig:
 * A directory is used as the input of a load operation. It is possible that 
 one or more files in that directory are bad files (for example, corrupted or 
 bad data caused by compression).
 * A directory is used as the input of a load operation. The current user may 
 not have permission to access any subdirectories or files of that directory.
 The current Pig implementation will abort the whole Pig job for such cases. 
 It would be useful to have option to allow the job to continue and ignore the 
 bad files or inaccessible files/folders without abort the job, ideally, log 
 or print a warning for such error or violations. This requirement is not 
 trivial because for big data set for large analytics applications, this is 
 not always possible to sort out the  good data for processing; Ignore a few 
 of bad files may be a better choice for such situations.
 We propose to use “Ignore bad files” flag to address this problem. 
 AvroStorage and related file format in Pig already has this flag but it is 
 not complete to cover all the cases 

[jira] [Commented] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders

2013-09-25 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778119#comment-13778119
 ] 

Koji Noguchi commented on PIG-3404:
---

(I accidentally updated the description. Sorry for the spam.)

For this issue, can we use 
mapred.max.map.failures.percent (or mapreduce.map.failures.maxpercent in 2.*)?
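For reference, such a Hadoop property can be set per script with Pig's SET 
command; a minimal sketch, where the 10 percent threshold and paths are only 
illustrative:
{code}
-- allow up to 10% of map tasks to fail without failing the whole job
-- (the 10 here is only an illustrative threshold)
set mapred.max.map.failures.percent 10;
a = load '/data/dir' using PigStorage();
store a into '/out';
{code}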



 Improve Pig to ignore bad files or inaccessible files or folders
 

 Key: PIG-3404
 URL: https://issues.apache.org/jira/browse/PIG-3404
 Project: Pig
  Issue Type: New Feature
  Components: data
Affects Versions: 0.11.2
Reporter: Jerry Chen
  Labels: Rhino
 Attachments: PIG-3404.patch


 There are use cases in Pig:
 * A directory is used as the input of a load operation. It is possible that one or more files in that directory are bad (for example, corrupted, or bad data caused by compression).
 * A directory is used as the input of a load operation. The current user may not have permission to access some of the subdirectories or files of that directory.
 The current Pig implementation aborts the whole Pig job in such cases. It would be useful to have an option that lets the job continue and ignore the bad or inaccessible files/folders without aborting, ideally logging or printing a warning for each such error or violation. This is not a trivial requirement: for the big data sets of large analytics applications, it is not always possible to sort out the good data beforehand, and ignoring a few bad files may be the better choice in such situations.
 We propose an “Ignore bad files” flag to address this problem. AvroStorage and the related file formats in Pig already have this flag, but it does not cover all the cases mentioned above. We would improve PigStorage and the related text formats to support this new flag, and improve AvroStorage and related facilities to support the concept completely.
 The flag is “Storage”-based (for example, PigStorage or AvroStorage) and can be set for each load operation individually. The value of this flag is false unless it is explicitly set. Ideally, we could also provide a global Pig parameter that forces the default value to true for all load functions even when it is not explicitly set in the LOAD statement.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory.

2013-09-25 Thread Raviteja Chirala (JIRA)
Raviteja Chirala created PIG-3482:
-

 Summary: Mapper only Jobs are not creating intermediate files in 
/tmp/, instead creating in user directory. 
 Key: PIG-3482
 URL: https://issues.apache.org/jira/browse/PIG-3482
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
 Environment: RHEL 6.0
Reporter: Raviteja Chirala
Priority: Minor
 Fix For: 0.12.1




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory.

2013-09-25 Thread Raviteja Chirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raviteja Chirala updated PIG-3482:
--

Description: When we run mapper-only jobs, all the intermediate (compressed) 
outputs go to the user directory instead of to /tmp. For small datasets this 
shouldn't create a problem, but when I run on large datasets, say more than 
100 TB, it takes up so much disk space that it even exceeds a 100 GB disk space 
quota (setSpaceQuota). The problem happens before cleanup. 
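
If the immediate goal is only to steer intermediate output back under /tmp, a minimal sketch using the pig.temp.dir property (assuming that property is honored for these mapper-only jobs):

{code}
-- place Pig's temporary/intermediate files under /tmp on the default filesystem
set pig.temp.dir '/tmp';
a = LOAD '/data/input' USING PigStorage();
STORE a INTO '/data/output' USING PigStorage();
{code}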

 Mapper only Jobs are not creating intermediate files in /tmp/, instead 
 creating in user directory. 
 ---

 Key: PIG-3482
 URL: https://issues.apache.org/jira/browse/PIG-3482
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
 Environment: RHEL 6.0
Reporter: Raviteja Chirala
Priority: Minor
 Fix For: 0.12.1


 When we run mapper-only jobs, all the intermediate (compressed) outputs go to 
 the user directory instead of to /tmp. For small datasets this shouldn't 
 create a problem, but when I run on large datasets, say more than 100 TB, it 
 takes up so much disk space that it even exceeds a 100 GB disk space quota 
 (setSpaceQuota). The problem happens before cleanup. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead of creating in user directory.

2013-09-25 Thread Raviteja Chirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raviteja Chirala updated PIG-3482:
--

Summary: Mapper only Jobs are not creating intermediate files in /tmp/, 
instead of creating in user directory.   (was: Mapper only Jobs are not 
creating intermediate files in /tmp/, instead creating in user directory. )

 Mapper only Jobs are not creating intermediate files in /tmp/, instead of 
 creating in user directory. 
 --

 Key: PIG-3482
 URL: https://issues.apache.org/jira/browse/PIG-3482
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
 Environment: RHEL 6.0
Reporter: Raviteja Chirala
Priority: Minor
 Fix For: 0.12.1


 When we run mapper-only jobs, all the intermediate (compressed) outputs go to 
 the user directory instead of to /tmp. For small datasets this shouldn't 
 create a problem, but when I run on large datasets, say more than 100 TB, it 
 takes up so much disk space that it even exceeds a 100 GB disk space quota 
 (setSpaceQuota). The problem happens before cleanup. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3083) Introduce new syntax that lets you project just the columns that come from a given :: prefix

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3083:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Introduce new syntax that lets you project just the columns that come from a 
 given :: prefix
 -

 Key: PIG-3083
 URL: https://issues.apache.org/jira/browse/PIG-3083
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Jonathan Coveney
  Labels: PIG-3078
 Fix For: 0.13.0

 Attachments: pig_jira_aguin_3083.patch


 This is basically a more refined approach than PIG-3078, but it is also more 
 work. That JIRA is more of a stopgap until we do something like this.
 The idea would be to support something like the following:
 a = load 'a' as (x,y,z);
 b = load 'b'  as (x,y,z);
 c = join a by x, b by x;
 d = foreach c generate a::*;
 Obviously this is useful for any case where you have relations with columns 
 with various prefixes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3087) Refactor TestLogicalPlanBuilder to be meaningful

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3087:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Refactor TestLogicalPlanBuilder to be meaningful
 

 Key: PIG-3087
 URL: https://issues.apache.org/jira/browse/PIG-3087
 Project: Pig
  Issue Type: Bug
Reporter: Jonathan Coveney
  Labels: newbie
 Fix For: 0.13.0

 Attachments: PIG-3087-0.patch


 I started doing this as part of another patch, but there are some bigger 
 issues, and I don't have the time to dig in atm.
 That said, a lot of the tests as written don't test anything. I used more 
 modern junit patterns, and discovered we had a lot of tests that weren't 
 functioning properly. Making them function properly unveiled that the general 
 buildLp pattern doesn't work properly anymore for many cases where it would 
 throw an error in grunt, but for whatever reason no error is thrown in the 
 tests.
 Any test with _1 is a test that previously failed but now doesn't. Some, 
 however, don't make sense, so I think what really needs to be done is to figure 
 out which should be failing and which shouldn't, and then fix buildLp 
 accordingly.
 I will attach my pass at it, but it is incomplete and needs work.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3104) XMLLoader return Pig tuple/map/bag representation of the DOM of XML documents

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3104:


Fix Version/s: (was: 0.12.0)
   0.13.0

 XMLLoader return Pig tuple/map/bag representation of the DOM of XML documents
 -

 Key: PIG-3104
 URL: https://issues.apache.org/jira/browse/PIG-3104
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs, piggybank
Affects Versions: 0.10.0, 0.11
Reporter: Russell Jurney
Assignee: Daniel Dai
 Fix For: 0.13.0


 I want to extend Pig's existing XMLLoader to go beyond capturing the text 
 inside a tag and to actually create a Pig mapping of the Document Object 
 Model the XML represents. This would be similar to elephant-bird's 
 JsonLoader. Semi-structured data can vary, so this behavior can be risky 
 but... I want people to be able to load JSON and XML data easily in their first 
 session with Pig.
 ---
 characters = load 'example.xml' using XMLLoader('character');
 describe characters
  
 {properties:map[], name:chararray, born:datetime, qualification:chararray}
 ---
   <library>
     <book id="b0836217462" available="true">
       <isbn>0836217462</isbn>
       <title lang="en">Being a Dog Is a Full-Time Job</title>
       <author id="CMS">
         <name>Charles M Schulz</name>
         <born>1922-11-26</born>
         <dead>2000-02-12</dead>
       </author>
       <character id="PP">
         <name>Peppermint Patty</name>
         <born>1966-08-22</born>
         <qualification>bold, brash and tomboyish</qualification>
       </character>
       <character id="Snoopy">
         <name>Snoopy</name>
         <born>1950-10-04</born>
         <qualification>extroverted beagle</qualification>
       </character>
       <character id="Schroeder">
         <name>Schroeder</name>
         <born>1951-05-30</born>
         <qualification>brought classical music to the Peanuts strip</qualification>
       </character>
       <character id="Lucy">
         <name>Lucy</name>
         <born>1952-03-03</born>
         <qualification>bossy, crabby and selfish</qualification>
       </character>
     </book>
   </library>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag

2013-09-25 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778177#comment-13778177
 ] 

Daniel Dai commented on PIG-3377:
-

[~jadler], are you still working on it?

 New AvroStorage throws NPE when storing untyped map/array/bag
 -

 Key: PIG-3377
 URL: https://issues.apache.org/jira/browse/PIG-3377
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Reporter: Cheolsoo Park
Assignee: Joseph Adler
 Fix For: 0.12.0


 The following example demonstrates the issue:
 {code}
 a = LOAD 'foo' AS (m:map[]);
 STORE a INTO 'bar' USING AvroStorage();
 {code}
 This fails with the following error:
 {code}
 java.lang.NullPointerException
 at 
 org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceFieldSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:462)
 at 
 org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:335)
 at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:472)
 {code}
 Similarly, untyped bag causes the following error:
 {code}
 Caused by: java.lang.NullPointerException
 at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:722)
 ...
 at org.apache.avro.Schema.getElementType(Schema.java:256)
 at 
 org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:491)
 {code}
 The problem is that AvroStorage cannot derive the output schema from an untyped 
 map/bag/tuple. When the type is not defined, it should be assumed to be bytearray.
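
Until that is fixed, a possible workaround is to give the map an explicit value type so AvroStorage can derive an output schema; a minimal sketch (editor's illustration, not taken from the ticket):

{code}
-- declaring the value type avoids the untyped-map path that triggers the NPE
a = LOAD 'foo' AS (m:map[chararray]);
STORE a INTO 'bar' USING AvroStorage();
{code}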

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3326) Add PiggyBank to Maven Repository

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3326:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Add PiggyBank to Maven Repository
 -

 Key: PIG-3326
 URL: https://issues.apache.org/jira/browse/PIG-3326
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Reporter: Aaron Mitchell
Priority: Minor
 Fix For: 0.13.0


 PiggyBank should be uploaded to the apache maven repository.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3254) Fail a failed Pig script quicker

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3254:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Fail a failed Pig script quicker
 

 Key: PIG-3254
 URL: https://issues.apache.org/jira/browse/PIG-3254
 Project: Pig
  Issue Type: Improvement
Reporter: Daniel Dai
 Fix For: 0.13.0


 Credit to [~asitecn]. Currently Pig can launch several mapreduce jobs 
 simultaneously. When one mapreduce job fails, we need to wait for the 
 simultaneous mapreduce jobs to finish. In addition, we could potentially launch 
 additional jobs that are doomed to fail. However, this is unnecessary in some cases:
 * If stop.on.failure==true, we can kill the parallel jobs and fail the whole 
 script
 * If stop.on.failure==false and no store can succeed, we can also kill the 
 parallel jobs and fail the whole script
 Considering that simultaneous jobs may take a long time to finish, this could 
 significantly improve the turnaround in some cases.
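
For context, a minimal sketch of how stop-on-failure is selected today (assuming the stop.on.failure property can also be set from the script; the documented route is the -stop_on_failure command-line flag):

{code}
-- run with: pig -stop_on_failure script.pig, or set the property in the script:
set stop.on.failure true;
a = LOAD '/data/a' USING PigStorage();
b = LOAD '/data/b' USING PigStorage();
STORE a INTO '/out/a';    -- with multiple stores, several MR jobs can run simultaneously
STORE b INTO '/out/b';
{code}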

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3232) Refactor Pig so that configurations use PigConfiguration wherever possible

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3232:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Refactor Pig so that configurations use PigConfiguration wherever possible
 --

 Key: PIG-3232
 URL: https://issues.apache.org/jira/browse/PIG-3232
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
 Fix For: 0.13.0




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3143) Enable TOKENIZE to use any configurable Lucene Tokenizer, if a config parameter is set and the JARs included

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3143:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Enable TOKENIZE to use any configurable Lucene Tokenizer, if a config 
 parameter is set and the JARs included
 

 Key: PIG-3143
 URL: https://issues.apache.org/jira/browse/PIG-3143
 Project: Pig
  Issue Type: Improvement
  Components: impl, internal-udfs
Affects Versions: 0.11
Reporter: Russell Jurney
 Fix For: 0.13.0


 I'll do this in time for 12. TOKENIZE is literally useless as is. See: 
 http://thedatachef.blogspot.com/2011/04/lucene-text-tokenization-udf-for-apache.html
 https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java
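
For reference, the existing builtin that this ticket wants to make configurable is typically used as below; a minimal word-count-style sketch with illustrative paths and aliases:

{code}
docs   = LOAD '/data/docs.txt' AS (text:chararray);
-- TOKENIZE splits each line on whitespace/punctuation and returns a bag of words
words  = FOREACH docs GENERATE FLATTEN(TOKENIZE(text)) AS word;
counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT(words) AS n;
{code}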

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3157) Move LENGTH from Piggybank to builtin, make LENGTH work for multiple types similar to SIZE

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3157:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Move LENGTH from Piggybank to builtin, make LENGTH work for multiple types 
 similar to SIZE
 --

 Key: PIG-3157
 URL: https://issues.apache.org/jira/browse/PIG-3157
 Project: Pig
  Issue Type: Improvement
  Components: internal-udfs, piggybank
Affects Versions: 0.11
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.13.0


 LENGTH needs to be a builtin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3133) Revamp algebraic interface to actually return classes

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3133:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Revamp algebraic interface to actually return classes
 -

 Key: PIG-3133
 URL: https://issues.apache.org/jira/browse/PIG-3133
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
 Fix For: 0.13.0


 The current algebraic interface is a bit weird to work with. It would make a lot more sense to let people return Class<? extends EvalFunc<Tuple>> or what have you, or even a FuncSpec, but the current string-based approach circumvents the whole point of using Java and is annoying. I think we should have abstract EFInitial, EFIntermediate, and EFFinal classes which implement the exec function for the user in terms of a simpler, clearer interface. That way, people who really want the old way can keep it, but we can present them with something less ugly.
 This would also be a good time to clarify the contracts of Algebraics and simplify them (the initial function's "a tuple which contains a bag which contains 1 tuple" is super whack).
 If anyone wants to work on this, let me know, because this is the sort of thing I will probably bang out when procrastinating something else.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3146) Can't 'import re' in Pig 0.10/0.10.1: ImportError: No module named re

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3146:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Can't 'import re' in Pig 0.10/0.10.1: ImportError: No module named re
 -

 Key: PIG-3146
 URL: https://issues.apache.org/jira/browse/PIG-3146
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.0, 0.10.1
Reporter: Russell Jurney
 Fix For: 0.13.0


 Caused by: Traceback (most recent call last):
   File "udfs.py", line 20, in <module>
     import re
 ImportError: No module named re

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3176) Pig can't use $HOME in Grunt

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3176:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Pig can't use $HOME in Grunt
 

 Key: PIG-3176
 URL: https://issues.apache.org/jira/browse/PIG-3176
 Project: Pig
  Issue Type: Bug
  Components: grunt, parser
Affects Versions: 0.11
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.13.0


 Pig needs to know the user's home directory, to let this easily be set, etc.
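
A sketch of the kind of usage the ticket implies, with an illustrative path; today the $HOME form is not resolved, so the value has to be spelled out or passed in as a parameter:

{code}
-- grunt> cd $HOME/data;          -- what the ticket would like to work
%declare HOME_DIR '/user/jerry'   -- illustrative workaround: pass the home dir explicitly
a = LOAD '$HOME_DIR/data/input' USING PigStorage();
{code}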

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3165) sh command cannot run mongo client

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3165:


Fix Version/s: (was: 0.12.0)
   0.13.0

 sh command cannot run mongo client
 --

 Key: PIG-3165
 URL: https://issues.apache.org/jira/browse/PIG-3165
 Project: Pig
  Issue Type: Bug
  Components: grunt, tools
Affects Versions: 0.11
Reporter: Russell Jurney
Assignee: Russell Jurney
Priority: Critical
 Fix For: 0.13.0


 One often needs to drop an old MongoDB store when replacing it. Ex:
 store answer into 'mongodb://localhost/agile_data.hourly_from_reply_probs' 
 using MongoStorage();
 Before doing that you would likely want to run a mongo command from bash from 
 grunt, to drop it:
 sh mongo --eval 'db.hourly_from_reply_probs.drop();'
 However, in this case grunt acts as though the command never returns. Crap!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3177) Fix Pig project SEO so latest, 0.11 docs show when you google things

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3177:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Fix Pig project SEO so latest, 0.11 docs show when you google things
 

 Key: PIG-3177
 URL: https://issues.apache.org/jira/browse/PIG-3177
 Project: Pig
  Issue Type: Bug
  Components: site
Affects Versions: 0.11
Reporter: Russell Jurney
Assignee: Russell Jurney
Priority: Critical
 Fix For: 0.13.0


 http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/piggybank/storage/SequenceFileLoader.html
 The 0.7.0 docs are what everyone references. FOR POOPS SAKES.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3111) ToAvro to convert any Pig record to an Avro bytearray

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3111:


Fix Version/s: (was: 0.12.0)
   0.13.0

 ToAvro to convert any Pig record to an Avro bytearray
 -

 Key: PIG-3111
 URL: https://issues.apache.org/jira/browse/PIG-3111
 Project: Pig
  Issue Type: New Feature
  Components: data, internal-udfs
Affects Versions: 0.12.0
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.13.0


 I want to create a ToAvro() builtin that converts arbitrary pig fields, 
 including complex types (bags, tuples, maps) to avro format as bytearrays.
 This would enable storing Avro records in arbitrary data stores, for example 
 HBaseAvroStorage in PIG-2889
 See PIG-2641 for ToJson
 This points to a greater need for customizable/pluggable serialization that 
 plugs into storefuncs and does serialization independently. For example, we 
 might do these operations:
 a = load 'my_data' as (some_schema);
 b = foreach a generate ToJson(*);
 c = foreach a generate ToAvro(*);
 store b into 'hbase://JsonValueTable' using HBaseStorage(...);
 store c into 'hbase://AvroValueTable' using HBaseStorage(...);
 I'll make a ticket for pluggable serialization separately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3214) New/improved mascot

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3214:


Fix Version/s: (was: 0.12.0)
   0.13.0

 New/improved mascot
 ---

 Key: PIG-3214
 URL: https://issues.apache.org/jira/browse/PIG-3214
 Project: Pig
  Issue Type: Wish
  Components: site
Affects Versions: 0.11
Reporter: Andrew Musselman
Priority: Minor
 Fix For: 0.13.0

 Attachments: apache-pig-14.png, apache-pig-yellow-logo.png, 
 newlogo1.png, newlogo2.png, newlogo3.png, newlogo4.png, newlogo5.png, 
 new_logo_7.png, pig_6.JPG, pig_6_lc_g.JPG, pig-logo-10.png, pig-logo-11.png, 
 pig-logo-12.png, pig-logo-13.png, pig-logo-8a.png, pig-logo-8b.png, 
 pig-logo-9a.png, pig-logo-9b.png, pig_logo_new.png


 Request to change pig mascot to something more graphically appealing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3227) SearchEngineExtractor does not work for bing

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3227:


Fix Version/s: (was: 0.12.0)
   0.13.0

 SearchEngineExtractor does not work for bing
 

 Key: PIG-3227
 URL: https://issues.apache.org/jira/browse/PIG-3227
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.13.0

 Attachments: SearchEngineExtractor_Bing.patch


 org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchEngineExtractor
 Extracts a search engine from a URL, but it does not work for Bing

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3190) Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3190:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
 ---

 Key: PIG-3190
 URL: https://issues.apache.org/jira/browse/PIG-3190
 Project: Pig
  Issue Type: Bug
  Components: internal-udfs
Affects Versions: 0.11
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.13.0

 Attachments: PIG-3190-2.patch, PIG-3190-3.patch, PIG-3190.patch


 TOKENIZE is literally useless. The Lucene Standard/Snowball tokenizers, as 
 used by varaha, are much more useful for actual tasks: 
 https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3188) pig.script.submitted.timestamp not always consistent for jobs launched in a given script

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3188:


Fix Version/s: (was: 0.12.0)
   0.13.0

 pig.script.submitted.timestamp not always consistent for jobs launched in a 
 given script
 

 Key: PIG-3188
 URL: https://issues.apache.org/jira/browse/PIG-3188
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
Assignee: Bill Graham
 Fix For: 0.13.0


 {{pig.script.submitted.timestamp}} is set in 
 {{MapReduceLauncher.launchPig()}} when an MR plan is launched. Some 
 scripts (i.e. those with an exec in the middle) will cause multiple plans to 
 be launched. In these cases, jobs launched from the same script can have 
 different {{pig.script.submitted.timestamp}} values, which is a bug.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3228:


Fix Version/s: (was: 0.12.0)
   0.13.0

 SearchEngineExtractor throws an exception on a malformed URL
 

 Key: PIG-3228
 URL: https://issues.apache.org/jira/browse/PIG-3228
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.13.0

 Attachments: SearchEngineExtractor_Malformed.patch


 This UDF currently throws an exception on any MalformedURLException. The change 
 catches the exception and returns null instead, which is consistent with 
 SearchTermExtractor's handling of MalformedURLException.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3229:


Fix Version/s: (was: 0.12.0)
   0.13.0

 SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to 
 log exceptions
 ---

 Key: PIG-3229
 URL: https://issues.apache.org/jira/browse/PIG-3229
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.13.0

 Attachments: SearchEngineExtractor_Counter.patch, 
 SearchTermExtractor_Counter.patch


 SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and 
 return null
 They should log a counter of those errors
 The patch for SearchEngineExtractor is really only relevant if the following 
 bug is accepted
 https://issues.apache.org/jira/browse/PIG-3228

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (PIG-3469) Skewed join can cause unrecoverable NullPointerException when one of its inputs is missing.

2013-09-25 Thread Jarek Jarcec Cecho (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarek Jarcec Cecho reassigned PIG-3469:
---

Assignee: Jarek Jarcec Cecho

 Skewed join can cause unrecoverable NullPointerException when one of its 
 inputs is missing.
 ---

 Key: PIG-3469
 URL: https://issues.apache.org/jira/browse/PIG-3469
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
 Environment: Apache Pig version 0.11.0-cdh4.4.0
 Happens in both local execution environment (os x) and cluster environment 
 (linux)
Reporter: Christon DeWan
Assignee: Jarek Jarcec Cecho

 Run this script in the local execution environment (affects cluster mode too):
 {noformat}
 %declare DATA_EXISTS /tmp/test_data_exists.tsv
 %declare DATA_MISSING /tmp/test_data_missing.tsv
 %declare DUMMY `bash -c '(for (( i=0; \$i < 10; i++ )); do echo \$i; done) > /tmp/test_data_exists.tsv; true'`
 exists = LOAD '$DATA_EXISTS' AS (a:long);
 missing = LOAD '$DATA_MISSING' AS (a:long);
 missing = FOREACH ( GROUP missing BY a ) GENERATE $0 AS a, COUNT_STAR($1);
 joined = JOIN exists BY a, missing BY a USING 'skewed';
 STORE joined INTO '/tmp/test_out.tsv';
 {noformat}
 Results in NullPointerException which halts entire pig execution, including 
 unrelated jobs. Expected: only dependencies of the error'd LOAD statement 
 should fail. 
 Error:
 {noformat}
 2013-09-18 11:42:31,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2017: Internal error creating job configuration.
 2013-09-18 11:42:31,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException:
  ERROR 2017: Internal error creating job configuration.
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:848)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:294)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
   at org.apache.pig.PigServer.launchPlan(PigServer.java:1266)
   at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1251)
   at org.apache.pig.PigServer.execute(PigServer.java:1241)
   at org.apache.pig.PigServer.executeBatch(PigServer.java:335)
   at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:137)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
   at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
   at org.apache.pig.Main.run(Main.java:604)
   at org.apache.pig.Main.main(Main.java:157)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
 Caused by: java.lang.NullPointerException
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.adjustNumReducers(JobControlCompiler.java:868)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:480)
   ... 17 more
 {noformat}
 Script above is as small as I can make it while still reproducing the issue. 
 Removing the group-foreach causes the join to fail harmlessly (not stopping 
 pig execution), as does using the default join. Did not occur on 0.10.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3469) Skewed join can cause unrecoverable NullPointerException when one of its inputs is missing.

2013-09-25 Thread Jarek Jarcec Cecho (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778225#comment-13778225
 ] 

Jarek Jarcec Cecho commented on PIG-3469:
-

I believe that I have an understanding of this issue; I will upload a patch after 
running all tests.

 Skewed join can cause unrecoverable NullPointerException when one of its 
 inputs is missing.
 ---

 Key: PIG-3469
 URL: https://issues.apache.org/jira/browse/PIG-3469
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
 Environment: Apache Pig version 0.11.0-cdh4.4.0
 Happens in both local execution environment (os x) and cluster environment 
 (linux)
Reporter: Christon DeWan
Assignee: Jarek Jarcec Cecho

 Run this script in the local execution environment (affects cluster mode too):
 {noformat}
 %declare DATA_EXISTS /tmp/test_data_exists.tsv
 %declare DATA_MISSING /tmp/test_data_missing.tsv
 %declare DUMMY `bash -c '(for (( i=0; \$i < 10; i++ )); do echo \$i; done) > /tmp/test_data_exists.tsv; true'`
 exists = LOAD '$DATA_EXISTS' AS (a:long);
 missing = LOAD '$DATA_MISSING' AS (a:long);
 missing = FOREACH ( GROUP missing BY a ) GENERATE $0 AS a, COUNT_STAR($1);
 joined = JOIN exists BY a, missing BY a USING 'skewed';
 STORE joined INTO '/tmp/test_out.tsv';
 {noformat}
 Results in NullPointerException which halts entire pig execution, including 
 unrelated jobs. Expected: only dependencies of the error'd LOAD statement 
 should fail. 
 Error:
 {noformat}
 2013-09-18 11:42:31,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2017: Internal error creating job configuration.
 2013-09-18 11:42:31,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException:
  ERROR 2017: Internal error creating job configuration.
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:848)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:294)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
   at org.apache.pig.PigServer.launchPlan(PigServer.java:1266)
   at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1251)
   at org.apache.pig.PigServer.execute(PigServer.java:1241)
   at org.apache.pig.PigServer.executeBatch(PigServer.java:335)
   at 
 org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:137)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
   at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
   at org.apache.pig.Main.run(Main.java:604)
   at org.apache.pig.Main.main(Main.java:157)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
 Caused by: java.lang.NullPointerException
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.adjustNumReducers(JobControlCompiler.java:868)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:480)
   ... 17 more
 {noformat}
 Script above is as small as I can make it while still reproducing the issue. 
 Removing the group-foreach causes the join to fail harmlessly (not stopping 
 pig execution), as does using the default join. Did not occur on 0.10.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3452) Framework to fail-fast jobs based on exceptions in UDFs (useful for assert)

2013-09-25 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-3452:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Framework to fail-fast jobs based on exceptions in UDFs (useful for assert)
 ---

 Key: PIG-3452
 URL: https://issues.apache.org/jira/browse/PIG-3452
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.11.1
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
 Fix For: 0.13.0


 The idea is to add an exception type to Pig that UDFs can throw to indicate an 
 unexpected, unrecoverable problem. If we see n of these exceptions, the 
 Pig client can kill the job and abort itself. n can be configured via 
 configuration properties at runtime.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3421) Script jars should be added to extra jars instead of pig's job.jar

2013-09-25 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-3421:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Script jars should be added to extra jars instead of pig's job.jar
 --

 Key: PIG-3421
 URL: https://issues.apache.org/jira/browse/PIG-3421
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
 Fix For: 0.13.0


 Currently, for all the script engines, pig adds script jars to pig's job jar 
 even without consulting the skipJars list. Ideally, we should add these to 
 extraJars so that they can benefit from distributed cache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3427) Columns pruning does not work with DereferenceExpression

2013-09-25 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-3427:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Columns pruning does not work with DereferenceExpression
 

 Key: PIG-3427
 URL: https://issues.apache.org/jira/browse/PIG-3427
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
 Fix For: 0.13.0


 The following script does not push the projection down:
 {code}
 a = load 'something' as (a0, a1);
 b = group a all;
 c = foreach b generate COUNT(a.a0);
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (PIG-3300) Optimize partition filter pushdown

2013-09-25 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi resolved PIG-3300.
-

Resolution: Duplicate

 Optimize partition filter pushdown
 --

 Key: PIG-3300
 URL: https://issues.apache.org/jira/browse/PIG-3300
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11
Reporter: Rohini Palaniswamy
Assignee: Aniket Mokashi

 When there is a combined AND/OR condition involving both partition and 
 non-partition columns, such as 
 (pcond1 and npcond1) or (pcond2 and npcond2), push the partition filter down 
 to the LoadFunc as (pcond1 or pcond2). We will still apply the whole 
 filter condition to the loaded data. 
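
A sketch of the filter shape being described; the loader and column names are illustrative only:

{code}
-- dt is a partition column (pcond); clicks/views are regular columns (npcond)
a = LOAD 'db.table' USING MyPartitionAwareLoader();   -- hypothetical partition-aware LoadFunc
b = FILTER a BY (dt == '2013-09-24' AND clicks > 0) OR (dt == '2013-09-25' AND views > 0);
-- desired pushdown: send (dt == '2013-09-24' OR dt == '2013-09-25') to the loader,
-- then still apply the full filter to the rows that are loaded
{code}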

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps

2013-09-25 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2641:


Fix Version/s: (was: 0.12.0)
   0.13.0

 Create toJSON function for all complex types: tuples, bags and maps
 ---

 Key: PIG-2641
 URL: https://issues.apache.org/jira/browse/PIG-2641
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.12.0
 Environment: Foggy. Damn foggy.
Reporter: Russell Jurney
Assignee: Russell Jurney
  Labels: chararray, fun, happy, input, json, output, pants, pig, 
 piggybank, string, wonderdog
 Fix For: 0.13.0

 Attachments: PIG-2641-2.patch, PIG-2641-3.patch, PIG-2641-4.patch, 
 PIG-2641-5.patch, PIG-2641-6.patch, PIG-2641.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 It is a travesty that there are no UDFs in Piggybanks that, given an 
 arbitrary Pig datatype, return a JSON string of same. I intend to fix this 
 problem.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Subscription: PIG patch available

2013-09-25 Thread jira
Issue Subscription
Filter: PIG patch available (13 issues)

Subscriber: pigdaily

Key Summary
PIG-3470  Print configuration variables in grunt
          https://issues.apache.org/jira/browse/PIG-3470
PIG-3451  EvalFunc<T> ctor reflection to determine value of type param T is brittle
          https://issues.apache.org/jira/browse/PIG-3451
PIG-3449  Move JobCreationException to org.apache.pig.backend.hadoop.executionengine
          https://issues.apache.org/jira/browse/PIG-3449
PIG-3441  Allow Pig to use default resources from Configuration objects
          https://issues.apache.org/jira/browse/PIG-3441
PIG-3434  Null subexpression in bincond nullifies outer tuple (or bag)
          https://issues.apache.org/jira/browse/PIG-3434
PIG-3388  No support for Regex for row filter in org.apache.pig.backend.hadoop.hbase.HBaseStorage
          https://issues.apache.org/jira/browse/PIG-3388
PIG-3325  Adding a tuple to a bag is slow
          https://issues.apache.org/jira/browse/PIG-3325
PIG-3292  Logical plan invalid state: duplicate uid in schema during self-join to get cross product
          https://issues.apache.org/jira/browse/PIG-3292
PIG-3257  Add unique identifier UDF
          https://issues.apache.org/jira/browse/PIG-3257
PIG-3117  A debug mode in which pig does not delete temporary files
          https://issues.apache.org/jira/browse/PIG-3117
PIG-3088  Add a builtin udf which removes prefixes
          https://issues.apache.org/jira/browse/PIG-3088
PIG-3021  Split results missing records when there is null values in the column comparison
          https://issues.apache.org/jira/browse/PIG-3021
PIG-2672  Optimize the use of DistributedCache
          https://issues.apache.org/jira/browse/PIG-2672

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs

2013-09-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3370:
---

Fix Version/s: (was: 0.13.0)
   0.12.0

 Add New Reserved Keywords To The Pig Docs
 -

 Key: PIG-3370
 URL: https://issues.apache.org/jira/browse/PIG-3370
 Project: Pig
  Issue Type: Task
  Components: documentation, parser
Reporter: Sergey Goder
Assignee: Cheolsoo Park
Priority: Trivial
 Fix For: 0.12.0

 Attachments: PIG-3370-1.patch


 The following are reserved keywords in Pig that are not included in the 0.11.1 
 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords):
 cube, dense, rank, returns, rollup, void
 Please add any others that I may have overlooked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs

2013-09-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3370:
---

Status: Patch Available  (was: Open)

 Add New Reserved Keywords To The Pig Docs
 -

 Key: PIG-3370
 URL: https://issues.apache.org/jira/browse/PIG-3370
 Project: Pig
  Issue Type: Task
  Components: documentation, parser
Reporter: Sergey Goder
Assignee: Cheolsoo Park
Priority: Trivial
 Fix For: 0.12.0

 Attachments: PIG-3370-1.patch


 The following are reserved keywords in Pig that are not included in the 0.11.1 
 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords):
 cube, dense, rank, returns, rollup, void
 Please add any others that I may have overlooked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs

2013-09-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3370:
---

Attachment: PIG-3370-1.patch

The patch adds dense, returns, rollup, and void to the reserved keywords.

 Add New Reserved Keywords To The Pig Docs
 -

 Key: PIG-3370
 URL: https://issues.apache.org/jira/browse/PIG-3370
 Project: Pig
  Issue Type: Task
  Components: documentation, parser
Reporter: Sergey Goder
Assignee: Cheolsoo Park
Priority: Trivial
 Fix For: 0.13.0

 Attachments: PIG-3370-1.patch


 The following are reserved keywords in Pig that are not included in the 0.11.1 
 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords):
 cube, dense, rank, returns, rollup, void
 Please add any others that I may have overlooked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-3483) Document ASSERT keyword

2013-09-25 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-3483:
--

 Summary: Document ASSERT keyword
 Key: PIG-3483
 URL: https://issues.apache.org/jira/browse/PIG-3483
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.12.0
Reporter: Cheolsoo Park
 Fix For: 0.12.0


PIG-3367 added a new keyword ASSERT, so we need to document it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3367) Add assert keyword (operator) in pig

2013-09-25 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778454#comment-13778454
 ] 

Cheolsoo Park commented on PIG-3367:


[~aniket486], would you mind updating the documentation - PIG-3483?

 Add assert keyword (operator) in pig
 

 Key: PIG-3367
 URL: https://issues.apache.org/jira/browse/PIG-3367
 Project: Pig
  Issue Type: New Feature
  Components: parser
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
 Fix For: 0.12.0

 Attachments: PIG-3367-2.patch, PIG-3367.patch


 The assert operator can be used for data validation. With assert you can write 
 a script like the following:
 {code}
 a = load 'something' as (a0:int, a1:int);
 assert a by a0 > 0, 'a cant be negative for reasons';
 {code}
 This script will fail if the assertion is violated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-3484) Make the size of pig.script property configurable

2013-09-25 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-3484:
--

 Summary: Make the size of pig.script property configurable
 Key: PIG-3484
 URL: https://issues.apache.org/jira/browse/PIG-3484
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.13.0


Some applications (e.g. Lipstick) use the pig.script property to display the 
script. But since its size is limited by a hard-coded max, it's not always 
possible to store an entire script.

It would be nicer if the size of pig.script is configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3390) Make pig working with HBase 0.95

2013-09-25 Thread Ashish Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778475#comment-13778475
 ] 

Ashish Singh commented on PIG-3390:
---

It looks like this patch only works with HBase compiled against hadoop1, 
as there is no dependency defined for hbase-hadoop2-compat. 

 Make pig working with HBase 0.95
 

 Key: PIG-3390
 URL: https://issues.apache.org/jira/browse/PIG-3390
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.11
Reporter: Jarek Jarcec Cecho
Assignee: Jarek Jarcec Cecho
 Fix For: 0.12.0

 Attachments: PIG-3390.patch, PIG-3390.patch, PIG-3390.patch


 HBase 0.95 changed its API in an incompatible way. The following APIs that 
 {{HBaseStorage}} in Pig uses are no longer available:
 * {{Mutation.setWriteToWAL(Boolean)}}
 * {{Scan.write(DataOutput)}}
 In addition, HBase is no longer shipped as one monolithic archive with the 
 entire functionality; it was broken down into smaller pieces such as 
 {{hbase-client}}, {{hbase-server}}, ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3484) Make the size of pig.script property configurable

2013-09-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3484:
---

Attachment: PIG-3484-1.patch

The attached patch adds a new property pig.script.max.size. The default value 
is 10,240.
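
With that patch, the cap could presumably be raised per script or in pig.properties; a minimal sketch (assuming the property is read before pig.script is populated):

{code}
-- double the default 10,240-character limit so longer scripts fit into pig.script
set pig.script.max.size 20480;
{code}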

 Make the size of pig.script property configurable
 -

 Key: PIG-3484
 URL: https://issues.apache.org/jira/browse/PIG-3484
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.13.0

 Attachments: PIG-3484-1.patch


 Some applications (e.g. Lipstick) use the pig.script property to display the 
 script. But since its size is limited by a hard-coded max, it's not always 
 possible to store an entire script.
 It would be nicer if the size of pig.script is configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3484) Make the size of pig.script property configurable

2013-09-25 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3484:
---

Status: Patch Available  (was: Open)

 Make the size of pig.script property configurable
 -

 Key: PIG-3484
 URL: https://issues.apache.org/jira/browse/PIG-3484
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.13.0

 Attachments: PIG-3484-1.patch


 Some applications (e.g. Lipstick) use the pig.script property to display the 
 script. But since its size is limited by a hard-coded max, it's not always 
 possible to store an entire script.
 It would be nicer if the size of pig.script is configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-3485) Remove CastUtils.bytesToMap() method from LoadCaster interface

2013-09-25 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-3485:
--

 Summary: Remove CastUtils.bytesToMap() method from LoadCaster 
interface
 Key: PIG-3485
 URL: https://issues.apache.org/jira/browse/PIG-3485
 Project: Pig
  Issue Type: Task
  Components: impl
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.13.0


PIG-1876 added typed map and annotated the following method as {{deprecated}} 
in 0.9:
{code}
@Deprecated
public Map<String, Object> bytesToMap(byte[] b) throws IOException;
{code}
We should remove and replace it with the new method that takes type information:
{code}
public Map<String, Object> bytesToMap(byte[] b, ResourceFieldSchema 
fieldSchema) throws IOException;
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


  1   2   >