[jira] [Updated] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders
[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerry Chen updated PIG-3404: Attachment: PIG-3404.patch Patch for reference Improve Pig to ignore bad files or inaccessible files or folders Key: PIG-3404 URL: https://issues.apache.org/jira/browse/PIG-3404 Project: Pig Issue Type: New Feature Components: data Affects Versions: 0.11.2 Reporter: Jerry Chen Labels: Rhino Attachments: PIG-3404.patch There are use cases in Pig: * A directory is used as the input of a load operation. It is possible that one or more files in that directory are bad (for example, corrupted, or containing bad data caused by compression). * A directory is used as the input of a load operation. The current user may not have permission to access some subdirectories or files of that directory. The current Pig implementation aborts the whole job in such cases. It would be useful to have an option that allows the job to continue, ignoring the bad or inaccessible files/folders without aborting, and ideally logging or printing a warning for each such error or violation. This requirement is not trivial: for the large data sets of big analytics applications, it is not always possible to sort out the good data in advance; ignoring a few bad files may be the better choice in such situations. We propose an “ignore bad files” flag to address this problem. AvroStorage and related file formats in Pig already have this flag, but it does not cover all the cases mentioned above. We would improve PigStorage and the related text formats to support this new flag, and also improve AvroStorage and related facilities to support the concept completely. The flag is per-storage (for example, PigStorage or AvroStorage) and can be set for each load operation separately. The value of this flag is false if it is not explicitly set.
Ideally, we can also provide a global Pig parameter that forces the default value to true for all load functions even when it is not explicitly set in the LOAD statement. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
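A load using the proposed per-storage flag might look roughly like the following sketch (the constructor option string and the global parameter name are purely illustrative; neither is a committed Pig API):

```pig
-- hypothetical: per-load flag passed as a storage constructor option
A = LOAD '/data/input' USING PigStorage('\t', '-ignoreBadFiles')
    AS (id:chararray, cnt:long);

-- hypothetical: global parameter flipping the default to true for all loads
-- pig -Dpig.load.ignore.bad.files=true script.pig
```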
[jira] [Commented] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders
[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777186#comment-13777186 ] Jerry Chen commented on PIG-3404: - Hi Park, sorry for the late response; I am glad that we can discuss this topic further in this JIRA. As mentioned in the JIRA description, we are taking the approach of an “ignore bad files” flag for each storage, so different storages can be controlled separately instead of through a global flag. On the other hand, in our use cases we also want to handle the situation where the current user has no permission to access some subdirectories of the input directory, which can be regarded conceptually as a “bad directory”. Another point is the ignore ratio. We currently take the even simpler approach of “ignore all” or “ignore nothing” via a flag. As you mentioned, PIG-3059 uses a threshold to control how many bad input splits can be ignored, which is a good thing. The question, though, is: how many real-world cases need a ratio other than 0 or 1? I went through the patch in PIG-3059, trying to understand how the ratio is enforced globally in a distributed MapReduce task environment. It seems that in InputErrorTracker.java you use a local variable (numErrors) for error tracking, so the ratio appears to be tracked per task. I may be missing something, but it would be very helpful if you could explain. Thank you for providing the helpful information; let's continue the discussion.
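The per-task tracking question raised in the comment above can be illustrated with a small stand-alone sketch (plain Java, not PIG-3059's actual InputErrorTracker code): each task counts records and errors in local variables, so the configured threshold bounds the error fraction per task rather than globally across the job.

```java
// Simplified, illustrative per-task error tracker (not Pig's actual class).
// A small grace count avoids failing on the very first error before the
// ratio is meaningful.
public class ErrorTracker {
    private final double maxErrorFraction; // 0.0 = ignore nothing, 1.0 = ignore all
    private final long minErrors;          // grace count before the ratio is enforced
    private long records = 0;
    private long errors = 0;

    public ErrorTracker(double maxErrorFraction, long minErrors) {
        this.maxErrorFraction = maxErrorFraction;
        this.minErrors = minErrors;
    }

    /** Call once per record read by this task. */
    public void incRecords() { records++; }

    /** Returns true if the error is tolerable; false means the task should abort. */
    public boolean incErrors() {
        errors++;
        if (errors <= minErrors) return true;
        return records > 0 && (double) errors / records <= maxErrorFraction;
    }
}
```

Because `records` and `errors` live inside one task, two tasks with very different error rates each apply the threshold only to their own input, which is the behavior the comment asks about.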
[jira] [Created] (PIG-3481) Unable to check name message
zhenyingshan created PIG-3481: - Summary: Unable to check name message Key: PIG-3481 URL: https://issues.apache.org/jira/browse/PIG-3481 Project: Pig Issue Type: Bug Reporter: zhenyingshan I am trying to run a Pig script from a Java program; I get the following error sometimes, but not all the time. Here is the snippet of the program and the exception I've got. I have the /user/root directory created in HDFS.
--
URL path = getClass().getClassLoader().getResource("cfg/concatall.py");
LOG.info("CDNResolve2Hbase: reading concatall.py file from " + path.toString());
pigServer.getPigContext().getProperties().setProperty(PigContext.JOB_NAME, "CDNResolve2Hbase");
pigServer.registerQuery("A = load '" + inputPath + "' using PigStorage('\\t') as (ip:chararray, do:chararray, cn:chararray, cdn:chararray, firsttime:chararray, updatetime:chararray);");
pigServer.registerCode(path.toString(), "jython", "myfunc");
pigServer.registerQuery("B = foreach A generate myfunc.concatall('" + extractTimestamp(inputPath) + "', ip, do, cn), cdn, SUBSTRING(firsttime, 0, 8);");
outputTable = "hbase://" + outputTable;
ExecJob job = pigServer.store("B", outputTable, "org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:cdn d:dtime')");
-
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Unable to check name hdfs://DC-001:9000/user/root
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:529)
    at com.hugedata.cdnserver.datanalysis.CDNResolve2Hbase.execute(Unknown Source)
    at com.hugedata.cdnserver.DatAnalysis.cdnResolve2Hbase(Unknown Source)
    at com.hugedata.cdnserver.task.HandleDomainNameLogTask.execute(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:273)
    at org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean$MethodInvokingJob.executeInternal(MethodInvokingJobDetailFactoryBean.java:264)
    at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:203)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:520)
Caused by: Failed to parse: Pig script failed to parse: line 6, column 4 pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://DC-001:9000/user/root
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
    ... 15 more
Caused by: line 6, column 4 pig script failed to validate: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://DC-001:9000/user/root
    at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:835)
    at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3236)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1315)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:799)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:517)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:392)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
    ... 16 more
Caused by: org.apache.pig.backend.datastorage.DataStorageException: ERROR 6007: Unable to check name hdfs://DC-001:9000/user/root
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:207)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:128)
    at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:138)
    at org.apache.pig.parser.QueryParserUtils.getCurrentDir(QueryParserUtils.java:91)
    at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:827)
    ... 22 more
Caused by: java.io.IOException: Filesystem closed
[jira] [Updated] (PIG-3481) Unable to check name message
[ https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenyingshan updated PIG-3481: -- Affects Version/s: 0.11.1
[jira] [Updated] (PIG-3481) Unable to check name message
[ https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenyingshan updated PIG-3481: -- Environment: hadoop 1.0.3, hbase 0.94.1
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777343#comment-13777343 ] Brian ONeill commented on PIG-3453: --- First question: for DISTINCT within Storm, do you believe we should have a sliding time window within which we perform the distinct? There is mention that it will be stateful (since we need to keep a set in memory with which to de-dupe). Do we intend to leverage the concept of Trident State for this? (That may make sense: implement State, then on each commit/flush perform the de-duping.) Thoughts? Implement a Storm backend to Pig Key: PIG-3453 URL: https://issues.apache.org/jira/browse/PIG-3453 Project: Pig Issue Type: New Feature Reporter: Pradeep Gollakota Labels: storm There is a lot of interest around implementing a Storm backend to Pig for stream processing. The proposal and initial discussions can be found at https://cwiki.apache.org/confluence/display/PIG/Pig+on+Storm+Proposal
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777355#comment-13777355 ] Brian ONeill commented on PIG-3453: --- Also, we could perform DISTINCT using a backend storage mechanism (like Cassandra), where we first check storage to see if the tuple exists and emit only if it does not. If we first route all identical tuples to a single bolt and then check from there, that may work (eliminating the potential for two bolts to check for existence at the same time). Using backend storage would allow someone to perform a true DISTINCT operation.
[jira] [Resolved] (PIG-3481) Unable to check name message
[ https://issues.apache.org/jira/browse/PIG-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhenyingshan resolved PIG-3481. --- Resolution: Fixed Fix Version/s: 0.11.1 It turns out my static PigServer was shared across threads, and PigServer is not thread-safe. I switched to a ThreadLocal and the problem does not appear anymore.
--
private static ThreadLocal<PigServer> pigServer = new ThreadLocal<PigServer>();

public static PigServer getServer() {
    if (pigServer.get() == null) {
        try {
            printClassPath();
            Properties prop = SystemUtils.getCfg();
            pigServer.set(new PigServer(ExecType.MAPREDUCE, prop));
            return pigServer.get();
        } catch (Exception e) {
            LOG.error("error in starting PigServer:", e);
            return null;
        }
    }
    return pigServer.get();
}
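The fix above relies on the standard ThreadLocal pattern; a minimal stand-alone illustration (using StringBuilder as a stand-in for the non-thread-safe PigServer) shows why each thread ends up with its own lazily created instance:

```java
// Each thread that calls get() receives its own instance, created on first
// access; no instance is ever shared between threads.
public class PerThreadHolder {
    private static final ThreadLocal<StringBuilder> holder =
        ThreadLocal.withInitial(StringBuilder::new); // stand-in for new PigServer(...)

    public static StringBuilder get() {
        return holder.get(); // lazily initialized per thread
    }
}
```

This is the same shape as the resolved code: the explicit null check plus `set(...)` there is equivalent to the `withInitial` supplier here.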
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777408#comment-13777408 ] Jacob Perkins commented on PIG-3453: [~boneill], I haven't thought too hard about distinct yet myself. Since I'm really only thinking about Trident and not Storm in general, doing a distinct strictly within a batch is one straightforward option. Unfortunately, from a user standpoint, I think this would be (a) minimally useful and (b) confusing. Instead we could implement something like an approximate distinct using an LRU cache? Maybe even go so far as to implement a SQF (which I haven't read in its entirety yet): http://www.vldb.org/pvldb/vol6/p589-dutta.pdf? Also, what about order by? In what sense is an unbounded stream ordered? I absolutely do not want to tie the storm/trident execution engine to an external data store such as Cassandra. Pig is supposed to be backend agnostic. Maybe the -default- tap and sink can be Kafka (tap) and Cassandra (sink). Finally, it should be possible to run a pig script in storm local mode. And [~pradeepg26], I'm actually well on the way to having nested foreach working. The way I'm working it now, each LogicalExpressionPlan becomes its own Trident BaseFunction. It actually works quite nicely for now. I haven't gotten to aggregates yet. What I probably won't implement for the POC is the tap and sink.
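The LRU-cache idea mentioned above can be sketched in a few lines of plain Java (an illustration of the approach, not proposed Pig code): remember only the most recently seen keys, so the distinct is exact within the retained window and approximate beyond it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Approximate DISTINCT over a stream: duplicates arriving while their key
// is still in the LRU window are suppressed; very old repeats slip through.
public class ApproxDistinct<T> {
    private final Map<T, Boolean> seen;

    public ApproxDistinct(final int capacity) {
        // access-ordered LinkedHashMap that evicts the least recently used key
        this.seen = new LinkedHashMap<T, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<T, Boolean> e) {
                return size() > capacity;
            }
        };
    }

    /** Returns true if the tuple should be emitted (not seen recently). */
    public boolean offer(T key) {
        return seen.put(key, Boolean.TRUE) == null;
    }
}
```

Bounded memory is the trade-off: capacity caps state per task, at the cost of occasionally re-emitting a key that was evicted.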
[jira] [Commented] (PIG-3453) Implement a Storm backend to Pig
[ https://issues.apache.org/jira/browse/PIG-3453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777421#comment-13777421 ] Brian ONeill commented on PIG-3453: --- [~thedatachef] Good points/suggestions. I'll have a look at both LRU and SQF. RE: Cassandra. Sorry, I didn't mean to imply we would create a hard dependency. I meant we could leverage the Trident State abstraction. (My team happens to own the storm-cassandra Trident State implementation: https://github.com/hmsonline/storm-cassandra.) We would query the State to see if the tuple was processed. You could just as easily plug in any persistence mechanism (e.g. https://github.com/nathanmarz/trident-memcached).
[jira] [Updated] (PIG-3458) ScalarExpression lost with multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3458: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Patch committed to branch-0.12 and trunk. Thanks Mark and Daniel for your feedback! ScalarExpression lost with multiquery optimization -- Key: PIG-3458 URL: https://issues.apache.org/jira/browse/PIG-3458 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Assignee: Koji Noguchi Fix For: 0.12.0 Attachments: pig-3458-v01.patch, pig-3458-v02.patch Our user reported an issue where their scalar result goes missing when there are two store statements. {noformat} A = load 'test1.txt' using PigStorage('\t') as (a:chararray, count:long); B = group A all; C = foreach B generate SUM(A.count) as total; store C into 'deleteme6_C' using PigStorage(','); Z = load 'test2.txt' using PigStorage('\t') as (a:chararray, id:chararray); Y = group Z by id; X = foreach Y generate group, C.total; store X into 'deleteme6_X' using PigStorage(','); {noformat} Inputs: {noformat} pig cat test1.txt a 1 b 2 c 8 d 9 pig cat test2.txt a z b y c x {noformat} Result X should contain the total count of '20', but instead it's empty: {noformat} pig cat deleteme6_C/part-r-0 20 pig cat deleteme6_X/part-r-0 x, y, z, {noformat} This works if we take out the first store C statement.
[jira] [Updated] (PIG-3458) ScalarExpression lost with multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3458: -- Component/s: parser
[jira] [Updated] (PIG-19) A=load causes parse error
[ https://issues.apache.org/jira/browse/PIG-19?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-19: -- Fix Version/s: (was: 0.12.0) 0.13.0 A=load causes parse error - Key: PIG-19 URL: https://issues.apache.org/jira/browse/PIG-19 Project: Pig Issue Type: Bug Components: grunt Reporter: Olga Natkovich Assignee: Gianmarco De Francisci Morales Priority: Minor Fix For: 0.13.0 Parser expects spaces around =. This should be a minor change in src/org/apache/pig/tools/grunt/GruntParser.jj -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
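The fix asked for here is tolerance for missing whitespace around `=` in an alias assignment. A minimal Python sketch of the behavior (the helper name is hypothetical; the real change would live in GruntParser.jj):

```python
def split_assignment(line):
    """Split a Grunt statement like "A=load ..." into (alias, rest),
    tolerating missing spaces around '='. Hypothetical helper, not the
    actual GruntParser.jj grammar change."""
    alias, _, rest = line.partition("=")
    return alias.strip(), rest.strip()

print(split_assignment("A=load 'data';"))    # ('A', "load 'data';")
print(split_assignment("A = load 'data';"))  # ('A', "load 'data';")
```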
[jira] [Updated] (PIG-1151) Date Conversion + Arithmetic UDFs
[ https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1151: Fix Version/s: (was: 0.12.0) 0.13.0 Date Conversion + Arithmetic UDFs Key: PIG-1151 URL: https://issues.apache.org/jira/browse/PIG-1151 Project: Pig Issue Type: New Feature Reporter: sam rash Priority: Minor Labels: patch Fix For: 0.13.0 Attachments: patch_dateudf.tar.gz I would like to offer up some very simple date UDFs I have that wrap JodaTime (Apache 2.0 license, http://joda-time.sourceforge.net/license.html) and operate on ISO8601 date strings (for piggybank). Please advise if these are appropriate. 1. Date arithmetic: takes an input string such as 2009-01-01T13:43:33.000Z (and partial ones such as 2009-01-02) and a timespan (as millis or as string shorthand), and returns an ISO8601 string that adjusts the input date by the specified timespan. DatePlus(long timeMs); // + or - number works, is the # of millis DatePlus(String timespan); // 10m = 10 minutes, 1h = 1 hour, 1172 ms, etc DateMinus(String timespan); // propose explicit minus when using string shorthand for time periods 2. Date comparison (when you don't have full strings that you can use string compare with): DateIsBefore(String dateString); // true if lhs is before rhs DateIsAfter(String dateString); // true if lhs is after rhs 3. Date trunc functions: take partial ISO8601 strings and truncate to: toMinute(String dateString); toHour(String dateString); toDay(String dateString); toWeek(String dateString); toMonth(String dateString); toYear(String dateString); If any/all are helpful, I'm happy to contribute to pig.
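The timespan shorthand and truncation semantics proposed above can be sketched in Python. The unit table and function names are assumptions based only on the examples in the proposal (`10m`, `1h`, `toDay`), not the actual piggybank API:

```python
import re

def timespan_to_millis(spec):
    """Parse shorthand like '10m', '1h', or '1172ms' into milliseconds.
    Units are assumptions from the examples in the proposal."""
    units = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}
    m = re.fullmatch(r"([+-]?\d+)(ms|s|m|h)", spec)
    if not m:
        raise ValueError("unrecognized timespan: " + spec)
    return int(m.group(1)) * units[m.group(2)]

def to_day(date_string):
    """Truncate a (partial) ISO8601 string to the day,
    e.g. '2009-01-01T13:43:33.000Z' -> '2009-01-01'."""
    return date_string[:10]

print(timespan_to_millis("10m"))               # 600000
print(timespan_to_millis("1h"))                # 3600000
print(to_day("2009-01-01T13:43:33.000Z"))      # 2009-01-01
```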
[jira] [Updated] (PIG-1967) deprecate current syntax for casting relation as scalar, to use explicit cast to tuple
[ https://issues.apache.org/jira/browse/PIG-1967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1967: Fix Version/s: (was: 0.12.0) 0.13.0 deprecate current syntax for casting relation as scalar, to use explicit cast to tuple Key: PIG-1967 URL: https://issues.apache.org/jira/browse/PIG-1967 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0, 0.9.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.13.0 Attachments: PIG-1967.0.patch When the feature was added in PIG-1434, there was a proposal to cast the relation to a tuple in order to use it as a scalar, but for some reason this cast was not required in the implementation. See - https://issues.apache.org/jira/browse/PIG-1434?focusedCommentId=12888449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12888449 The current syntax, which does not need this cast, seems to lead to a lot of confusion among users, who end up using this feature unintentionally. This usually happens because the user refers to the bag column(s) in the output of (co)group using a wrong name, which happens to be another relation. Often, users realize the mistake only at runtime, and new users will have trouble figuring out what went wrong. I think we should support the use of the cast as originally proposed, and deprecate the current syntax. The warning issued when the deprecated syntax is used is likely to help users realize that they have unintentionally used this feature.
[jira] [Updated] (PIG-1919) order-by on bag gives error only at runtime
[ https://issues.apache.org/jira/browse/PIG-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1919: Fix Version/s: (was: 0.12.0) 0.13.0 order-by on bag gives error only at runtime --- Key: PIG-1919 URL: https://issues.apache.org/jira/browse/PIG-1919 Project: Pig Issue Type: Bug Affects Versions: 0.8.0, 0.9.0 Reporter: Thejas M Nair Assignee: Jonathan Coveney Fix For: 0.13.0 Attachments: PIG-1919-0.patch, PIG-1919-1.patch, PIG-1919-1.patch Order-by on a bag or tuple should give error at query compile time, instead of giving an error at runtime. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2409) Pig show wrong tracking URL for hadoop 2
[ https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2409: Summary: Pig show wrong tracking URL for hadoop 2 (was: Tracking URL for hadoop 23 does not show up) Pig show wrong tracking URL for hadoop 2 Key: PIG-2409 URL: https://issues.apache.org/jira/browse/PIG-2409 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.9.2, 0.10.0, 0.11 Reporter: Daniel Dai Assignee: Daniel Dai Priority: Minor Labels: hadoop023 Fix For: 0.12.0 Pig used to show a tracking url for hadoop job: More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201112071119_0001 This information does not show up in hadoop 23. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2122) Parameter Substitution doesn't work in the Grunt shell
[ https://issues.apache.org/jira/browse/PIG-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2122: Fix Version/s: (was: 0.12.0) 0.13.0 Parameter Substitution doesn't work in the Grunt shell Key: PIG-2122 URL: https://issues.apache.org/jira/browse/PIG-2122 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.8.0, 0.8.1, 0.12.0 Reporter: Grant Ingersoll Assignee: Daniel Dai Priority: Minor Fix For: 0.13.0 Simple param substitution and things like %declare (as copied out of the docs) don't work in the grunt shell. Start Pig with: bin/pig -x local -p time=FOO {quote} foo = LOAD '/user/grant/foo.txt' AS (a:chararray, b:chararray, c:chararray); Y = foreach foo generate *, '$time'; dump Y; {quote} Output: {quote} 2011-06-13 20:22:24,197 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 (1 2 3,,,$time) (4 5 6,,,$time) {quote} The same script, stored in junk.pig and run as bin/pig -x local -p time=FOO junk.pig, works: {quote} 2011-06-13 20:23:38,864 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 (1 2 3,,,FOO) (4 5 6,,,FOO) {quote} Also, things like %default don't work (nor does %declare): {quote} grunt %default DATE '20090101'; 2011-06-13 20:18:19,943 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered PATH %default at line 1, column 1. Was expecting one of: EOF cat ... fs ... sh ... cd ... cp ... copyFromLocal ... copyToLocal ... dump ... describe ... aliases ... explain ... help ... kill ... ls ... mv ... mkdir ... pwd ... quit ... register ... rm ... rmf ... set ... illustrate ... run ... exec ... scriptDone ... ... EOL ... ; ... Details at logfile: /Users/grant.ingersoll/projects/apache/pig/release-0.8.1/pig_1308002917912.log {quote}
[jira] [Updated] (PIG-2247) Pig parser does not detect multiple arguments with the same name passed to macro
[ https://issues.apache.org/jira/browse/PIG-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2247: Fix Version/s: (was: 0.12.0) 0.13.0 Pig parser does not detect multiple arguments with the same name passed to macro Key: PIG-2247 URL: https://issues.apache.org/jira/browse/PIG-2247 Project: Pig Issue Type: Bug Components: parser Affects Versions: 0.9.0 Reporter: Alan Gates Assignee: Johnny Zhang Priority: Minor Fix For: 0.13.0 Attachments: PIG-2247.patch.txt Pig accepts a macro like {code} define simple_macro(in_relation, min_gpa, min_gpa) returns c { b = filter $in_relation by gpa = $min_gpa; $c = foreach b generate age, name; }; {code} This should produce an error.
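The missing validation amounts to rejecting duplicate names in a macro's parameter list. A minimal Python sketch of that check (the function name is hypothetical, not Pig's actual parser code):

```python
def check_macro_params(params):
    """Return the duplicated parameter names the parser should reject.
    A sketch of the missing validation, not Pig's parser itself."""
    seen, dups = set(), []
    for p in params:
        if p in seen and p not in dups:
            dups.append(p)
        seen.add(p)
    return dups

# The macro from this report declares min_gpa twice.
print(check_macro_params(["in_relation", "min_gpa", "min_gpa"]))  # ['min_gpa']
```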
[jira] [Updated] (PIG-2409) Pig show wrong tracking URL for hadoop 2
[ https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2409: Fix Version/s: (was: 0.12.0) 0.13.0
[jira] [Commented] (PIG-2409) Pig show wrong tracking URL for hadoop 2
[ https://issues.apache.org/jira/browse/PIG-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777891#comment-13777891 ] Daniel Dai commented on PIG-2409: - Hadoop 2 shows the right tracking url now. However, Pig will print a redundant message which contains a wrong url. We need to remove it in Pig on Hadoop 2.
[jira] [Updated] (PIG-2495) Using merge JOIN from a HBaseStorage produces an error
[ https://issues.apache.org/jira/browse/PIG-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2495: Fix Version/s: (was: 0.12.0) 0.13.0 Using merge JOIN from a HBaseStorage produces an error -- Key: PIG-2495 URL: https://issues.apache.org/jira/browse/PIG-2495 Project: Pig Issue Type: Bug Affects Versions: 0.9.1, 0.9.2 Environment: HBase 0.90.3, Hadoop 0.20-append Reporter: Kevin Lion Assignee: Kevin Lion Fix For: 0.13.0 Attachments: PIG-2495.patch To increase performance of my computation, I would like to use a merge join between two tables to increase speed computation but it produces an error. Here is the script: {noformat} start_sessions = LOAD 'hbase://startSession.bea00.dev.ubithere.com' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:infoid meta:imei meta:timestamp', '-loadKey') AS (sid:chararray, infoid:chararray, imei:chararray, start:long); end_sessions = LOAD 'hbase://endSession.bea00.dev.ubithere.com' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('meta:timestamp meta:locid', '-loadKey') AS (sid:chararray, end:long, locid:chararray); sessions = JOIN start_sessions BY sid, end_sessions BY sid USING 'merge'; STORE sessions INTO 'sessionsTest' USING PigStorage ('*'); {noformat} Here is the result of this script : {noformat} 2012-01-30 16:12:43,920 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig_1327939963919.log 2012-01-30 16:12:44,025 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://lxc233:9000 2012-01-30 16:12:44,102 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: lxc233:9001 2012-01-30 16:12:44,760 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: MERGE_JION 2012-01-30 16:12:44,923 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation 
threshold: 100 optimistic? false 2012-01-30 16:12:44,982 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2 2012-01-30 16:12:44,982 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2 2012-01-30 16:12:45,001 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job 2012-01-30 16:12:45,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:host.name=lxc233.machine.com 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.6.0_22 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Sun Microsystems Inc. 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/java-6-sun-1.6.0.22/jre 2012-01-30 16:12:45,039 [main] INFO org.apache.zookeeper.ZooKeeper - Client
[jira] [Updated] (PIG-3021) Split results missing records when there are null values in the column comparison
[ https://issues.apache.org/jira/browse/PIG-3021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3021: --- Fix Version/s: (was: 0.12.0) 0.13.0 Moving it to 0.13 since the 0.12 branch is already cut. Split results missing records when there are null values in the column comparison Key: PIG-3021 URL: https://issues.apache.org/jira/browse/PIG-3021 Project: Pig Issue Type: Bug Affects Versions: 0.10.0 Reporter: Chang Luo Assignee: Cheolsoo Park Fix For: 0.13.0 Attachments: PIG-3021-2.patch, PIG-3021-3.patch, PIG-3021.patch Suppose a(x, y); split a into b if x==y, c otherwise; One would expect the union of b and c to be a. However, if x or y is null, the record won't appear in either b or c. To work around this, I have to change the statement to the following: split a into b if x is not null and y is not null and x==y, c otherwise;
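The surprise here comes from null-comparison semantics: a comparison involving null is neither true nor false, so a null row satisfies neither the IF branch nor OTHERWISE. A Python model of the observed behavior (a sketch of the semantics, not Pig's implementation):

```python
def split_with_nulls(rows):
    """Model SPLIT a INTO b IF x == y, c OTHERWISE with SQL-style null
    semantics: a comparison with null is neither true nor false, so the
    row lands in neither output. Sketch of the reported behavior."""
    b, c = [], []
    for x, y in rows:
        cond = None if (x is None or y is None) else (x == y)
        if cond is True:
            b.append((x, y))
        elif cond is False:
            c.append((x, y))
        # cond is None: dropped from both -- the surprise in this report
    return b, c

rows = [(1, 1), (1, 2), (None, 3)]
b, c = split_with_nulls(rows)
print(b, c)  # [(1, 1)] [(1, 2)] -- the (None, 3) row vanished
```

The workaround in the report adds explicit `is not null` guards so null rows fall into the OTHERWISE branch instead.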
[jira] [Updated] (PIG-2461) Simplify schema syntax for cast
[ https://issues.apache.org/jira/browse/PIG-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2461: Fix Version/s: (was: 0.12.0) 0.13.0 Simplify schema syntax for cast Key: PIG-2461 URL: https://issues.apache.org/jira/browse/PIG-2461 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.10.0 Reporter: Daniel Dai Fix For: 0.13.0 The syntax for casting into a bag/tuple is confusing: {code} b = foreach a generate (bag{tuple(int,double)})bag0; {code} It's pretty hard for users to get right. We should make the keywords bag/tuple optional.
[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs
[ https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3370: --- Fix Version/s: (was: 0.12.0) 0.13.0 Moving it to 0.13 for now. Add New Reserved Keywords To The Pig Docs Key: PIG-3370 URL: https://issues.apache.org/jira/browse/PIG-3370 Project: Pig Issue Type: Task Components: documentation, parser Reporter: Sergey Goder Assignee: Cheolsoo Park Priority: Trivial Fix For: 0.13.0 The following are reserved keywords in Pig that are not included in the 0.11.1 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords): cube, dense, rank, returns, rollup, void Please add any that I may have overlooked.
[jira] [Updated] (PIG-2446) Fix map input bytes for hadoop 20.203+
[ https://issues.apache.org/jira/browse/PIG-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2446: Fix Version/s: (was: 0.12.0) 0.13.0 Fix map input bytes for hadoop 20.203+ Key: PIG-2446 URL: https://issues.apache.org/jira/browse/PIG-2446 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.9.2, 0.10.0, 0.11 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.13.0 As of hadoop 20.203+, HDFS_BYTES_READ changed its meaning: it no longer means the size of the input files, but the total hdfs bytes read by the job. Pig needs a way to get the map input bytes to retain the old behavior. TestPigRunner.testGetHadoopCounters tests that and is temporarily disabled for hadoop 203+.
[jira] [Updated] (PIG-2584) Command line arguments for Pig script
[ https://issues.apache.org/jira/browse/PIG-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2584: Fix Version/s: (was: 0.12.0) 0.13.0 Command line arguments for Pig script Key: PIG-2584 URL: https://issues.apache.org/jira/browse/PIG-2584 Project: Pig Issue Type: Improvement Components: impl Reporter: Daniel Dai Priority: Minor Fix For: 0.13.0 We did that for Jython embedded scripts. It is also useful in Pig scripts themselves: command line: pig a.pig student.txt output a.pig: a = load '$1' as (a0, a1); store a into '$2';
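The proposed behavior is plain positional substitution: `$1`, `$2`, ... in the script are replaced by command-line arguments. A Python sketch of that substitution step (the helper name is hypothetical):

```python
import re

def substitute_args(script, args):
    """Replace $1, $2, ... in a Pig script with positional command-line
    arguments, as proposed in this issue. Hypothetical helper."""
    return re.sub(r"\$(\d+)", lambda m: args[int(m.group(1)) - 1], script)

script = "a = load '$1' as (a0, a1); store a into '$2';"
print(substitute_args(script, ["student.txt", "output"]))
# a = load 'student.txt' as (a0, a1); store a into 'output';
```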
[jira] [Updated] (PIG-2521) explicit reference to namenode path with streaming results in an error
[ https://issues.apache.org/jira/browse/PIG-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2521: Fix Version/s: (was: 0.12.0) 0.13.0 explicit reference to namenode path with streaming results in an error -- Key: PIG-2521 URL: https://issues.apache.org/jira/browse/PIG-2521 Project: Pig Issue Type: Bug Affects Versions: 0.9.2 Reporter: Araceli Henley Priority: Minor Fix For: 0.13.0 I set this to minor because this test works with client side tables and with old style references. :: /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327441396/dotNext_baseline_15.pig :: THIS TEST FAILS. It uses an explicit reference to namenode1 (hdfs://namenode1.domain.com:8020) define CMD `perl PigStreamingDepend.pl` input(stdin) ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl', '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm'); A = load 'hdfs://namdenode1.domain.com:8020/user/hadoopqa/pig/tests/data'; B = stream A through `perl PigStreaming.pl`; C = stream B through CMD as (name, age, gpa); D = foreach C generate name, age; store D into 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out'; fs -cp hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out /user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_15.out :: /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327441396/dotNext_baseline_1.pig :: This test PASSES. 
It uses an explicit reference to NN1(hdfs://namenode1.domain.com:8020) for load and store a = load 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); store a into 'hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out' ; fs -cp hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/out1/user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out /user/hadoopqa/pig/out/hadoopqa.1327441396/dotNext_baseline_1.out THE REMAINING TESTS ARE IDENTICAL EXCEPT FOR THE FILE REFERNCE: explicit vs mount point :: /grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327433551/dotNext_baseline_15.pig :: This test PASSES. Its the baseline for the test, it uses old style references. define CMD `perl PigStreamingDepend.pl` input(stdin) ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl', '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm'); A = load '/user/hadoopqa/pig/tests/data'; B = stream A through `perl PigStreaming.pl`; C = stream B through CMD as (name, age, gpa); D = foreach C generate name, age; store D into '/user/hadoopqa/pig/out/hadoopqa.1327433551/dotNext_baseline_15.out'; :: grid/2/dev/pigqa/out/pigtest/hadoopqa/hadoopqa.1327431567/dotNext_baseline_15.pig :: This test PASSES. It uses a mount point to namenode 1( /data1 is a mount point for hdfs://namenode1.domain.com:8020/user/hadoopqa/pig/tests/data). 
define CMD `perl PigStreamingDepend.pl` input(stdin) ship('/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingDepend.pl', '/homes/araceli/pigtest/pigtest_next/pigharness/dist/pig_harness/libexec/PigTest/PigStreamingModule.pm'); A = load '/data1'; B = stream A through `perl PigStreaming.pl`; C = stream B through CMD as (name, age, gpa); D = foreach C generate name, age; store D into '/out1/user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out'; fs -cp /out1/user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out /user/hadoopqa/pig/out/hadoopqa.1327431567/dotNext_baseline_15.out -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2537) Output from flatten with a null tuple input generating data inconsistent with the schema
[ https://issues.apache.org/jira/browse/PIG-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2537: Fix Version/s: (was: 0.12.0) 0.13.0 Output from flatten with a null tuple input generating data inconsistent with the schema Key: PIG-2537 URL: https://issues.apache.org/jira/browse/PIG-2537 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0, 0.9.0 Reporter: Xuefu Zhang Assignee: Daniel Dai Fix For: 0.13.0 Attachments: PIG-2537-1.patch, PIG-2537-2.patch, PIG-2537-3.patch For the following pig script: grunt A = load 'file' as ( a : tuple( x, y, z ), b, c ); grunt B = foreach A generate flatten( $0 ), b, c; grunt describe B; B: {a::x: bytearray,a::y: bytearray,a::z: bytearray,b: bytearray,c: bytearray} Alias B has a clear schema. However, on the backend, if $0 happens to be null for a row, the output tuple becomes something like (null, b_value, c_value), which is obviously inconsistent with the schema. The behavior is confirmed by inspection of the Pig code. This inconsistency corrupts data because of position shifts. The expected output row should be something like (null, null, null, b_value, c_value).
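The expected semantics are that flattening a null tuple of known arity should emit that many nulls, so later fields keep their positions. A Python model of the expected behavior described above (a sketch, not Pig's actual POForEach code):

```python
def flatten_field(value, arity):
    """Flatten a tuple-typed field. When the tuple is null, emit `arity`
    nulls so subsequent fields keep their positions; emitting a single
    null (the behavior reported here) shifts the rest of the row."""
    return list(value) if value is not None else [None] * arity

row = (None, "b_value", "c_value")   # $0 is a null tuple of arity 3
out = flatten_field(row[0], 3) + [row[1], row[2]]
print(out)  # [None, None, None, 'b_value', 'c_value']
```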
[jira] [Updated] (PIG-2599) Mavenize Pig
[ https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2599: Fix Version/s: (was: 0.12.0) 0.13.0 Mavenize Pig Key: PIG-2599 URL: https://issues.apache.org/jira/browse/PIG-2599 Project: Pig Issue Type: New Feature Components: build Reporter: Daniel Dai Labels: gsoc2013 Fix For: 0.13.0 Attachments: maven-pig.1.zip Switch Pig build system from ant to maven. This is a candidate project for Google summer of code 2013. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2013 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2625) Allow use of JRuby for control flow
[ https://issues.apache.org/jira/browse/PIG-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2625: Fix Version/s: (was: 0.12.0) 0.13.0 Allow use of JRuby for control flow --- Key: PIG-2625 URL: https://issues.apache.org/jira/browse/PIG-2625 Project: Pig Issue Type: New Feature Reporter: Jonathan Coveney Fix For: 0.13.0 Much like people can use jython for iterative computation, it'd be great to use JRuby for the same -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Karn updated PIG-2417: - Attachment: PIG-2417-unicode.patch Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation. - Key: PIG-2417 URL: https://issues.apache.org/jira/browse/PIG-2417 Project: Pig Issue Type: Improvement Affects Versions: 0.12.0 Reporter: Jeremy Karn Assignee: Jeremy Karn Fix For: 0.12.0 Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9-1.patch, PIG-2417-9-2.patch, PIG-2417-9.patch, PIG-2417-e2e.patch, PIG-2417-unicode.patch, streaming2.patch, streaming3.patch, streaming.patch The goal of Streaming UDFs is to allow users to easily write UDFs in scripting languages with no JVM implementation or a limited JVM implementation. The initial proposal is outlined here: https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs. In order to implement this we need new syntax to distinguish a streaming UDF from an embedded JVM UDF. I'd propose something like the following (although I'm not sure 'language' is the best term to be using): {code}define my_streaming_udfs language('python') ship('my_streaming_udfs.py'){code} We'll also need a language-specific controller script that gets shipped to the cluster which is responsible for reading the input stream, deserializing the input data, passing it to the user written script, serializing that script output, and writing that to the output stream. Finally, we'll need to add a StreamingUDF class that extends evalFunc. This class will likely share some of the existing code in POStream and ExecutableManager (where it make sense to pull out shared code) to stream data to/from the controller script. One alternative approach to creating the StreamingUDF EvalFunc is to use the POStream operator directly. 
This would involve inserting the POStream operator instead of the POUserFunc operator whenever we encountered a streaming UDF while building the physical plan. This approach seemed problematic because there would need to be a lot of changes in order to support POStream in all of the places we want to be able use UDFs (For example - to operate on a single field inside of a for each statement). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
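The controller script's core loop described above (read the input stream, deserialize, call the user's function, serialize, write back) can be sketched in Python. The tab-separated framing and the tuple handling here are assumptions for illustration; the real protocol is defined by the StreamingUDF implementation:

```python
def controller(lines, udf):
    """Sketch of the proposed controller loop: deserialize each
    tab-separated input line, apply the user-written function, and
    serialize the result back to the output stream."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        result = udf(*fields)
        if not isinstance(result, tuple):
            result = (result,)
        yield "\t".join(str(r) for r in result)

# Hypothetical user UDF adding two integer fields.
out = list(controller(["3\t4\n"], lambda a, b: int(a) + int(b)))
print(out)  # ['7']
```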
[jira] [Updated] (PIG-2591) Unit tests should not write to /tmp but respect java.io.tmpdir
[ https://issues.apache.org/jira/browse/PIG-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2591: Fix Version/s: (was: 0.12.0) 0.13.0 Unit tests should not write to /tmp but respect java.io.tmpdir -- Key: PIG-2591 URL: https://issues.apache.org/jira/browse/PIG-2591 Project: Pig Issue Type: Bug Components: tools Reporter: Thomas Weise Assignee: Jarek Jarcec Cecho Fix For: 0.13.0 Attachments: bugPIG-2591.patch, PIG-2495.patch Several tests use /tmp but should derive temporary file location from java.io.tmpdir to avoid side effects (java.io.tmpdir is already set to a test run specific location in build.xml) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2595) BinCond only works inside parentheses
[ https://issues.apache.org/jira/browse/PIG-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2595: Fix Version/s: (was: 0.12.0) 0.13.0 BinCond only works inside parentheses - Key: PIG-2595 URL: https://issues.apache.org/jira/browse/PIG-2595 Project: Pig Issue Type: Bug Reporter: Daniel Dai Fix For: 0.13.0 Not sure if we have a Jira for this before. This script does not work: {code} a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double, instate:chararray); b = foreach a generate name, instate=='true'?gpa:gpa+1; dump b; {code} If we put bincond into parentheses, it works {code} a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as (name, age:int, gpa:double, instate:chararray); b = foreach a generate name, (instate=='true'?gpa:gpa+1); dump b; {code} Exception: ERROR 1200: file 40.pig, line 2, column 36 mismatched input '==' expecting SEMI_COLON org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. 
file 40.pig, line 2, column 36 mismatched input '==' expecting SEMI_COLON at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1541) at org.apache.pig.PigServer.registerQuery(PigServer.java:541) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:945) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:392) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84) at org.apache.pig.Main.run(Main.java:599) at org.apache.pig.Main.main(Main.java:153) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: Failed to parse: file 40.pig, line 2, column 36 mismatched input '==' expecting SEMI_COLON at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:226) at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:168) at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1590) ... 14 more -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2624) Handle recursive inclusion of scripts in JRuby UDFs
[ https://issues.apache.org/jira/browse/PIG-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2624: Fix Version/s: (was: 0.12.0) 0.13.0 Handle recursive inclusion of scripts in JRuby UDFs --- Key: PIG-2624 URL: https://issues.apache.org/jira/browse/PIG-2624 Project: Pig Issue Type: Improvement Affects Versions: 0.10.0, 0.11 Reporter: Jonathan Coveney Labels: JRuby Fix For: 0.13.0 Currently, if one script requires another (via Ruby's require), the dependency won't be properly handled. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777997#comment-13777997 ] Jeremy Karn commented on PIG-2417: -- I can't reproduce the problem, but I think PIG-2417-unicode.patch should fix the encoding issue. Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation. - Key: PIG-2417 URL: https://issues.apache.org/jira/browse/PIG-2417 Project: Pig Issue Type: Improvement Affects Versions: 0.12.0 Reporter: Jeremy Karn Assignee: Jeremy Karn Fix For: 0.12.0 Attachments: PIG-2417-4.patch, PIG-2417-5.patch, PIG-2417-6.patch, PIG-2417-7.patch, PIG-2417-8.patch, PIG-2417-9-1.patch, PIG-2417-9-2.patch, PIG-2417-9.patch, PIG-2417-e2e.patch, PIG-2417-unicode.patch, streaming2.patch, streaming3.patch, streaming.patch The goal of Streaming UDFs is to allow users to easily write UDFs in scripting languages with no JVM implementation or a limited JVM implementation. The initial proposal is outlined here: https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs. In order to implement this, we need new syntax to distinguish a streaming UDF from an embedded JVM UDF. I'd propose something like the following (although I'm not sure 'language' is the best term to be using): {code}define my_streaming_udfs language('python') ship('my_streaming_udfs.py'){code} We'll also need a language-specific controller script that gets shipped to the cluster which is responsible for reading the input stream, deserializing the input data, passing it to the user-written script, serializing that script's output, and writing that to the output stream. Finally, we'll need to add a StreamingUDF class that extends EvalFunc. This class will likely share some of the existing code in POStream and ExecutableManager (where it makes sense to pull out shared code) to stream data to/from the controller script. 
One alternative approach to creating the StreamingUDF EvalFunc is to use the POStream operator directly. This would involve inserting the POStream operator instead of the POUserFunc operator whenever we encountered a streaming UDF while building the physical plan. This approach seemed problematic because there would need to be a lot of changes in order to support POStream in all of the places we want to be able to use UDFs (for example, to operate on a single field inside a FOREACH statement). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2628) Allow in line scripting UDF definitions
[ https://issues.apache.org/jira/browse/PIG-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2628: Fix Version/s: (was: 0.12.0) 0.13.0 Allow in line scripting UDF definitions --- Key: PIG-2628 URL: https://issues.apache.org/jira/browse/PIG-2628 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0 For small udfs in scripting languages, it may be cumbersome to force users to make a script, put it on the classpath, ship it, etc. It would be great to support a syntax that allows people to declare UDFs in line (essentially, to define a snippet of code that will be interpreted as a scriptlet) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2630) Issue with setting b = a;
[ https://issues.apache.org/jira/browse/PIG-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2630: Fix Version/s: (was: 0.12.0) 0.13.0 Issue with setting b = a; --- Key: PIG-2630 URL: https://issues.apache.org/jira/browse/PIG-2630 Project: Pig Issue Type: Bug Affects Versions: 0.10.0, 0.11 Reporter: Jonathan Coveney Fix For: 0.13.0 The following gives an error: {code} a = load 'thing' as (x:int); b = a; c = join a by x, b by x; {code} Error: {code} 2012-04-03 14:02:47,434 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: line 14, column 4 pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! {code} No issue with the following, however {code} a = load 'thing' as (x:int); b = foreach a generate *; c = join a by x, b by x; {code} oh and here is the log: {code} $ cat pig_1333487146863.log Pig Stack Trace --- ERROR 1200: Pig script failed to parse: line 3, column 4 pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! Failed to parse: Pig script failed to parse: line 3, column 4 pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! 
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:182) at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1566) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1539) at org.apache.pig.PigServer.registerQuery(PigServer.java:541) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:945) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:392) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69) at org.apache.pig.Main.run(Main.java:535) at org.apache.pig.Main.main(Main.java:153) Caused by: line 3, column 4 pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225: Projection with nothing to reference! at org.apache.pig.parser.LogicalPlanBuilder.buildJoinOp(LogicalPlanBuilder.java:363) at org.apache.pig.parser.LogicalPlanGenerator.join_clause(LogicalPlanGenerator.java:11441) at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1491) at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:791) at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:509) at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:384) at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:175) ... 10 more {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Review Request 14274: PIG-2672 Optimize the use of DistributedCache
On Sept. 24, 2013, 6:30 a.m., Cheolsoo Park wrote: trunk/test/org/apache/pig/test/TestJobControlCompiler.java, line 161 https://reviews.apache.org/r/14274/diff/1/?file=355177#file355177line161 The following line is missing in the RB diff but it's in the attached patch: properties.setProperty(PigConstants.PIG_SHARED_CACHE_ENABLED_KEY, true); Just pointing it out. I realized that we do not need the PIG_SHARED_CACHE_ENABLED_KEY property for this. So, I removed this unnecessary property from the RB. I will attach the patch with the changes soon. - Aniket --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/14274/#review26342 --- On Sept. 21, 2013, 1:21 a.m., Aniket Mokashi wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/14274/ --- (Updated Sept. 21, 2013, 1:21 a.m.) Review request for pig, Cheolsoo Park, DanielWX DanielWX, Dmitriy Ryaboy, Julien Le Dem, and Rohini Palaniswamy. Bugs: PIG-2672 https://issues.apache.org/jira/browse/PIG-2672 Repository: pig Description --- added jar.cache.location option Diffs - trunk/src/org/apache/pig/PigConstants.java 1525188 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1525188 trunk/src/org/apache/pig/impl/PigContext.java 1525188 trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1525188 trunk/test/org/apache/pig/test/TestJobControlCompiler.java 1525188 Diff: https://reviews.apache.org/r/14274/diff/ Testing --- Thanks, Aniket Mokashi
Re: Review Request 14274: PIG-2672 Optimize the use of DistributedCache
On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote: trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java, line 1492 https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1492 If it is an HDFS path, use it as is and do not ship it to the jar cache. That will also save time and hash checks. Currently, PigServer#registerJar localizes all the jars. So, this would need some more refactoring before we can do it. I will try to solve this in a separate jira. On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote: trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java, line 1495 https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1495 Since the name of the file on hdfs is different from that of the actual file, create a symlink with the actual filename. Some users might depend on the actual file name. Rohini Palaniswamy wrote: One case I see is python scripts (jython UDFs) which do imports based on the file name. It would be the same for the other scripting languages that we support. It would be good to run the full unit and e2e tests with your patch before going for a commit. Maybe I should avoid renaming the files and just put them under /a/b/c/abcdefsha1/udf.jar. On Sept. 25, 2013, 12:13 a.m., Rohini Palaniswamy wrote: trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java, line 1509 https://reviews.apache.org/r/14274/diff/1/?file=355174#file355174line1509 First do a file size comparison before calculating the checksum, for better efficiency. A size check would require stat calls to the NN; the checksum, being local, should be quicker than that. - Aniket --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/14274/#review26364 --- On Sept. 21, 2013, 1:21 a.m., Aniket Mokashi wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/14274/ --- (Updated Sept. 21, 2013, 1:21 a.m.) 
Review request for pig, Cheolsoo Park, DanielWX DanielWX, Dmitriy Ryaboy, Julien Le Dem, and Rohini Palaniswamy. Bugs: PIG-2672 https://issues.apache.org/jira/browse/PIG-2672 Repository: pig Description --- added jar.cache.location option Diffs - trunk/src/org/apache/pig/PigConstants.java 1525188 trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1525188 trunk/src/org/apache/pig/impl/PigContext.java 1525188 trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1525188 trunk/test/org/apache/pig/test/TestJobControlCompiler.java 1525188 Diff: https://reviews.apache.org/r/14274/diff/ Testing --- Thanks, Aniket Mokashi
[jira] [Updated] (PIG-2631) Pig should allow self joins
[ https://issues.apache.org/jira/browse/PIG-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2631: Fix Version/s: (was: 0.12.0) 0.13.0 Pig should allow self joins --- Key: PIG-2631 URL: https://issues.apache.org/jira/browse/PIG-2631 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0 This doesn't even have to be optimized, and can still involve a double scan of the data, but there is no reason why the following should work: {code} a = load 'thing' as (x:int); b = join a by x, (foreach a generate *) by x; {code} while this does not: {code} a = load 'thing' as (x:int); b = join a by x, a by x; {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2633) Create a SchemaBag which generates a Bag with a known Schema via code gen
[ https://issues.apache.org/jira/browse/PIG-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2633: Fix Version/s: (was: 0.12.0) 0.13.0 Create a SchemaBag which generates a Bag with a known Schema via code gen - Key: PIG-2633 URL: https://issues.apache.org/jira/browse/PIG-2633 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Assignee: Jonathan Coveney Fix For: 0.13.0 This is related to PIG-2632. The idea is to also extend that and create a known version based on a given Schema. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778025#comment-13778025 ] Rohini Palaniswamy commented on PIG-2417: - +1 for PIG-2417-unicode.patch. That worked. Committed to 0.12 and trunk. Thanks Jeremy. Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation. - Key: PIG-2417 URL: https://issues.apache.org/jira/browse/PIG-2417 Project: Pig Issue Type: Improvement Affects Versions: 0.12.0 Reporter: Jeremy Karn Assignee: Jeremy Karn Fix For: 0.12.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2834) MultiStorage requires unused constructor argument
[ https://issues.apache.org/jira/browse/PIG-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2834: Fix Version/s: (was: 0.12.0) 0.13.0 MultiStorage requires unused constructor argument - Key: PIG-2834 URL: https://issues.apache.org/jira/browse/PIG-2834 Project: Pig Issue Type: Improvement Components: data Affects Versions: 0.10.0, 0.11 Environment: Linux Reporter: Danny Antonetti Priority: Trivial Labels: newbie Fix For: 0.13.0 Attachments: MultiStorage.patch Each constructor in org.apache.pig.piggybank.storage.MultiStorage requires a constructor argument 'parentPathStr' that has no meaningful usage. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2880) Pig current releases lack a UDF charAt. This UDF returns the char value at the specified index.
[ https://issues.apache.org/jira/browse/PIG-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778036#comment-13778036 ] Daniel Dai commented on PIG-2880: - [~sunitha muralidharan], are you still working on it? Pig current releases lack a UDF charAt. This UDF returns the char value at the specified index. -- Key: PIG-2880 URL: https://issues.apache.org/jira/browse/PIG-2880 Project: Pig Issue Type: New Feature Components: piggybank Reporter: Sabir Ayappalli Labels: patch Fix For: 0.12.0 Attachments: CharAt.java.patch Current Pig releases lack a charAt UDF. This UDF returns the char value at the specified index. An index ranges from 0 to length() - 1. The first char value of the sequence is at index 0, the next at index 1, and so on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
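The semantics described in the issue (valid indexes run from 0 to length() - 1, mirroring Java's String.charAt) could be sketched in a few lines; this is an illustrative sketch of the behavior, not the attached CharAt.java.patch:

```python
def char_at(s, index):
    """Return the character of s at the given index.

    Mirrors Java's String.charAt: valid indexes run from 0 to
    len(s) - 1; anything outside that range is an error. Null (None)
    inputs are passed through, a common convention for Pig UDFs.
    """
    if s is None or index is None:
        return None
    if not 0 <= index < len(s):
        raise IndexError("index %d out of range for %r" % (index, s))
    return s[index]
```

A piggybank implementation would wrap the same check in an EvalFunc that reads its arguments from the input tuple.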
[jira] [Updated] (PIG-2687) Add relation/operator scoping to Pig
[ https://issues.apache.org/jira/browse/PIG-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2687: Fix Version/s: (was: 0.12.0) 0.13.0 Add relation/operator scoping to Pig Key: PIG-2687 URL: https://issues.apache.org/jira/browse/PIG-2687 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Priority: Minor Fix For: 0.13.0 The idea is to add a real notion of scope that can be used to manage the namespace. This would mean the addition of blocks to Pig, probably with some sort of syntax like this... {code} a = load 'thing' as (x:int, y:int); b = foreach a generate x, y, x*y as z; { a = group b by z; b = foreach a generate COUNT(b); global b; } {code} which would replace the alias b with the nested b value in the scope. This could also be used in nested foreach blocks, and macros could just become blocks as well. I am 95% sure about how to implement this... I have a failed patch attempt, and need to study a bit more about how Pig uses its logical operators. Any thoughts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps
[ https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778037#comment-13778037 ] Daniel Dai commented on PIG-2641: - [~russell.jurney], are you still working on it? Create toJSON function for all complex types: tuples, bags and maps --- Key: PIG-2641 URL: https://issues.apache.org/jira/browse/PIG-2641 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.12.0 Environment: Foggy. Damn foggy. Reporter: Russell Jurney Assignee: Russell Jurney Labels: chararray, fun, happy, input, json, output, pants, pig, piggybank, string, wonderdog Fix For: 0.12.0 Attachments: PIG-2641-2.patch, PIG-2641-3.patch, PIG-2641-4.patch, PIG-2641-5.patch, PIG-2641-6.patch, PIG-2641.patch Original Estimate: 96h Remaining Estimate: 96h It is a travesty that there are no UDFs in Piggybanks that, given an arbitrary Pig datatype, return a JSON string of same. I intend to fix this problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
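The shape of a toJSON function over Pig's complex types is straightforward to sketch. The mapping chosen here is an assumption for illustration: tuples become JSON arrays, bags (collections of tuples) become arrays of arrays, and maps become JSON objects; the attached PIG-2641 patches may use a different representation.

```python
import json

def to_json(value):
    """Serialize a Pig-style value (tuple, bag, or map) to a JSON string.

    Pig values are modeled as Python types for illustration:
    tuple -> JSON array, list-of-tuples (a bag) -> array of arrays,
    dict (a map) -> object; scalars pass through unchanged.
    """
    def convert(v):
        if isinstance(v, tuple):
            return [convert(x) for x in v]
        if isinstance(v, list):          # a bag: a collection of tuples
            return [convert(x) for x in v]
        if isinstance(v, dict):
            return {str(k): convert(x) for k, x in v.items()}
        return v                         # int, float, chararray, None
    return json.dumps(convert(value))
```

The recursion matters because Pig's complex types nest arbitrarily: a bag of tuples of maps serializes correctly without any special cases.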
[jira] [Updated] (PIG-2681) TestDriverPig.countStores() does not correctly count the number of stores for pig scripts using variables for the alias
[ https://issues.apache.org/jira/browse/PIG-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2681: Fix Version/s: (was: 0.12.0) 0.13.0 TestDriverPig.countStores() does not correctly count the number of stores for pig scripts using variables for the alias --- Key: PIG-2681 URL: https://issues.apache.org/jira/browse/PIG-2681 Project: Pig Issue Type: Test Components: e2e harness Affects Versions: 0.9.0, 0.9.1, 0.9.2, 0.10.0 Reporter: Araceli Henley Fix For: 0.13.0 Attachments: PIG-2681.patch For pig macros where the out parameter is referenced in a store statement, TestDriverPig.countStores() does not correctly count the number of stores. For example, the store will not be counted in: define myMacro(in1,in2) returns A { A = load '$in1' using PigStorage('$delimeter') as (intnum1000: int,id: int,intnum5: int,intnum100: int,intnum: int,longnum: long,floatnum: float,doublenum: double); store $A into '$out'; } countStores() matches with: $count += $q[$i] =~ /store\s+[a-zA-Z][a-zA-Z0-9_]*\s+into/i; Since the alias has the special character $, the pattern doesn't count it and the test fails. We need to change this to: $count += $q[$i] =~ /store\s+(\$)?[a-zA-Z][a-zA-Z0-9_]*\s+into/i; I'll submit a patch shortly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
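The effect of the proposed pattern change can be checked directly. The harness uses Perl; the same two patterns are rendered here in Python's re syntax for a self-contained demonstration:

```python
import re

# The original and the proposed countStores() patterns from the issue,
# translated from Perl's /.../i form to Python's re syntax.
OLD = re.compile(r"store\s+[a-zA-Z][a-zA-Z0-9_]*\s+into", re.IGNORECASE)
NEW = re.compile(r"store\s+(\$)?[a-zA-Z][a-zA-Z0-9_]*\s+into", re.IGNORECASE)

macro_store = "store $A into '$out';"   # store inside a macro body: alias starts with $
plain_store = "store B into 'out';"     # ordinary store statement
```

The old pattern fails on the macro's store because `$` is not in `[a-zA-Z]`, while the optional `(\$)?` group in the new pattern accepts both forms, so both stores are counted.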
[jira] [Updated] (PIG-2927) SHIP and use JRuby gems in JRuby UDFs
[ https://issues.apache.org/jira/browse/PIG-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2927: Fix Version/s: (was: 0.12.0) 0.13.0 SHIP and use JRuby gems in JRuby UDFs - Key: PIG-2927 URL: https://issues.apache.org/jira/browse/PIG-2927 Project: Pig Issue Type: New Feature Components: parser Affects Versions: 0.11 Environment: JRuby UDFs Reporter: Russell Jurney Assignee: Jonathan Coveney Priority: Minor Fix For: 0.13.0 Attachments: PIG-2927-0.patch, PIG-2927-1.patch, PIG-2927-2.patch, PIG-2927-3.patch, PIG-2927-4.patch It would be great to use JRuby gems in JRuby UDFs without installing them on all machines on the cluster. Some way to SHIP them automatically with the job would be great. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3008) Fix whitespace in Pig code
[ https://issues.apache.org/jira/browse/PIG-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3008: Fix Version/s: (was: 0.12.0) 0.13.0 Fix whitespace in Pig code -- Key: PIG-3008 URL: https://issues.apache.org/jira/browse/PIG-3008 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0 Attachments: checkstyle.xml This JIRA exists mainly to get a conversation started. We've talked about it before, and it's a tricky issue. That said, some of the Pig code is super, super gnarly. We need some sort of path that will let it eventually be fixable. I posit: any file that hasn't been touched for over 6 months is eligible for a whitespace patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3055) Make it possible to register new script engines
[ https://issues.apache.org/jira/browse/PIG-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3055: Fix Version/s: (was: 0.12.0) 0.13.0 Make it possible to register new script engines --- Key: PIG-3055 URL: https://issues.apache.org/jira/browse/PIG-3055 Project: Pig Issue Type: Improvement Reporter: Greg Bowyer Assignee: Greg Bowyer Fix For: 0.13.0 Attachments: PIG-3055-Make-it-possible-to-register-a-script-engine.patch Hi shiny Pig people, I have recently been playing with getting Renjin to work as a script engine in Pig in the same manner as Jython/Ruby etc. Renjin is a re-implementation of R in Java. For now the Renjin project is in its infancy and is probably not best suited to being bundled with Pig; as such, I need to be able to extend the ScriptEngine interface and register Renjin as a suitable engine for Pig to use. At present the parts of Pig that know about script engines are not easily changed; attached is a patch that should make this possible. Thoughts? Ideas? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
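The extension point described here — letting users register an engine class for a language keyword instead of hard-coding the supported set — follows a simple registry pattern. The names below are hypothetical illustrations, not Pig's actual ScriptEngine API:

```python
class ScriptEngine:
    """Minimal base class an external engine (e.g. a Renjin bridge)
    would extend. Name and shape are illustrative assumptions."""
    def run(self, script):
        raise NotImplementedError

_ENGINES = {}

def register_engine(language, engine_cls):
    """Associate a language keyword with an engine class, so the
    framework can resolve engines it was not compiled against."""
    if not issubclass(engine_cls, ScriptEngine):
        raise TypeError("engine must extend ScriptEngine")
    _ENGINES[language] = engine_cls

def get_engine(language):
    """Instantiate the engine registered for a language keyword."""
    try:
        return _ENGINES[language]()
    except KeyError:
        raise ValueError("no script engine registered for %r" % language)
```

The benefit is exactly what the patch aims for: an out-of-tree project in its infancy can plug in without being bundled, because the core only depends on the base interface.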
[jira] [Commented] (PIG-3010) Allow UDF's to flatten themselves
[ https://issues.apache.org/jira/browse/PIG-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778045#comment-13778045 ] Daniel Dai commented on PIG-3010: - [~jcoveney], are you still working on it? Allow UDF's to flatten themselves - Key: PIG-3010 URL: https://issues.apache.org/jira/browse/PIG-3010 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Assignee: Jonathan Coveney Fix For: 0.12.0 Attachments: PIG-3010-0.patch, PIG-3010-1.patch, PIG-3010-2_nowhitespace.patch, PIG-3010-2.patch, PIG-3010-3_nows.patch, PIG-3010-3.patch, PIG-3010-4_nows.patch, PIG-3010-4.patch, PIG-3010-5_nows.patch, PIG-3010-5.patch This is something I thought would be cool for a while, so I sat down and did it because I think there are some useful debugging tools it'd help with. The idea is that if you attach an annotation to a UDF, the Tuple or DataBag you output will be flattened. This is quite powerful. A very common pattern is: a = foreach data generate Flatten(MyUdf(thing)) as (a,b,c); This would let you just do: a = foreach data generate MyUdf(thing); With the exact same result! -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
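The annotation-driven behavior proposed here is Java-side, but the core idea — mark a UDF so its output tuple is spliced into the enclosing record rather than nested, making `MyUdf(thing)` behave like `Flatten(MyUdf(thing))` — can be illustrated with a decorator. This is purely a conceptual sketch, not the PIG-3010 implementation:

```python
def flattens(func):
    """Mark a UDF whose tuple result should be flattened by the caller,
    playing the role of the proposed Java annotation."""
    func.flatten = True
    return func

@flattens
def my_udf(thing):
    # Hypothetical UDF producing an (a, b, c)-style tuple.
    return (thing, thing * 2, thing * 3)

def apply_udf(record, func):
    """Apply a UDF to a record, splicing the result in when the UDF
    is marked as self-flattening."""
    result = func(record)
    if getattr(func, "flatten", False) and isinstance(result, tuple):
        return list(result)   # flattened: the tuple's fields become columns
    return [result]           # default: one nested tuple-valued field
```

The payoff matches the issue's example: the caller no longer has to wrap the call in an explicit Flatten to get (a, b, c) as separate fields.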
[jira] [Updated] (PIG-3038) Support for Credentials for UDF, Loader and Storer
[ https://issues.apache.org/jira/browse/PIG-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3038: Fix Version/s: (was: 0.12.0) 0.13.0 Support for Credentials for UDF, Loader and Storer - Key: PIG-3038 URL: https://issues.apache.org/jira/browse/PIG-3038 Project: Pig Issue Type: New Feature Affects Versions: 0.10.0 Reporter: Rohini Palaniswamy Fix For: 0.13.0 Pig does not have a clean way (APIs) to support adding Credentials (HBase token, HCat/Hive metastore token) to a Job and retrieving them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps
[ https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778061#comment-13778061 ] Russell Jurney commented on PIG-2641: - How long do I have to get this into 0.12? Is that still possible? Create toJSON function for all complex types: tuples, bags and maps --- Key: PIG-2641 URL: https://issues.apache.org/jira/browse/PIG-2641 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.12.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.12.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps
[ https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778069#comment-13778069 ] Daniel Dai commented on PIG-2641: - Since 0.12 is already branched, we're not supposed to commit new features. Can we defer this to 0.13.0? Create toJSON function for all complex types: tuples, bags and maps --- Key: PIG-2641 URL: https://issues.apache.org/jira/browse/PIG-2641 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.12.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.12.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders
[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3404: -- Description: There are use cases in Pig:
* A directory is used as the input of a load operation. It is possible that one or more files in that directory are bad files (for example, corrupted, or bad data caused by compression).
* A directory is used as the input of a load operation. The current user may not have permission to access some subdirectories or files of that directory.
The current Pig implementation aborts the whole Pig job in such cases. It would be useful to have an option that allows the job to continue and ignore the bad or inaccessible files/folders without aborting, ideally logging or printing a warning for each such error or violation. This requirement is not trivial: for the big data sets of large analytics applications, it is not always possible to sort out the good data for processing, and ignoring a few bad files may be the better choice in such situations. We propose an "ignore bad files" flag to address this problem. AvroStorage and related file formats in Pig already have this flag, but it does not cover all the cases mentioned above. We would improve PigStorage and the related text formats to support this new flag, and improve AvroStorage and related facilities to completely support the concept. The flag is "Storage"-based (for example, PigStorage or AvroStorage) and can be set for each load operation separately. The value of this flag will be false if it is not explicitly set. Ideally, we can provide a global Pig parameter which forces the default value to true for all load functions even when it is not explicitly set in the LOAD statement.
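A sketch of how the proposed per-load flag might look; the '-ignoreBadFiles' option string and the global property name below are illustrative assumptions, not a shipped API:
{code}
-- hypothetical per-load flag on PigStorage
A = LOAD '/data/events' USING PigStorage(',', '-ignoreBadFiles');

-- hypothetical global default, forcing the flag on for every LOAD:
-- pig -Dpig.load.ignore.bad.files=true script.pig
{code}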
[jira] [Commented] (PIG-3404) Improve Pig to ignore bad files or inaccessible files or folders
[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778119#comment-13778119 ] Koji Noguchi commented on PIG-3404: --- (I accidentally updated the description, sorry for the spam.) For this issue, can we use mapred.max.map.failures.percent (or mapreduce.map.failures.maxpercent in 2.*)? Improve Pig to ignore bad files or inaccessible files or folders Key: PIG-3404 URL: https://issues.apache.org/jira/browse/PIG-3404 Project: Pig Issue Type: New Feature Components: data Affects Versions: 0.11.2 Reporter: Jerry Chen Labels: Rhino Attachments: PIG-3404.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
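The property Koji mentions can be tried from a Pig script with the `set` command, which passes Hadoop job properties through to the launched jobs; the 10% threshold below is an illustrative value:
{code}
-- allow up to 10% of map tasks (e.g. those hitting a bad file)
-- to fail without failing the whole job
set mapred.max.map.failures.percent 10;
-- Hadoop 2.x property name: mapreduce.map.failures.maxpercent
{code}
Note this tolerates failed map tasks of any cause, not just bad input files, so it is coarser than the proposed per-load flag.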
[jira] [Created] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory.
Raviteja Chirala created PIG-3482: - Summary: Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory. Key: PIG-3482 URL: https://issues.apache.org/jira/browse/PIG-3482 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11.1 Environment: RHEL 6.0 Reporter: Raviteja Chirala Priority: Minor Fix For: 0.12.1
[jira] [Updated] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory.
[ https://issues.apache.org/jira/browse/PIG-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raviteja Chirala updated PIG-3482: -- Description: When we run mapper-only jobs, all of the intermediate (compressed) outputs go to the user directory instead of /tmp. On small datasets this shouldn't be a problem, but when I run on large datasets (say, more than 100 TB), it takes up so much disk space that it even exceeds the 100 GB disk space quota (setSpaceQuota). The problem happens before cleanup. Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory. --- Key: PIG-3482 URL: https://issues.apache.org/jira/browse/PIG-3482 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11.1 Environment: RHEL 6.0 Reporter: Raviteja Chirala Priority: Minor Fix For: 0.12.1
[jira] [Updated] (PIG-3482) Mapper only Jobs are not creating intermediate files in /tmp/, instead of creating in user directory.
[ https://issues.apache.org/jira/browse/PIG-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raviteja Chirala updated PIG-3482: -- Summary: Mapper only Jobs are not creating intermediate files in /tmp/, instead of creating in user directory. (was: Mapper only Jobs are not creating intermediate files in /tmp/, instead creating in user directory.) Mapper only Jobs are not creating intermediate files in /tmp/, instead of creating in user directory. -- Key: PIG-3482 URL: https://issues.apache.org/jira/browse/PIG-3482 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11.1 Environment: RHEL 6.0 Reporter: Raviteja Chirala Priority: Minor Fix For: 0.12.1
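As a possible workaround for the quota issue, the location of Pig's intermediate data is controlled by the `pig.temp.dir` property (its documented default is /tmp on the configured file system); a sketch, assuming the property is honored at script level, with an illustrative path:
{code}
-- redirect intermediate job output away from the user directory
set pig.temp.dir '/tmp/pig_intermediate';
{code}
The property can also be set in pig.properties or on the command line with -Dpig.temp.dir=...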
[jira] [Updated] (PIG-3083) Introduce new syntax that let's you project just the columns that come from a given :: prefix
[ https://issues.apache.org/jira/browse/PIG-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3083: Fix Version/s: (was: 0.12.0) 0.13.0 Introduce new syntax that let's you project just the columns that come from a given :: prefix - Key: PIG-3083 URL: https://issues.apache.org/jira/browse/PIG-3083 Project: Pig Issue Type: Bug Affects Versions: 0.12.0 Reporter: Jonathan Coveney Labels: PIG-3078 Fix For: 0.13.0 Attachments: pig_jira_aguin_3083.patch This is basically a more refined approach than PIG-3078, but it is also more work. That JIRA is more of a stopgap until we do something like this. The idea would be to support something like the following:
{code}
a = load 'a' as (x,y,z);
b = load 'b' as (x,y,z);
c = join a by x, b by x;
d = foreach c generate a::*;
{code}
Obviously this is useful for any case where you have relations with columns with various prefixes.
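For contrast, a sketch of the existing explicit form that the proposed `a::*` syntax would shorten, using the same relations as the example above (every disambiguated column must be enumerated by hand today):
{code}
-- current workaround: project each prefixed column individually
d = foreach c generate a::x, a::y, a::z;
{code}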
[jira] [Updated] (PIG-3087) Refactor TestLogicalPlanBuilder to be meaningful
[ https://issues.apache.org/jira/browse/PIG-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3087: Fix Version/s: (was: 0.12.0) 0.13.0 Refactor TestLogicalPlanBuilder to be meaningful Key: PIG-3087 URL: https://issues.apache.org/jira/browse/PIG-3087 Project: Pig Issue Type: Bug Reporter: Jonathan Coveney Labels: newbie Fix For: 0.13.0 Attachments: PIG-3087-0.patch I started doing this as part of another patch, but there are some bigger issues, and I don't have the time to dig in atm. That said, a lot of the tests as written don't test anything. I used more modern JUnit patterns and discovered we had a lot of tests that weren't functioning properly. Making them function properly revealed that the general buildLp pattern no longer works for many cases where grunt would throw an error, but for whatever reason no error is thrown in the tests. Any test ending in _1 is a test that previously failed and now doesn't. Some, however, don't make sense, so what really needs to be done is to figure out which should be failing and which shouldn't, and then fix buildLp accordingly. I will attach my pass at it, but it is incomplete and needs work.
[jira] [Updated] (PIG-3104) XMLLoader return Pig tuple/map/bag representation of the DOM of XML documents
[ https://issues.apache.org/jira/browse/PIG-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3104: Fix Version/s: (was: 0.12.0) 0.13.0 XMLLoader return Pig tuple/map/bag representation of the DOM of XML documents - Key: PIG-3104 URL: https://issues.apache.org/jira/browse/PIG-3104 Project: Pig Issue Type: Improvement Components: internal-udfs, piggybank Affects Versions: 0.10.0, 0.11 Reporter: Russell Jurney Assignee: Daniel Dai Fix For: 0.13.0 I want to extend Pig's existing XMLLoader to go beyond capturing the text inside a tag and to actually create a Pig mapping of the Document Object Model the XML represents. This would be similar to elephant-bird's JsonLoader. Semi-structured data can vary, so this behavior can be risky, but I want people to be able to load JSON and XML data easily in their first session with Pig.
{code}
characters = load 'example.xml' using XMLLoader('character');
describe characters
{properties:map[], name:chararray, born:datetime, qualification:chararray}
{code}
{code}
<library>
  <book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Being a Dog Is a Full-Time Job</title>
    <author id="CMS">
      <name>Charles M Schulz</name>
      <born>1922-11-26</born>
      <dead>2000-02-12</dead>
    </author>
    <character id="PP">
      <name>Peppermint Patty</name>
      <born>1966-08-22</born>
      <qualification>bold, brash and tomboyish</qualification>
    </character>
    <character id="Snoopy">
      <name>Snoopy</name>
      <born>1950-10-04</born>
      <qualification>extroverted beagle</qualification>
    </character>
    <character id="Schroeder">
      <name>Schroeder</name>
      <born>1951-05-30</born>
      <qualification>brought classical music to the Peanuts strip</qualification>
    </character>
    <character id="Lucy">
      <name>Lucy</name>
      <born>1952-03-03</born>
      <qualification>bossy, crabby and selfish</qualification>
    </character>
  </book>
</library>
{code}
[jira] [Commented] (PIG-3377) New AvroStorage throws NPE when storing untyped map/array/bag
[ https://issues.apache.org/jira/browse/PIG-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13778177#comment-13778177 ] Daniel Dai commented on PIG-3377: - [~jadler], are you still working on it? New AvroStorage throws NPE when storing untyped map/array/bag - Key: PIG-3377 URL: https://issues.apache.org/jira/browse/PIG-3377 Project: Pig Issue Type: Bug Components: internal-udfs Reporter: Cheolsoo Park Assignee: Joseph Adler Fix For: 0.12.0 The following example demonstrates the issue:
{code}
a = LOAD 'foo' AS (m:map[]);
STORE a INTO 'bar' USING AvroStorage();
{code}
This fails with the following error:
{code}
java.lang.NullPointerException
  at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceFieldSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:462)
  at org.apache.pig.impl.util.avro.AvroStorageSchemaConversionUtilities.resourceSchemaToAvroSchema(AvroStorageSchemaConversionUtilities.java:335)
  at org.apache.pig.builtin.AvroStorage.checkSchema(AvroStorage.java:472)
{code}
Similarly, an untyped bag causes the following error:
{code}
Caused by: java.lang.NullPointerException
  at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:722)
  ...
  at org.apache.avro.Schema.getElementType(Schema.java:256)
  at org.apache.pig.builtin.AvroStorage.setOutputAvroSchema(AvroStorage.java:491)
{code}
The problem is that AvroStorage cannot derive the output schema from an untyped map/bag/tuple. When the type is not defined, it should be assumed to be bytearray.
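Since the description attributes the NPE to the missing type, one possible workaround until the fix lands is to declare an explicit type so AvroStorage can derive an output schema; the chararray value type below is an illustrative assumption, not a confirmed fix:
{code}
-- typed map instead of untyped map[], sidestepping the untyped-schema path
a = LOAD 'foo' AS (m:map[chararray]);
STORE a INTO 'bar' USING AvroStorage();
{code}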
[jira] [Updated] (PIG-3326) Add PiggyBank to Maven Repository
[ https://issues.apache.org/jira/browse/PIG-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3326: Fix Version/s: (was: 0.12.0) 0.13.0 Add PiggyBank to Maven Repository - Key: PIG-3326 URL: https://issues.apache.org/jira/browse/PIG-3326 Project: Pig Issue Type: New Feature Components: piggybank Reporter: Aaron Mitchell Priority: Minor Fix For: 0.13.0 PiggyBank should be uploaded to the apache maven repository.
[jira] [Updated] (PIG-3254) Fail a failed Pig script quicker
[ https://issues.apache.org/jira/browse/PIG-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3254: Fix Version/s: (was: 0.12.0) 0.13.0 Fail a failed Pig script quicker Key: PIG-3254 URL: https://issues.apache.org/jira/browse/PIG-3254 Project: Pig Issue Type: Improvement Reporter: Daniel Dai Fix For: 0.13.0 Credit to [~asitecn]. Currently Pig can launch several mapreduce jobs simultaneously. When one mapreduce job fails, we need to wait for the simultaneous mapreduce jobs to finish. In addition, we could potentially launch additional jobs which are doomed to fail. However, this is unnecessary in some cases:
* If stop.on.failure==true, we can kill the parallel jobs and fail the whole script
* If stop.on.failure==false, and no store could succeed, we can also kill the parallel jobs and fail the whole script
Considering that simultaneous jobs may take a long time to finish, this could significantly improve the turnaround in some cases.
[jira] [Updated] (PIG-3232) Refactor Pig so that configurations use PigConfiguration wherever possible
[ https://issues.apache.org/jira/browse/PIG-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3232: Fix Version/s: (was: 0.12.0) 0.13.0 Refactor Pig so that configurations use PigConfiguration wherever possible -- Key: PIG-3232 URL: https://issues.apache.org/jira/browse/PIG-3232 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0
[jira] [Updated] (PIG-3143) Enable TOKENIZE to use any configurable Lucene Tokenizer, if a config parameter is set and the JARs included
[ https://issues.apache.org/jira/browse/PIG-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3143: Fix Version/s: (was: 0.12.0) 0.13.0 Enable TOKENIZE to use any configurable Lucene Tokenizer, if a config parameter is set and the JARs included Key: PIG-3143 URL: https://issues.apache.org/jira/browse/PIG-3143 Project: Pig Issue Type: Improvement Components: impl, internal-udfs Affects Versions: 0.11 Reporter: Russell Jurney Fix For: 0.13.0 I'll do this in time for 12. TOKENIZE is literally useless as is. See: http://thedatachef.blogspot.com/2011/04/lucene-text-tokenization-udf-for-apache.html https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java
[jira] [Updated] (PIG-3157) Move LENGTH from Piggybank to builtin, make LENGTH work for multiple types similar to SIZE
[ https://issues.apache.org/jira/browse/PIG-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3157: Fix Version/s: (was: 0.12.0) 0.13.0 Move LENGTH from Piggybank to builtin, make LENGTH work for multiple types similar to SIZE -- Key: PIG-3157 URL: https://issues.apache.org/jira/browse/PIG-3157 Project: Pig Issue Type: Improvement Components: internal-udfs, piggybank Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.13.0 LENGTH needs to be a builtin.
[jira] [Updated] (PIG-3133) Revamp algebraic interface to actually return classes
[ https://issues.apache.org/jira/browse/PIG-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3133: Fix Version/s: (was: 0.12.0) 0.13.0 Revamp algebraic interface to actually return classes - Key: PIG-3133 URL: https://issues.apache.org/jira/browse/PIG-3133 Project: Pig Issue Type: Improvement Reporter: Jonathan Coveney Fix For: 0.13.0 The current algebraic interface is a bit weird to work with. It would make a lot more sense to let people return Class<? extends EvalFunc<Tuple>> or what have you, or even a FuncSpec, but the current string-based approach circumvents the whole point of using Java and is annoying. I think we should have abstract EFInitial, EFIntermediate, and EFFinal classes which implement the exec function for the user, but in terms of a simpler, clearer interface. This way, if people really want the old way they can have it, but we can present them something less ugly. This would also be a good time to clarify the contracts of Algebraics and simplify them (the initial function's "a tuple which contains a bag which contains 1 tuple" is super whack). If anyone wants to work on this, let me know, because this is the sort of thing I will probably bang out when procrastinating something else.
[jira] [Updated] (PIG-3146) Can't 'import re' in Pig 0.10/0.10.1: ImportError: No module named re
[ https://issues.apache.org/jira/browse/PIG-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3146: Fix Version/s: (was: 0.12.0) 0.13.0 Can't 'import re' in Pig 0.10/0.10.1: ImportError: No module named re - Key: PIG-3146 URL: https://issues.apache.org/jira/browse/PIG-3146 Project: Pig Issue Type: Bug Affects Versions: 0.10.0, 0.10.1 Reporter: Russell Jurney Fix For: 0.13.0 Caused by:
{code}
Traceback (most recent call last):
  File "udfs.py", line 20, in <module>
    import re
ImportError: No module named re
{code}
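The traceback comes from a Python UDF file evaluated under Jython at registration time; a minimal script that exercises that path (the file and function names are hypothetical, for illustration only):
{code}
-- udfs.py begins with 'import re', which triggers the ImportError on 0.10.x
register 'udfs.py' using jython as udfs;
a = load 'input' as (line:chararray);
b = foreach a generate udfs.my_func(line);
{code}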
[jira] [Updated] (PIG-3176) Pig can't use $HOME in Grunt
[ https://issues.apache.org/jira/browse/PIG-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3176: Fix Version/s: (was: 0.12.0) 0.13.0 Pig can't use $HOME in Grunt Key: PIG-3176 URL: https://issues.apache.org/jira/browse/PIG-3176 Project: Pig Issue Type: Bug Components: grunt, parser Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.13.0 Pig needs to know the user's home directory, to let this easily be set, etc.
[jira] [Updated] (PIG-3165) sh command cannot run mongo client
[ https://issues.apache.org/jira/browse/PIG-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3165: Fix Version/s: (was: 0.12.0) 0.13.0 sh command cannot run mongo client -- Key: PIG-3165 URL: https://issues.apache.org/jira/browse/PIG-3165 Project: Pig Issue Type: Bug Components: grunt, tools Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Priority: Critical Fix For: 0.13.0 One often needs to drop an old MongoDB store when replacing it. Ex:
{code}
store answer into 'mongodb://localhost/agile_data.hourly_from_reply_probs' using MongoStorage();
{code}
Before doing that, you would likely want to run a mongo command from bash via grunt to drop it:
{code}
sh mongo --eval 'db.hourly_from_reply_probs.drop();'
{code}
However, in this case grunt acts as though the command never returns. Crap!
[jira] [Updated] (PIG-3177) Fix Pig project SEO so latest, 0.11 docs show when you google things
[ https://issues.apache.org/jira/browse/PIG-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3177: Fix Version/s: (was: 0.12.0) 0.13.0 Fix Pig project SEO so latest, 0.11 docs show when you google things Key: PIG-3177 URL: https://issues.apache.org/jira/browse/PIG-3177 Project: Pig Issue Type: Bug Components: site Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Priority: Critical Fix For: 0.13.0 http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/piggybank/storage/SequenceFileLoader.html The 0.7.0 docs are what everyone references. FOR POOPS SAKES.
[jira] [Updated] (PIG-3111) ToAvro to convert any Pig record to an Avro bytearray
[ https://issues.apache.org/jira/browse/PIG-3111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3111: Fix Version/s: (was: 0.12.0) 0.13.0 ToAvro to convert any Pig record to an Avro bytearray - Key: PIG-3111 URL: https://issues.apache.org/jira/browse/PIG-3111 Project: Pig Issue Type: New Feature Components: data, internal-udfs Affects Versions: 0.12.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.13.0 I want to create a ToAvro() builtin that converts arbitrary Pig fields, including complex types (bags, tuples, maps), to Avro format as bytearrays. This would enable storing Avro records in arbitrary data stores, for example HBaseAvroStorage in PIG-2889. See PIG-2641 for ToJson. This points to a greater need for customizable/pluggable serialization that plugs into storefuncs and does serialization independently. For example, we might do these operations:
{code}
a = load 'my_data' as (some_schema);
b = foreach a generate ToJson(*);
c = foreach a generate ToAvro(*);
store b into 'hbase://JsonValueTable' using HBaseStorage(...);
store c into 'hbase://AvroValueTable' using HBaseStorage(...);
{code}
I'll make a ticket for pluggable serialization separately.
[jira] [Updated] (PIG-3214) New/improved mascot
[ https://issues.apache.org/jira/browse/PIG-3214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3214: Fix Version/s: (was: 0.12.0) 0.13.0 New/improved mascot --- Key: PIG-3214 URL: https://issues.apache.org/jira/browse/PIG-3214 Project: Pig Issue Type: Wish Components: site Affects Versions: 0.11 Reporter: Andrew Musselman Priority: Minor Fix For: 0.13.0 Attachments: apache-pig-14.png, apache-pig-yellow-logo.png, newlogo1.png, newlogo2.png, newlogo3.png, newlogo4.png, newlogo5.png, new_logo_7.png, pig_6.JPG, pig_6_lc_g.JPG, pig-logo-10.png, pig-logo-11.png, pig-logo-12.png, pig-logo-13.png, pig-logo-8a.png, pig-logo-8b.png, pig-logo-9a.png, pig-logo-9b.png, pig_logo_new.png Request to change pig mascot to something more graphically appealing.
[jira] [Updated] (PIG-3227) SearchEngineExtractor does not work for bing
[ https://issues.apache.org/jira/browse/PIG-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3227: Fix Version/s: (was: 0.12.0) 0.13.0 SearchEngineExtractor does not work for bing Key: PIG-3227 URL: https://issues.apache.org/jira/browse/PIG-3227 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.13.0 Attachments: SearchEngineExtractor_Bing.patch org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchEngineExtractor Extracts a search engine from a URL, but it does not work for Bing -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3190) Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization
[ https://issues.apache.org/jira/browse/PIG-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3190: Fix Version/s: (was: 0.12.0) 0.13.0 Add LuceneTokenizer and SnowballTokenizer to Pig - useful text tokenization --- Key: PIG-3190 URL: https://issues.apache.org/jira/browse/PIG-3190 Project: Pig Issue Type: Bug Components: internal-udfs Affects Versions: 0.11 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.13.0 Attachments: PIG-3190-2.patch, PIG-3190-3.patch, PIG-3190.patch TOKENIZE is literally useless. The Lucene Standard/Snowball tokenizers, as used by varaha, are much more useful for actual tasks: https://github.com/Ganglion/varaha/blob/master/src/main/java/varaha/text/TokenizeText.java -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3188) pig.script.submitted.timestamp not always consistent for jobs launched in a given script
[ https://issues.apache.org/jira/browse/PIG-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3188: Fix Version/s: (was: 0.12.0) 0.13.0 pig.script.submitted.timestamp not always consistent for jobs launched in a given script Key: PIG-3188 URL: https://issues.apache.org/jira/browse/PIG-3188 Project: Pig Issue Type: Bug Reporter: Bill Graham Assignee: Bill Graham Fix For: 0.13.0 {{pig.script.submitted.timestamp}} is set in {{MapReduceLauncher.launchPig()}} when an MR plan is launched. Some scripts (e.g. those with an exec in the middle) will cause multiple plans to be launched. In these cases, jobs launched from the same script can have different {{pig.script.submitted.timestamp}} values, which is a bug. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL
[ https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3228: Fix Version/s: (was: 0.12.0) 0.13.0 SearchEngineExtractor throws an exception on a malformed URL Key: PIG-3228 URL: https://issues.apache.org/jira/browse/PIG-3228 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.13.0 Attachments: SearchEngineExtractor_Malformed.patch This UDF throws an exception on any MalformedURLException This change is consistent with SearchTermExtractor's handling of MalformedURLException, which also catches the exception and returns null -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions
[ https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-3229: Fix Version/s: (was: 0.12.0) 0.13.0 SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions --- Key: PIG-3229 URL: https://issues.apache.org/jira/browse/PIG-3229 Project: Pig Issue Type: Improvement Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.13.0 Attachments: SearchEngineExtractor_Counter.patch, SearchTermExtractor_Counter.patch SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and return null They should log a counter of those errors The patch for SearchEngineExtractor is really only relevant if the following bug is accepted https://issues.apache.org/jira/browse/PIG-3228 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (PIG-3469) Skewed join can cause unrecoverable NullPointerException when one of its inputs is missing.
[ https://issues.apache.org/jira/browse/PIG-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jarek Jarcec Cecho reassigned PIG-3469: --- Assignee: Jarek Jarcec Cecho Skewed join can cause unrecoverable NullPointerException when one of its inputs is missing. --- Key: PIG-3469 URL: https://issues.apache.org/jira/browse/PIG-3469 Project: Pig Issue Type: Bug Affects Versions: 0.11 Environment: Apache Pig version 0.11.0-cdh4.4.0 Happens in both local execution environment (os x) and cluster environment (linux) Reporter: Christon DeWan Assignee: Jarek Jarcec Cecho Run this script in the local execution environment (affects cluster mode too):
{noformat}
%declare DATA_EXISTS /tmp/test_data_exists.tsv
%declare DATA_MISSING /tmp/test_data_missing.tsv
%declare DUMMY `bash -c '(for (( i=0; \$i < 10; i++ )); do echo \$i; done) > /tmp/test_data_exists.tsv; true'`
exists = LOAD '$DATA_EXISTS' AS (a:long);
missing = LOAD '$DATA_MISSING' AS (a:long);
missing = FOREACH ( GROUP missing BY a ) GENERATE $0 AS a, COUNT_STAR($1);
joined = JOIN exists BY a, missing BY a USING 'skewed';
STORE joined INTO '/tmp/test_out.tsv';
{noformat}
Results in NullPointerException which halts entire pig execution, including unrelated jobs. Expected: only dependencies of the error'd LOAD statement should fail. Error:
{noformat}
2013-09-18 11:42:31,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
2013-09-18 11:42:31,518 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:848)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:294)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1266)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1251)
at org.apache.pig.PigServer.execute(PigServer.java:1241)
at org.apache.pig.PigServer.executeBatch(PigServer.java:335)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:137)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.adjustNumReducers(JobControlCompiler.java:868)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:480)
... 17 more
{noformat}
Script above is as small as I can make it while still reproducing the issue. Removing the group-foreach causes the join to fail harmlessly (not stopping pig execution), as does using the default join. Did not occur on 0.10.1. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3469) Skewed join can cause unrecoverable NullPointerException when one of its inputs is missing.
[ https://issues.apache.org/jira/browse/PIG-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778225#comment-13778225 ] Jarek Jarcec Cecho commented on PIG-3469: - I believe that I do have understanding of this issue, will upload patch after running all tests. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3452) Framework to fail-fast jobs based on exceptions in UDFs (useful for assert)
[ https://issues.apache.org/jira/browse/PIG-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-3452: Fix Version/s: (was: 0.12.0) 0.13.0 Framework to fail-fast jobs based on exceptions in UDFs (useful for assert) --- Key: PIG-3452 URL: https://issues.apache.org/jira/browse/PIG-3452 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.11.1 Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.13.0 The idea is to add an exception type to Pig that UDFs can throw to indicate an unexpected, unrecoverable problem. If we see n of these exceptions, the Pig client can kill the job and abort itself. n can be configured via configuration properties at runtime. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
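The mechanism described above can be sketched roughly as follows; the names UnrecoverableUdfException and FailFastMonitor are hypothetical illustrations of the idea (a dedicated unrecoverable-exception type plus a configurable threshold n), not Pig's actual API:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: UDFs throw this for unexpected, unrecoverable problems.
class UnrecoverableUdfException extends RuntimeException {
    UnrecoverableUdfException(String msg) { super(msg); }
}

// Counts unrecoverable failures and signals abort once n of them are seen.
class FailFastMonitor {
    private final int threshold;                       // n, read from configuration at runtime
    private final AtomicInteger seen = new AtomicInteger();

    FailFastMonitor(int threshold) { this.threshold = threshold; }

    // Record one failure; returns true once the job should be killed.
    boolean record(UnrecoverableUdfException e) {
        return seen.incrementAndGet() >= threshold;
    }
}
```

In practice the count would travel through a Hadoop counter so the client can observe failures across tasks and abort the whole job.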
[jira] [Updated] (PIG-3421) Script jars should be added to extra jars instead of pig's job.jar
[ https://issues.apache.org/jira/browse/PIG-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-3421: Fix Version/s: (was: 0.12.0) 0.13.0 Script jars should be added to extra jars instead of pig's job.jar -- Key: PIG-3421 URL: https://issues.apache.org/jira/browse/PIG-3421 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.13.0 Currently, for all the script engines, pig adds script jars to pig's job jar even without consulting the skipJars list. Ideally, we should add these to extraJars so that they can benefit from distributed cache. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3427) Columns pruning does not work with DereferenceExpression
[ https://issues.apache.org/jira/browse/PIG-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi updated PIG-3427: Fix Version/s: (was: 0.12.0) 0.13.0 Columns pruning does not work with DereferenceExpression Key: PIG-3427 URL: https://issues.apache.org/jira/browse/PIG-3427 Project: Pig Issue Type: Bug Affects Versions: 0.11.1 Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.13.0 The following script does not push the projection:
{code}
a = load 'something' as (a0, a1);
b = group a all;
c = foreach b generate COUNT(a.a0);
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-3300) Optimize partition filter pushdown
[ https://issues.apache.org/jira/browse/PIG-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Mokashi resolved PIG-3300. - Resolution: Duplicate Optimize partition filter pushdown -- Key: PIG-3300 URL: https://issues.apache.org/jira/browse/PIG-3300 Project: Pig Issue Type: Improvement Affects Versions: 0.11 Reporter: Rohini Palaniswamy Assignee: Aniket Mokashi When an AND/OR condition involves a combination of partition and non-partition columns, like (pcond1 and npcond1) or (pcond2 and npcond2), push the partition filter (pcond1 or pcond2) to the LoadFunc. We will still apply the whole filter condition to the loaded data. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
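The rewrite described here can be illustrated with a small sketch (this is not Pig's planner code): represent the filter as an OR of AND-branches, keep only the partition-column conjuncts in each branch, and give up if any branch has none, since such a branch would match every partition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class PartitionPushdownSketch {
    // orOfAnds: each inner list is one AND-branch of the OR.
    // isPartitionCond: true if a conjunct touches only partition columns.
    // Returns the pushable filter in the same OR-of-ANDs shape.
    static List<List<String>> pushable(List<List<String>> orOfAnds,
                                       Predicate<String> isPartitionCond) {
        List<List<String>> pushed = new ArrayList<>();
        for (List<String> branch : orOfAnds) {
            List<String> kept = new ArrayList<>();
            for (String conjunct : branch) {
                if (isPartitionCond.test(conjunct)) {
                    kept.add(conjunct);
                }
            }
            // A branch with no partition conjunct matches all partitions,
            // so no filter can safely be pushed to the LoadFunc at all.
            if (kept.isEmpty()) {
                return Collections.emptyList();
            }
            pushed.add(kept);
        }
        return pushed;
    }
}
```

For (pcond1 and npcond1) or (pcond2 and npcond2) this yields (pcond1) or (pcond2), and the full original filter is still applied to the loaded rows afterwards, so the pushdown only over-approximates.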
[jira] [Updated] (PIG-2641) Create toJSON function for all complex types: tuples, bags and maps
[ https://issues.apache.org/jira/browse/PIG-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-2641: Fix Version/s: (was: 0.12.0) 0.13.0 Create toJSON function for all complex types: tuples, bags and maps --- Key: PIG-2641 URL: https://issues.apache.org/jira/browse/PIG-2641 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.12.0 Environment: Foggy. Damn foggy. Reporter: Russell Jurney Assignee: Russell Jurney Labels: chararray, fun, happy, input, json, output, pants, pig, piggybank, string, wonderdog Fix For: 0.13.0 Attachments: PIG-2641-2.patch, PIG-2641-3.patch, PIG-2641-4.patch, PIG-2641-5.patch, PIG-2641-6.patch, PIG-2641.patch Original Estimate: 96h Remaining Estimate: 96h It is a travesty that there are no UDFs in Piggybank that, given an arbitrary Pig datatype, return a JSON string of same. I intend to fix this problem. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Subscription: PIG patch available
Issue Subscription Filter: PIG patch available (13 issues) Subscriber: pigdaily
Key Summary
PIG-3470 Print configuration variables in grunt https://issues.apache.org/jira/browse/PIG-3470
PIG-3451 EvalFunc<T> ctor reflection to determine value of type param T is brittle https://issues.apache.org/jira/browse/PIG-3451
PIG-3449 Move JobCreationException to org.apache.pig.backend.hadoop.executionengine https://issues.apache.org/jira/browse/PIG-3449
PIG-3441 Allow Pig to use default resources from Configuration objects https://issues.apache.org/jira/browse/PIG-3441
PIG-3434 Null subexpression in bincond nullifies outer tuple (or bag) https://issues.apache.org/jira/browse/PIG-3434
PIG-3388 No support for Regex for row filter in org.apache.pig.backend.hadoop.hbase.HBaseStorage https://issues.apache.org/jira/browse/PIG-3388
PIG-3325 Adding a tuple to a bag is slow https://issues.apache.org/jira/browse/PIG-3325
PIG-3292 Logical plan invalid state: duplicate uid in schema during self-join to get cross product https://issues.apache.org/jira/browse/PIG-3292
PIG-3257 Add unique identifier UDF https://issues.apache.org/jira/browse/PIG-3257
PIG-3117 A debug mode in which pig does not delete temporary files https://issues.apache.org/jira/browse/PIG-3117
PIG-3088 Add a builtin udf which removes prefixes https://issues.apache.org/jira/browse/PIG-3088
PIG-3021 Split results missing records when there is null values in the column comparison https://issues.apache.org/jira/browse/PIG-3021
PIG-2672 Optimize the use of DistributedCache https://issues.apache.org/jira/browse/PIG-2672
You may edit this subscription at: https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384
[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs
[ https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3370: --- Fix Version/s: (was: 0.13.0) 0.12.0 Add New Reserved Keywords To The Pig Docs - Key: PIG-3370 URL: https://issues.apache.org/jira/browse/PIG-3370 Project: Pig Issue Type: Task Components: documentation, parser Reporter: Sergey Goder Assignee: Cheolsoo Park Priority: Trivial Fix For: 0.12.0 Attachments: PIG-3370-1.patch The following are reserved keywords in Pig that are not included in the 11.1 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords) cube, dense, rank, returns, rollup, void Please add to any that I may have overlooked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs
[ https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3370: --- Status: Patch Available (was: Open) Add New Reserved Keywords To The Pig Docs - Key: PIG-3370 URL: https://issues.apache.org/jira/browse/PIG-3370 Project: Pig Issue Type: Task Components: documentation, parser Reporter: Sergey Goder Assignee: Cheolsoo Park Priority: Trivial Fix For: 0.12.0 Attachments: PIG-3370-1.patch The following are reserved keywords in Pig that are not included in the 11.1 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords) cube, dense, rank, returns, rollup, void Please add to any that I may have overlooked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3370) Add New Reserved Keywords To The Pig Docs
[ https://issues.apache.org/jira/browse/PIG-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3370: --- Attachment: PIG-3370-1.patch The patch adds dense, returns, rollup, and void to the reserved keywords. Add New Reserved Keywords To The Pig Docs - Key: PIG-3370 URL: https://issues.apache.org/jira/browse/PIG-3370 Project: Pig Issue Type: Task Components: documentation, parser Reporter: Sergey Goder Assignee: Cheolsoo Park Priority: Trivial Fix For: 0.13.0 Attachments: PIG-3370-1.patch The following are reserved keywords in Pig that are not included in the 11.1 docs (see http://pig.apache.org/docs/r0.11.1/basic.html#reserved-keywords) cube, dense, rank, returns, rollup, void Please add to any that I may have overlooked. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3483) Document ASSERT keyword
Cheolsoo Park created PIG-3483: -- Summary: Document ASSERT keyword Key: PIG-3483 URL: https://issues.apache.org/jira/browse/PIG-3483 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.12.0 Reporter: Cheolsoo Park Fix For: 0.12.0 PIG-3367 added a new keyword ASSERT, so we need to document it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3367) Add assert keyword (operator) in pig
[ https://issues.apache.org/jira/browse/PIG-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778454#comment-13778454 ] Cheolsoo Park commented on PIG-3367: [~aniket486], would you mind updating the documentation - PIG-3483? Add assert keyword (operator) in pig Key: PIG-3367 URL: https://issues.apache.org/jira/browse/PIG-3367 Project: Pig Issue Type: New Feature Components: parser Reporter: Aniket Mokashi Assignee: Aniket Mokashi Fix For: 0.12.0 Attachments: PIG-3367-2.patch, PIG-3367.patch The assert operator can be used for data validation. With assert you can write a script as follows:
{code}
a = load 'something' as (a0:int, a1:int);
assert a by a0 > 0, 'a cant be negative for reasons';
{code}
This script will fail if the assert is violated. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3484) Make the size of pig.script property configurable
Cheolsoo Park created PIG-3484: -- Summary: Make the size of pig.script property configurable Key: PIG-3484 URL: https://issues.apache.org/jira/browse/PIG-3484 Project: Pig Issue Type: Improvement Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 Some applications (e.g. Lipstick) use the pig.script property to display the script. But since its size is limited by a hard-coded max, it's not always possible to store an entire script. It would be nicer if the size of pig.script is configurable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
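A minimal sketch of the proposal, assuming a property named pig.script.max.size with default 10,240 (the name and default mentioned in the attachment comment for this issue); the helper class is hypothetical, not Pig's actual implementation:

```java
import java.util.Properties;

// Hypothetical helper: bound the script text stored in the pig.script
// job property by a configurable maximum instead of a hard-coded one.
class ScriptPropertySketch {
    static String forJobConf(String script, Properties conf) {
        // Read the configurable cap, falling back to the old hard-coded limit.
        int max = Integer.parseInt(conf.getProperty("pig.script.max.size", "10240"));
        return script.length() > max ? script.substring(0, max) : script;
    }
}
```

Applications like Lipstick that read pig.script back out would then simply raise the property in their site configuration to capture longer scripts.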
[jira] [Commented] (PIG-3390) Make pig working with HBase 0.95
[ https://issues.apache.org/jira/browse/PIG-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778475#comment-13778475 ] Ashish Singh commented on PIG-3390: --- It looks like this patch only works with HBase compiled against hadoop1, as there is no dependency defined for hbase-hadoop2-compat. Make pig working with HBase 0.95 Key: PIG-3390 URL: https://issues.apache.org/jira/browse/PIG-3390 Project: Pig Issue Type: New Feature Affects Versions: 0.11 Reporter: Jarek Jarcec Cecho Assignee: Jarek Jarcec Cecho Fix For: 0.12.0 Attachments: PIG-3390.patch, PIG-3390.patch, PIG-3390.patch HBase 0.95 changed its API in an incompatible way. The following APIs that {{HBaseStorage}} in Pig uses are no longer available:
* {{Mutation.setWriteToWAL(Boolean)}}
* {{Scan.write(DataOutput)}}
In addition, HBase is no longer available as one monolithic archive with the entire functionality, but was broken down into smaller pieces such as {{hbase-client}}, {{hbase-server}}, ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3484) Make the size of pig.script property configurable
[ https://issues.apache.org/jira/browse/PIG-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3484: --- Attachment: PIG-3484-1.patch The attached patch adds a new property pig.script.max.size. The default value is 10,240. Make the size of pig.script property configurable - Key: PIG-3484 URL: https://issues.apache.org/jira/browse/PIG-3484 Project: Pig Issue Type: Improvement Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 Attachments: PIG-3484-1.patch Some applications (e.g. Lipstick) use the pig.script property to display the script. But since its size is limited by a hard-coded max, it's not always possible to store an entire script. It would be nicer if the size of pig.script is configurable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3484) Make the size of pig.script property configurable
[ https://issues.apache.org/jira/browse/PIG-3484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3484: --- Status: Patch Available (was: Open) Make the size of pig.script property configurable - Key: PIG-3484 URL: https://issues.apache.org/jira/browse/PIG-3484 Project: Pig Issue Type: Improvement Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 Attachments: PIG-3484-1.patch Some applications (e.g. Lipstick) use the pig.script property to display the script. But since its size is limited by a hard-coded max, it's not always possible to store an entire script. It would be nicer if the size of pig.script is configurable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3485) Remove CastUtils.bytesToMap() method from LoadCaster interface
Cheolsoo Park created PIG-3485: -- Summary: Remove CastUtils.bytesToMap() method from LoadCaster interface Key: PIG-3485 URL: https://issues.apache.org/jira/browse/PIG-3485 Project: Pig Issue Type: Task Components: impl Reporter: Cheolsoo Park Assignee: Cheolsoo Park Fix For: 0.13.0 PIG-1876 added typed map and annotated the following method as {{deprecated}} in 0.9:
{code}
@Deprecated
public Map<String, Object> bytesToMap(byte[] b) throws IOException;
{code}
We should remove it and replace it with the new method that takes type information:
{code}
public Map<String, Object> bytesToMap(byte[] b, ResourceFieldSchema fieldSchema) throws IOException;
{code}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
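For illustration only (this is not Pig's actual CastUtils code), a rough parser for PigStorage's textual map representation, [key#value,...], shows what bytesToMap does; the typed variant would additionally use the supplied field schema to cast each value from String to the declared value type:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BytesToMapSketch {
    // Parse "[k1#v1,k2#v2]" into a map. Values stay as Strings here; with a
    // field schema, each value would be cast to the schema's value type.
    // Nested maps and escaped separators are ignored in this rough sketch.
    static Map<String, Object> bytesToMap(byte[] b) {
        Map<String, Object> m = new LinkedHashMap<>();
        String s = new String(b);
        if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
            return m;                         // not a textual map; nothing to parse
        }
        String body = s.substring(1, s.length() - 1);
        if (body.isEmpty()) {
            return m;                         // empty map "[]"
        }
        for (String entry : body.split(",")) {
            String[] kv = entry.split("#", 2);
            m.put(kv[0], kv.length > 1 ? kv[1] : null);
        }
        return m;
    }
}
```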