[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791612#action_12791612 ]

Hadoop QA commented on PIG-760:
---
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12428185/pigstorageschema_7.patch
against trunk revision 890596.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/131/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/131/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/131/console

This message is automatically generated.

Serialize schemas for PigStorage() and other storage types.
---
Key: PIG-760
URL: https://issues.apache.org/jira/browse/PIG-760
Project: Pig
Issue Type: New Feature
Reporter: David Ciemiewicz
Assignee: Dmitriy V. Ryaboy
Fix For: 0.7.0
Attachments: pigstorageschema-2.patch, pigstorageschema.patch, pigstorageschema_3.patch, pigstorageschema_4.patch, pigstorageschema_5.patch, pigstorageschema_7.patch, TEST-org.apache.pig.piggybank.test.TestPigStorageSchema.txt

I'm finding PigStorage() really convenient for storage and data interchange because it compresses well and imports cleanly into Excel and other analysis environments. However, it is a pain when it comes to maintenance, because the columns are in fixed locations and I'd like to add columns in some cases.

It would be great if a PigStorage() load could read a default schema from a .schema file stored with the data, and if a PigStorage() store could write a .schema file with the data. I have tested this out, and both Hadoop HDFS and Pig in -exectype local mode will ignore a file called .schema in a directory of part files.

So, for example, if I have a chain of Pig scripts I execute such as:

A = load 'data-1' using PigStorage() as ( a: int , b: int );
store A into 'data-2' using PigStorage();
B = load 'data-2' using PigStorage();
describe B;

then describe B should output something like { a: int, b: int }

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
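The sidecar-file idea above can be sketched in a few lines of Java. This is illustrative only: the class and method names, and the plain-text schema format, are assumptions made for the sketch, not the patch's actual JsonMetadata code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the proposed mechanism: the storer drops a ".schema" sidecar
// file next to the part files, and the loader picks it up to recover field
// names and types without an explicit AS clause. Names are hypothetical.
public class SchemaSidecar {

    // Write the schema string alongside the data directory.
    static void storeSchema(Path dataDir, String schema) throws IOException {
        Files.createDirectories(dataDir);
        Files.write(dataDir.resolve(".schema"), schema.getBytes("UTF-8"));
    }

    // Read the schema back if present; null means "no schema known",
    // which matches today's behavior of an un-annotated load.
    static String loadSchema(Path dataDir) throws IOException {
        Path p = dataDir.resolve(".schema");
        return Files.exists(p) ? new String(Files.readAllBytes(p), "UTF-8") : null;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("data-2");
        storeSchema(dir, "a: int, b: int");
        System.out.println(loadSchema(dir)); // prints: a: int, b: int
    }
}
```

The property the reporter verified by hand -- that a dot-prefixed file sitting among part files is ignored by HDFS and by Pig in local mode -- is what would make this layout backward compatible with existing readers.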
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791079#action_12791079 ]

Alan Gates commented on PIG-760:
---
I reran the test; it looks fine. Given that LoadMetadata et al. have moved to experimental, and PigStorageSchema to PiggyBank, it doesn't seem to me that JsonMetadata belongs in builtin. Are you OK with me moving it to experimental as part of applying the patch?
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788265#action_12788265 ]

Hadoop QA commented on PIG-760:
---
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12427459/pigstorageschema_5.patch
against trunk revision 888704.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated 1 warning message.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/109/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/109/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/109/console
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788300#action_12788300 ]

Dmitriy V. Ryaboy commented on PIG-760:
---
The core test failure is in junit.framework -- it doesn't seem related. Can someone confirm this is just Hudson acting up? Here's the error report:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/109/testReport/junit.framework/TestSuite$1/warning/

The javadoc fix is trivial; I'm holding off on uploading a patch in case I need to do something about the JUnit test failure.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784441#action_12784441 ]

Dmitriy V. Ryaboy commented on PIG-760:
---
Right.. now that PigStorageSchema is in the piggybank, I need to use the full package name when referring to it. I'll update this ticket with a fix, and move the new interfaces to experimental. This probably won't happen until next week.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783890#action_12783890 ]

Alan Gates commented on PIG-760:
---
Is this patch ready to be reviewed and checked in, or is it still in development?
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783911#action_12783911 ]

Dmitriy V. Ryaboy commented on PIG-760:
---
Ready
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783979#action_12783979 ]

Alan Gates commented on PIG-760:
---
Question for the other Pig committers: Dmitriy proposes with this patch to include and begin using some of the load-store redesign changes (see PIG-966). Specifically, he includes versions of ResourceSchema, ResourceStatistics, LoadMetadata, and StoreMetadata. Currently these are also being implemented on the load-store-redesign branch, with the assumption that they'll be rolled into trunk for the 0.7 (or possibly a later) release. He wants to include these new classes in this patch because he is using them for the cost-based optimizer he is working on.

Are we OK with introducing these classes now, given that we know they are still under development and thus not yet stable? I am, if it is done with the stipulation that they will certainly change before they are officially released. To make this clear to developers, I suggest moving them into a package org.apache.pig.experimental, to make the fact that they are not yet stable obvious. Thoughts?
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783995#action_12783995 ]

Pradeep Kamath commented on PIG-760:
---
I agree that these classes should go into org.apache.pig.experimental as part of this patch. The only issue I see: when the load-store-redesign branch's changes are eventually merged back to trunk, the code in this patch will need to use the right package to refer to the classes implemented in core Pig. With this patch committed, we will need a reminder to make that change later -- do we need a companion JIRA to track it?
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783188#action_12783188 ]

Hadoop QA commented on PIG-760:
---
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12426297/pigstorageschema_4.patch
against trunk revision 884235.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/64/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/64/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/64/console
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782978#action_12782978 ]

Hadoop QA commented on PIG-760:
---
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12426186/pigstorageschema_3.patch
against trunk revision 884235.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to cause Findbugs to fail.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/61/testReport/
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/61/console
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782759#action_12782759 ]

Dmitriy V. Ryaboy commented on PIG-760:
---
Argh. I totally missed Alan's comment before creating this new patch. To address Alan's concerns: the patch doesn't break any current interfaces; it just adds a few new ones, which aren't public. I can commit to updating this issue as the definitions of the interfaces change. Where my implementation of the spec differs from the code in the load-store redesign branch, I believe it's because I am actually writing code that uses stats and schema loaders, and these changes should be considered for the branch.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776065#action_12776065 ]

Dmitriy V. Ryaboy commented on PIG-760:
---
Alan et al. -- Could this be considered for trunk and 0.6, assuming I bring it up to parity with the interfaces as they are currently defined in the Load/Store redesign branch? Or do you want to wait for the redesign to be complete before this can go in? (Naturally, I prefer the former.)
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776086#action_12776086 ]

Alan Gates commented on PIG-760:
---
The issue is that we want to break interfaces only once, so we don't want to introduce any of these interfaces now. The load/store redesign obviously won't be going into 0.6. The other issue is that any classes and interfaces introduced now from the redesign are inherently unstable. So even if we just sneak in ResourceSchema and ResourceStatistics, which won't break anything, I doubt they'll look the same once the redesign is done. And I certainly don't want to be bound to any backward compatibility for those classes between 0.6 and the redesign.

I suggest that you build your own version of these classes and use them in your load/store functions and your optimizer. Then when the redesign comes out, your code can switch. As we'd change the classes anyway, I don't think you're creating any extra work for yourself.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12770573#action_12770573 ] Alan Gates commented on PIG-760: I know I'm wandering dangerously close to being fanatical here, but I really dislike taking a struct, making all the members private/protected, and then adding getters and setters. If some tools need getters and setters, feel free to add them. But please leave the members public. I notice you snuck in your names for LoadMetadata and StoreMetadata. I'm fine with motions to change the names. But let's get everyone to agree on the new names before we start using them. On the StoreMetadata interface, Pradeep had some thoughts on getting rid of it, as he felt all the necessary information could be communicated in StoreFunc.allFinished(). He should be publishing an update to the load/store redesign wiki ( http://wiki.apache.org/pig/LoadStoreRedesignProposal ) soon. He also wanted to change LoadMetadata.getSchema() to take a location so that the loader could find the file. Other changes all look good. One general thought: I want to figure out how to keep the ResourceStatistics object flexible enough that it's easy to add new statistics to it. One thought I'd had previously (I can't remember if we discussed this or not) was to add a Map<String, Object> to it. That way we can add new stats between versions of the object. Once the stats are accepted as valid and take hold, they could be moved into the object proper. The upside of this is that it's flexible. The downside is that we risk devolving into an unknown-properties object, and every stat has to go through a transition. Thoughts?
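Alan's Map<String, Object> suggestion above can be sketched roughly as follows. This is a hypothetical stand-in for Pig's ResourceStatistics, not the actual class; the field and method names are illustrative only. Established stats live as named fields, experimental stats go into a side map, and a stat that "takes hold" is promoted into the object proper.

```python
class ResourceStatistics:
    """Hypothetical sketch of an extensible stats object (not Pig's real class)."""

    def __init__(self, num_records=None, size_in_bytes=None):
        # Established statistics live as proper, named fields.
        self.num_records = num_records
        self.size_in_bytes = size_in_bytes
        # Experimental statistics go into a side map (the Map<String, Object>
        # idea), so new stats can be added between versions without changing
        # the class definition.
        self.experimental = {}

    def promote(self, name):
        # Once a stat is accepted as valid, move it from the map into the
        # object proper -- the "transition" Alan mentions as a downside.
        setattr(self, name, self.experimental.pop(name))


stats = ResourceStatistics(num_records=1000)
stats.experimental["avg_record_size"] = 128  # a new, not-yet-official stat
stats.promote("avg_record_size")             # later, it becomes a real field
```

The tradeoff is exactly as described: the map is flexible across versions, but consumers can't rely on a stat's presence or type until it has been promoted.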
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12770618#action_12770618 ] Dmitriy V. Ryaboy commented on PIG-760: --- bq. If some tools need getters and setters, feel free to add them. But please leave the members public. I'll revert the change. bq. I notice you snuck in your names for LoadMetadata and StoreMetadata. I'm fine with motions to change the names. But let's get everyone to agree on the new names before we start using them. Yeah, I kind of figured we'd get to discuss it again if I did that :-). It seems like we didn't really reach a final decision last time. Are we sure the only time it might be reasonable to read or write metadata is during loads and stores? I am not. I can envision future uses where the storage is some ephemeral state that we have operators reporting stats into, to enable adaptive optimizations. Also, and I know I am nitpicking, LoadMetadata is an instruction, whereas MetadataLoader is a thing. Same with StoreMetadata and MetadataStorer (but "storer" isn't a real word, so I chose Writer...). bq. On the StoreMetadata interface, Pradeep had some thoughts on getting rid of it, as he felt all the necessary information could be communicated in StoreFunc.allFinished(). He should be publishing an update to the load/store redesign wiki ( http://wiki.apache.org/pig/LoadStoreRedesignProposal ) soon. I was envisioning the setStatistics() and setSchema() methods as methods used to alter state, whereas allFinished() essentially does the job of flushing whatever is needed (you'll notice I fake an allFinished() method in my finish() implementation by simply checking whether any other task has started creating the necessary file yet -- a suboptimal workaround, but the best that can be done with the current interface). bq. He also wanted to change LoadMetadata.getSchema() to take a location so that the loader could find the file.
A location by itself may not be sufficient -- for example, for the JsonMetadata implementation, I need the DataStorage as well. I solved that by passing the location and storage into JsonMetadata's constructor. There is something to be said for being able to reuse the same MetadataLoader to load schemas for multiple locations, however. Assuming we can't come up with any scenarios where, by the time we need to get the schema, we no longer have the location -- but we might have created the MetadataLoader beforehand and set the internal location at that time -- I agree with the change. bq. One thought I'd had previously (I can't remember if we discussed this or not) was to add a Map<String, Object> to it I have a feeling we did discuss this, or something like this, possibly in a different context, but I can't find the mention either. I am not sure what we would gain by this -- the only consumers of stats would be various optimizers/compilers/translators, right? So they would need to be updated to deal with new stats, and code that propagates / estimates stats down a logical plan would need to be updated, whenever a new statistic is added. That sounds pretty extensive. If we instead assume that any field is nullable (or, if a collection, can be empty), and make sure that all missing fields are filled in as nulls/empties when the stat objects are deserialized, we should be OK with upgrades.
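Dmitriy's alternative above -- every field nullable, missing fields filled in as nulls/empties on deserialization -- can be sketched like this. The field names and the JSON encoding are illustrative assumptions, not Pig's actual stats format.

```python
import json

# Fields the current (hypothetical) reader version knows about, with the
# default used when an older writer omitted the field: None for scalars,
# empty for collections.
STAT_FIELDS = {"num_records": None, "size_in_bytes": None, "sorted_keys": []}

def deserialize_stats(payload):
    """Deserialize a stats payload, filling in fields missing from older writers.

    Because every scalar defaults to None and every collection to empty, a
    reader upgraded with new stat fields can still consume payloads written
    before those fields existed -- no coordinated transition needed.
    """
    raw = json.loads(payload)
    return {field: raw.get(field, default) for field, default in STAT_FIELDS.items()}

# A writer that predates 'size_in_bytes' and 'sorted_keys' omits them;
# the reader fills them in as null/empty.
stats = deserialize_stats('{"num_records": 500}')
```

Consumers then just have to tolerate nulls/empties, rather than being rewritten each time a statistic is added.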
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769141#action_12769141 ] Hadoop QA commented on PIG-760: --- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12422958/pigstorageschema-2.patch against trunk revision 828891. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 403 javac compiler warnings (more than the trunk's current 401 warnings). +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/112/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/112/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/112/console This message is automatically generated.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769435#action_12769435 ] Dmitriy V. Ryaboy commented on PIG-760: --- The Javac warnings are about JobConf deprecation. There is a separate patch in the queue to turn this warning off until migration to the new APIs is finished.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768956#action_12768956 ] Dmitriy V. Ryaboy commented on PIG-760: --- David, If / when I get complex schemas to work, this could theoretically be promoted to PigStorage proper, which would be cool. For now, if you try to deserialize a complex schema, everything blows up... so that's not so good (especially since I let you serialize complex schemas! Actually, maybe I should turn that off). I'll add some docs on the next iteration, good call. Briefly -- it's a JSON representation of the ResourceSchema, as described on the LoadStore redesign proposal: http://wiki.apache.org/pig/LoadStoreRedesignProposal . Once you know what the fields are, it's pretty easy to read; the one complexity is that types are represented using constants from the DataType class, which are not publicly documented.
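To make the "JSON representation of the ResourceSchema" concrete, here is a guessed sketch of what a .schema file for "(a: int, b: int)" might contain and how a reader could render it the way describe does. Both the JSON field names and the numeric type constants are assumptions in the spirit of the patch; the real layout is defined by ResourceSchema, and the real constants live in Pig's DataType class.

```python
import json

# Hypothetical numeric type constants standing in for Pig's DataType class;
# the actual values are in org.apache.pig.data.DataType and may differ.
TYPE_NAMES = {10: "int", 15: "long", 25: "double", 55: "chararray"}

# A guessed example of a JSON-serialized schema for "(a: int, b: int)";
# the real field layout is defined by ResourceSchema.
schema_json = '{"fields": [{"name": "a", "type": 10}, {"name": "b", "type": 10}]}'

def describe(schema_json):
    """Render a deserialized schema roughly the way 'describe' prints it."""
    fields = json.loads(schema_json)["fields"]
    body = ", ".join("%s: %s" % (f["name"], TYPE_NAMES[f["type"]]) for f in fields)
    return "{" + body + "}"
```

This illustrates the comment's point: the JSON is readable once you know the field names, but mapping the numeric type codes back to type names requires knowing the DataType constants.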
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12767862#action_12767862 ] Alan Gates commented on PIG-760: I don't take javac or findbugs warnings as final truth. If you can give a good reason why the warning is wrong, not relevant, or you've chosen to take that risk to get some other benefit (such as you're not doing instanceof before a cast for performance and you believe the risk acceptable) then put that in comments and suppress the warning in the code.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12767605#action_12767605 ] Hadoop QA commented on PIG-760: --- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12422565/pigstorageschema.patch against trunk revision 826110. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. -1 javac. The applied patch generated 410 javac compiler warnings (more than the trunk's current 408 warnings). -1 findbugs. The patch appears to introduce 15 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/101/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/101/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/101/console This message is automatically generated.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12767651#action_12767651 ] Dmitriy V. Ryaboy commented on PIG-760: --- Hmm... I'll check out the javac warnings. What do I do if I disagree with 14 out of 15 Findbugs warnings? It wants me to copy objects in getters/setters, but I don't think that's necessary in this case. Committers? Also -- if someone gave me the ability to assign JIRAs to myself, that would be great...
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765603#action_12765603 ] Alan Gates commented on PIG-760: At this point no one has contributed a PigStorageSchema as suggested above. We remain open to such a contribution if someone has the time.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765626#action_12765626 ] Dmitriy V. Ryaboy commented on PIG-760: --- This would be a nice proof-of-concept task for the new Load/StoreMetadata interfaces, as it removes the complexity of dealing with something like Owl.
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763294#action_12763294 ] mark meissonnier commented on PIG-760: -- Any new development on this issue? I'm finding it painful to have to modify the input schema to all child pig scripts anytime I modify my root pig script. I was thinking of developing something quick and then I figured someone might have done something or I could help the overall effort. Please let me know. Thanks
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697631#action_12697631 ] David Ciemiewicz commented on PIG-760: -- Sure, you could do that -- create PigStorageSchema. The thing is, I don't think it is necessary, and it is possible to do this in a backward-compatible way. First, if the user specifies a LOAD ... AS clause schema, then PigStorage could simply use that casting to override what is in the .schema. Of course, PigStorage might want to warn that there is an override at run time, or do a smart warning only if there are incompatible differences between the serialized schema and the explicit AS clause schema. Next, is there really any harm in creating the serialized schema file on each and every STORE? Finally, why subclass when we could parameterize? In other words, instead of writing:

store A into 'file' using PigStorageSchema();

why not do:

store A into 'file' using PigStorage('schema=yes'); -- redundant, schema=yes is the default

I think it would be more useful to have single classes with parameterized options than a proliferation of classes. Or, better yet, why can't I just define the behavior of PigStorage() for all of the instances in my script:

define PigStorage PigStorage( 'sep=\t', 'schema=yes', 'erroronmissingcolumn=no' );

I have recently done similar things for other functions, and it turns out to be a nice way of capturing global parameterizations for cleaner Pig code.
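The parameterized-constructor idea in David's comment -- PigStorage('sep=\t', 'schema=yes', ...) -- amounts to parsing 'key=value' constructor arguments against a set of defaults. A minimal sketch of that parsing, with the option names taken from his example and the defaults assumed:

```python
# Defaults are assumptions; the option names ('sep', 'schema',
# 'erroronmissingcolumn') come from the comment's example.
DEFAULTS = {"sep": "\t", "schema": "yes", "erroronmissingcolumn": "no"}

def parse_storage_options(*args):
    """Parse 'key=value' constructor arguments, overriding the defaults.

    Unknown keys raise an error so a typo like 'scheme=yes' fails loudly
    instead of being silently ignored.
    """
    options = dict(DEFAULTS)
    for arg in args:
        key, _, value = arg.partition("=")
        if key not in options:
            raise ValueError("unknown option: %s" % key)
        options[key] = value
    return options

# Equivalent of PigStorage('sep=,', 'schema=no'); unspecified options
# keep their defaults.
opts = parse_storage_options("sep=,", "schema=no")
```

This is the appeal of the approach: one class, a single options table, and a script-wide define can set the same options once for every use of PigStorage().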