[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3015:
---
    Attachment: PIG-3015-2.patch

Hi Joe,

I am attaching a patch that replaces {{createtests.py}} with Java code in {{TestAvroStorage.java}}. I did this for two reasons:
* Problem with generating {{evenFileNameTestDirectoryCounts.avro}}. Using the Avro Tool API addresses it.
* Better integration with JUnit. Now we can run the unit test with a single command ({{ant clean test -Dtestcase=TestAvroStorage}}). This auto-generates the input files before each test run and deletes them afterwards.

Please let me know what you think.

Rewrite of AvroStorage
--
    Key: PIG-3015
    URL: https://issues.apache.org/jira/browse/PIG-3015
    Project: Pig
    Issue Type: Improvement
    Components: piggybank
    Reporter: Joseph Adler
    Assignee: Joseph Adler
    Attachments: PIG-3015-2.patch, PIG-3015.patch

The current AvroStorage implementation has a lot of issues: it requires old versions of Avro, it copies data much more than needed, and it's verbose and complicated. (One pet peeve of mine is that old versions of Avro don't support Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new implementation is significantly faster, and the code is a lot simpler. Rewriting AvroStorage also enabled me to implement support for Trevni (as TrevniStorage).

I'm opening this ticket to facilitate discussion while I figure out the best way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
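The comment on PIG-3015-2.patch above describes generating test input files before each test run and deleting them afterwards. A minimal sketch of that pattern follows, using only the Java standard library. The class name {{GeneratedInputFixture}}, the file name {{records.json}}, and the placeholder JSON content are all hypothetical and are not taken from the patch; the actual test is described as writing Avro files and running under JUnit, where these two methods would roughly correspond to set-up and tear-down hooks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical fixture: creates test inputs before a run and removes them afterwards. */
public class GeneratedInputFixture {

    /** Stand-in for a test set-up hook: generate fresh input files in a temp directory. */
    public static Path setUp() {
        try {
            Path dir = Files.createTempDirectory("avrostorage-test");
            // Placeholder content (assumption); the patch is described as producing Avro files.
            Files.write(dir.resolve("records.json"),
                    Arrays.asList("{\"key\": \"a\", \"value\": 1}",
                                  "{\"key\": \"b\", \"value\": 2}"),
                    StandardCharsets.UTF_8);
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for a test tear-down hook: delete everything that setUp created. */
    public static void tearDown(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits files before the directory that contains them.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = setUp();
        try {
            System.out.println("generated: " + Files.exists(dir.resolve("records.json")));
        } finally {
            // Clean up even if the body of the "test" throws.
            tearDown(dir);
        }
        System.out.println("cleaned up: " + !Files.exists(dir));
    }
}
```

Keeping generation and clean-up next to the test code, rather than in a separate script, is what lets a single build command exercise the whole cycle.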
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (38 issues)

Subscriber: pigdaily

Key         Summary
PIG-3098    Add another test for the self join case
            https://issues.apache.org/jira/browse/PIG-3098
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3086    Allow A Prefix To Be Added To URIs In PigUnit Tests
            https://issues.apache.org/jira/browse/PIG-3086
PIG-3078    Make a UDF that, given a string, returns just the columns prefixed by that string
            https://issues.apache.org/jira/browse/PIG-3078
PIG-3073    POUserFunc creating log spam for large scripts
            https://issues.apache.org/jira/browse/PIG-3073
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3067    HBaseStorage should be split up to become more managable
            https://issues.apache.org/jira/browse/PIG-3067
PIG-3066    Fix TestPigRunner in trunk
            https://issues.apache.org/jira/browse/PIG-3066
PIG-3057    make readField protected to be able to override it if we extend PigStorage
            https://issues.apache.org/jira/browse/PIG-3057
PIG-3051    java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
            https://issues.apache.org/jira/browse/PIG-3051
PIG-3050    Fix FindBugs multithreading warnings
            https://issues.apache.org/jira/browse/PIG-3050
PIG-3029    TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
            https://issues.apache.org/jira/browse/PIG-3029
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3025    TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
            https://issues.apache.org/jira/browse/PIG-3025
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2957    TetsScriptUDF fail due to volume prefix in jar
            https://issues.apache.org/jira/browse/PIG-2957
PIG-2956    Invalid cache specification for some streaming statement
            https://issues.apache.org/jira/browse/PIG-2956
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2878    Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitive.
            https://issues.apache.org/jira/browse/PIG-2878
PIG-2873    Converting bin/pig shell script to python
            https://issues.apache.org/jira/browse/PIG-2873
PIG-2834    MultiStorage requires unused constructor argument
            https://issues.apache.org/jira/browse/PIG-2834
PIG-2824    Pushing checking number of fields into LoadFunc
            https://issues.apache.org/jira/browse/PIG-2824
PIG-2788    improved string interpolation of variables
            https://issues.apache.org/jira/browse/PIG-2788
PIG-2661    Pig uses an extra job for loading data in Pigmix L9
            https://issues.apache.org/jira/browse/PIG-2661
PIG-2645    PigSplit does not handle the case where SerializationFactory returns null
            https://issues.apache.org/jira/browse/PIG-2645
PIG-2614    AvroStorage crashes on LOADING a single bad error
            https://issues.apache.org/jira/browse/PIG-2614
PIG-2507    Semicolon in paramenters for UDF results in parsing error
            https://issues.apache.org/jira/browse/PIG-2507
PIG-2433    Jython import module not working if module path is in classpath
            https://issues.apache.org/jira/browse/PIG-2433
PIG-2417    Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
            https://issues.apache.org/jira/browse/PIG-2417
PIG-2362    Rework Ant build.xml to use macrodef instead of antcall
            https://issues.apache.org/jira/browse/PIG-2362
PIG-2312    NPE when relation and column share the same name and used in Nested Foreach
            https://issues.apache.org/jira/browse/PIG-2312
PIG-1942    script UDF (jython) should utilize the intended output schema
Re: Our release process
Hi Olga,

Daniel has cut the 0.10.1 branch and the release candidate is up for a vote. I think it's reasonable to expect such releases where there are a lot of bug fixes. In 0.10.1 there are 48 fixes in all. The list of bug fixes does not fit the criteria of having no workaround or failing silently. Does that work for you?

Santhosh

From: Olga Natkovich onatkov...@yahoo.com
To: dev@pig.apache.org dev@pig.apache.org
Sent: Monday, December 17, 2012 4:16 PM
Subject: Re: Our release process

Hi Jonathan,

I thought I answered your email last week but I just noticed that the answer did not come through. We tell users that it is coming in the next release. Now that Pig is quite mature and stable, we don't see much of this. Having more frequent releases definitely helps in this respect.

Olga

From: Jonathan Coveney jcove...@gmail.com
To: dev@pig.apache.org dev@pig.apache.org; Olga Natkovich onatkov...@yahoo.com
Sent: Thursday, December 13, 2012 1:14 PM
Subject: Re: Our release process

Olga,

A related but separate question: what do y'all do when there is a feature that is finished, but for an upcoming release? i.e., a feature in trunk, but not in 0.11 (which, let us assume, is stable).

Jon

2012/12/13 Olga Natkovich onatkov...@yahoo.com

Hi Julien,

I think for us at Yahoo to be able to run our releases directly from the branch we would need the guarantees that I proposed in my initial email and something that we agreed to last year. The only changes that go in are
- Failures without reasonable workarounds
- Silent failures.

My main concern with the proposal is that I do not believe that our current testing infra is robust/inclusive enough to catch errors. That's why I am hesitant about widening the scope.

I am fine with whatever outcome the majority of people agrees with. I am just saying that Yahoo will likely need a private branch if our rules are too relaxed.

Olga

- Original Message -
From: Julien Le Dem jul...@twitter.com
To: dev@pig.apache.org dev@pig.apache.org; Olga Natkovich onatkov...@yahoo.com
Cc:
Sent: Wednesday, December 12, 2012 4:54 PM
Subject: Re: Our release process

Agreed. The priority of a change is subjective as well.

My definition for inclusion on the release branch:
- Only bug fixes.
- Only if they have fairly understood repercussions (up to the committers who +/-1 as usual).
- If we thought it would not break things but still does (CI or externally reported failure) we revert it.

What do you want to add/change? Please reformulate those rules the way you like and let's see how we can converge. (Also, let's keep it short for clarity)

Julien

On Wed, Dec 12, 2012 at 11:08 AM, Olga Natkovich onatkov...@yahoo.com wrote:

Hi Julien,

I understand what you are trying to do and I can see that being able to make more fixes post release has value for some use cases.

My concern is that "things that do not destabilize the branch" is fairly subjective and also not always easy to ascertain beyond trivial changes. The only way I know to keep code stable is to limit the updates. Also we need to clearly state what the constraints are for post-release commits so that every user can decide whether it works for them.

Olga

From: Julien Le Dem jul...@twitter.com
To: dev@pig.apache.org dev@pig.apache.org
Sent: Wednesday, December 12, 2012 10:26 AM
Subject: Re: Our release process

I think we all agree here, let's not jump to conclusions.

Everything in this branch I am talking about is in Apache Pig. Everything we do in Pig is contributed. We have a branch for 0.11 where we keep merging the official 0.11 branch plus a few patches (and it will stay small) that are only in Apache TRUNK. The goal here is to help keep the release branch stable by not adding patches that are only useful to us. Having this branch allows us to fix anything quickly and redeploy to production. It is also what allows us to use the pig 0.11 branch in production before it is even released. This definitely benefits the community and helps make 0.11 stable. This is a very reasonable way to keep using a recent version of Pig in production.

Olga: My goal is to decrease the scope of what is going in the release branch and to make sure we add only bug fixes that are not making it unstable. I also think having a short definition of this helps, which is why I have been chiming in. Let us know how you want to decrease the scope. I'm just trying to simplify here.

Julien

On Tue, Dec 11, 2012 at 8:54 AM, Prashant Kommireddi prash1...@gmail.com wrote:

Share the same concern as Russell here. It is not great for the project if everyone takes the private-branch approach.

On Tue, Dec 11, 2012 at 8:33 AM, Russell Jurney russell.jur...@gmail.com wrote:

Wait. Ack. Do we
[jira] [Updated] (PIG-3015) Rewrite of AvroStorage
[ https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3015:
---
    Attachment: PIG-3015-3.patch

I am attaching a new patch that includes the following fixes:
# {{hadoop-core.jar}} is pulled in because it is a dependency of {{trevni-core.jar}}. This causes compile errors with hadoop-2.0.x. I fixed this as follows:
{code}
+<dependency org="org.apache.avro" name="trevni-avro" rev="${avro.version}"
+  conf="compile->default;checkstyle->master">
+  <exclude org="org.apache.hadoop" module="hadoop-core"/>
+</dependency>
{code}
# {{avro-tools.jar}} contains hadoop classes such as Configuration. This causes compile errors with hadoop-2.0.x. I fixed this as follows:
{code}
+<dependency org="org.apache.avro" name="avro-tools" rev="${avro.version}"
+  conf="test->default">
+  <artifact name="nodeps" type="jar"/>
+</dependency>
{code}

Now I can run {{TestAvroStorage}} with {{-Dhadoopversion=23}}, but all test cases currently fail. I haven't investigated yet.

Rewrite of AvroStorage
--
    Key: PIG-3015
    URL: https://issues.apache.org/jira/browse/PIG-3015
    Project: Pig
    Issue Type: Improvement
    Components: piggybank
    Reporter: Joseph Adler
    Assignee: Joseph Adler
    Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015.patch