[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-22 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3015:
---

Attachment: PIG-3015-2.patch

Hi Joe,

I am attaching a patch that replaces {{createtests.py}} with Java code in 
{{TestAvroStorage.java}}. I did this for two reasons:
* Problem with generating {{evenFileNameTestDirectoryCounts.avro}}. Using the 
Avro Tool API addresses it.
* Better integration with JUnit. Now we can run the unit test with a single 
command ({{ant clean test -Dtestcase=TestAvroStorage}}). This auto-generates 
the input files before each test run and deletes them afterwards.

Please let me know what you think.
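
To illustrate the generate-before, delete-after pattern described in this message, here is a minimal sketch in plain Java. It is not the code in the patch: the class and method names ({{GeneratedInputLifecycle}}, {{generateInputs}}, {{deleteInputs}}, {{runOnce}}) are hypothetical, and a small text file stands in for the Avro inputs that the real test is said to build with the Avro tools API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration: create test inputs before a test, delete them afterwards. */
public class GeneratedInputLifecycle {

    /** Creates a scratch directory and writes one small input file into it. */
    public static Path generateInputs() throws IOException {
        Path dir = Files.createTempDirectory("avrostorage-test-");
        // Stand-in for an Avro file; a real test would write Avro data here.
        Files.write(dir.resolve("records.json"),
                Arrays.asList("{\"key\": \"a\", \"value\": 1}",
                              "{\"key\": \"b\", \"value\": 2}"),
                StandardCharsets.UTF_8);
        return dir;
    }

    /** Deletes the scratch directory and everything under it. */
    public static void deleteInputs(Path dir) throws IOException {
        List<Path> ordered;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so files are removed before their parent directory.
            ordered = paths.sorted(Comparator.reverseOrder())
                           .collect(Collectors.toList());
        }
        for (Path p : ordered) {
            Files.delete(p);
        }
    }

    /** Runs a trivial "test" between setup and teardown; returns the lines read. */
    public static int runOnce() throws IOException {
        Path dir = generateInputs();
        try {
            return Files.readAllLines(dir.resolve("records.json")).size();
        } finally {
            // Teardown runs even if the test body throws.
            deleteInputs(dir);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("lines read: " + runOnce());
    }
}
```

In a JUnit test the same two steps would typically live in setup and teardown methods, so that a single {{ant}} invocation leaves no generated files behind.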

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Subscription: PIG patch available

2012-12-22 Thread jira
Issue Subscription
Filter: PIG patch available (38 issues)

Subscriber: pigdaily

Key         Summary
PIG-3098    Add another test for the self join case
            https://issues.apache.org/jira/browse/PIG-3098
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3086    Allow A Prefix To Be Added To URIs In PigUnit Tests
            https://issues.apache.org/jira/browse/PIG-3086
PIG-3078    Make a UDF that, given a string, returns just the columns prefixed by that string
            https://issues.apache.org/jira/browse/PIG-3078
PIG-3073    POUserFunc creating log spam for large scripts
            https://issues.apache.org/jira/browse/PIG-3073
PIG-3069    Native Windows Compatibility for Pig E2E Tests and Harness
            https://issues.apache.org/jira/browse/PIG-3069
PIG-3067    HBaseStorage should be split up to become more managable
            https://issues.apache.org/jira/browse/PIG-3067
PIG-3066    Fix TestPigRunner in trunk
            https://issues.apache.org/jira/browse/PIG-3066
PIG-3057    make readField protected to be able to override it if we extend PigStorage
            https://issues.apache.org/jira/browse/PIG-3057
PIG-3051    java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
            https://issues.apache.org/jira/browse/PIG-3051
PIG-3050    Fix FindBugs multithreading warnings
            https://issues.apache.org/jira/browse/PIG-3050
PIG-3029    TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
            https://issues.apache.org/jira/browse/PIG-3029
PIG-3028    testGrunt dev test needs some command filters to run correctly without cygwin
            https://issues.apache.org/jira/browse/PIG-3028
PIG-3027    pigTest unit test needs a newline filter for comparisons of golden multi-line
            https://issues.apache.org/jira/browse/PIG-3027
PIG-3026    Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
            https://issues.apache.org/jira/browse/PIG-3026
PIG-3025    TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
            https://issues.apache.org/jira/browse/PIG-3025
PIG-3024    TestEmptyInputDir unit test - hadoop version detection logic is brittle
            https://issues.apache.org/jira/browse/PIG-3024
PIG-3015    Rewrite of AvroStorage
            https://issues.apache.org/jira/browse/PIG-3015
PIG-3010    Allow UDF's to flatten themselves
            https://issues.apache.org/jira/browse/PIG-3010
PIG-2959    Add a pig.cmd for Pig to run under Windows
            https://issues.apache.org/jira/browse/PIG-2959
PIG-2957    TetsScriptUDF fail due to volume prefix in jar
            https://issues.apache.org/jira/browse/PIG-2957
PIG-2956    Invalid cache specification for some streaming statement
            https://issues.apache.org/jira/browse/PIG-2956
PIG-2955    Fix bunch of Pig e2e tests on Windows
            https://issues.apache.org/jira/browse/PIG-2955
PIG-2878    Pig current releases lack a UDF equalIgnoreCase.This function returns a Boolean value indicating whether string left is equal to string right. This check is case insensitive.
            https://issues.apache.org/jira/browse/PIG-2878
PIG-2873    Converting bin/pig shell script to python
            https://issues.apache.org/jira/browse/PIG-2873
PIG-2834    MultiStorage requires unused constructor argument
            https://issues.apache.org/jira/browse/PIG-2834
PIG-2824    Pushing checking number of fields into LoadFunc
            https://issues.apache.org/jira/browse/PIG-2824
PIG-2788    improved string interpolation of variables
            https://issues.apache.org/jira/browse/PIG-2788
PIG-2661    Pig uses an extra job for loading data in Pigmix L9
            https://issues.apache.org/jira/browse/PIG-2661
PIG-2645    PigSplit does not handle the case where SerializationFactory returns null
            https://issues.apache.org/jira/browse/PIG-2645
PIG-2614    AvroStorage crashes on LOADING a single bad error
            https://issues.apache.org/jira/browse/PIG-2614
PIG-2507    Semicolon in paramenters for UDF results in parsing error
            https://issues.apache.org/jira/browse/PIG-2507
PIG-2433    Jython import module not working if module path is in classpath
            https://issues.apache.org/jira/browse/PIG-2433
PIG-2417    Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
            https://issues.apache.org/jira/browse/PIG-2417
PIG-2362    Rework Ant build.xml to use macrodef instead of antcall
            https://issues.apache.org/jira/browse/PIG-2362
PIG-2312    NPE when relation and column share the same name and used in Nested Foreach
            https://issues.apache.org/jira/browse/PIG-2312
PIG-1942    script UDF (jython) should utilize the intended output schema

Re: Our release process

2012-12-22 Thread Santhosh M S
Hi Olga,

Daniel has cut the 0.10.1 branch and the release candidate is up for a vote. I 
think it's reasonable to expect such releases where there are a lot of bug 
fixes. In 0.10.1 there are 48 fixes in all.

The list of bug fixes does not fit the criteria of having no workaround 
or failing silently.

Does that work for you?


Santhosh



 From: Olga Natkovich onatkov...@yahoo.com
To: dev@pig.apache.org dev@pig.apache.org 
Sent: Monday, December 17, 2012 4:16 PM
Subject: Re: Our release process
 
Hi Jonathan,

I thought I answered your email last week but I just noticed that the answer 
did not come through.

We tell users that it is coming in the next release. Now that Pig is quite 
mature and stable, we don't see much of this. Having more frequent releases 
definitely helps in this respect.

Olga






From: Jonathan Coveney jcove...@gmail.com
To: dev@pig.apache.org dev@pig.apache.org; Olga Natkovich 
onatkov...@yahoo.com 
Sent: Thursday, December 13, 2012 1:14 PM
Subject: Re: Our release process

Olga,

A related but separate question: what do y'all do when there is a feature
that is finished, but for an upcoming release? I.e., a feature in trunk, but
not in 0.11 (which, let us assume, is stable).

Jon


2012/12/13 Olga Natkovich onatkov...@yahoo.com

 Hi Julien,

 I think for us at Yahoo to be able to run our releases directly from the
 branch we would need the guarantees that I proposed in my initial email and
 something that we agreed to last year. The only changes that go in are

 - Failures without reasonable workarounds
 - Silent failures.

 My main concern with the proposal is that I do not believe that our
 current testing infra is robust/inclusive enough to catch errors. That's
 why I am hesitant to widen the scope.

 I am fine with whatever outcome the majority of people agree with. I
 am just saying that Yahoo will likely need a private branch if our rules
 are too relaxed.

 Olga



 - Original Message -
 From: Julien Le Dem jul...@twitter.com
 To: dev@pig.apache.org dev@pig.apache.org; Olga Natkovich 
 onatkov...@yahoo.com
 Cc:
 Sent: Wednesday, December 12, 2012 4:54 PM
 Subject: Re: Our release process

 Agreed. The priority of a change is subjective as well.
 My definition for inclusion on the release branch:
 - Only bug fixes.
 - Only if they have fairly understood repercussions (up to the committers
 who +/-1 as usual).
 - If we thought it would not break things but still does (CI or externally
 reported failure) we revert it.
 What do you want to add/change? Please reformulate those rules the way you
 like and let's see how we can converge.
 (Also, let's keep it short for clarity)

 Julien

 On Wed, Dec 12, 2012 at 11:08 AM, Olga Natkovich onatkov...@yahoo.com
 wrote:

  Hi Julien,
 
  I understand what you are trying to do and I can see that being able to
  make more fixes post release has value for some use cases. My concern is
  that which changes do not destabilize the branch is fairly subjective and
  also not always easy to ascertain beyond trivial changes. The only way I
  know to keep code stable is to limit the updates. Also we need to clearly
  state what the constraints are for post-release commits so that every
  user can decide whether it works for them.
 
  Olga
 
 
  
  From: Julien Le Dem jul...@twitter.com
  To: dev@pig.apache.org dev@pig.apache.org
  Sent: Wednesday, December 12, 2012 10:26 AM
  Subject: Re: Our release process
 
  I think we all agree here, let's not jump to conclusions.
  Everything in this branch I am talking about is in Apache Pig. Everything
  we do in Pig is contributed.
  We have a branch for 0.11 where we keep merging the official 0.11 branch
  plus a few patches (and it will stay small) that are only in Apache TRUNK.
  The goal here is to help keep the release branch stable by not adding
  patches that are only useful to us.
  Having this branch allows us to fix anything quickly and redeploy to
  production. It is also what allows us to use the pig 0.11 branch in
  production before it is even released.
  This definitely benefits the community and helps make 0.11 stable.
  This is a very reasonable way to keep using a recent version of Pig in
  production.
 
  Olga: My goal is to decrease the scope of what is going in the release
  branch and to make sure we add only bug fixes that are not making it
  unstable. I also think having a short definition of this helps, which is
  why I have been chiming in.
  Let us know how you want to decrease the scope. I'm just trying to
  simplify here.
 
  Julien
 
 
 
  On Tue, Dec 11, 2012 at 8:54 AM, Prashant Kommireddi 
 prash1...@gmail.com
  wrote:
 
   Share the same concern as Russell here. Not great for the project for
   everyone to go private branch approach.
  
   On Tue, Dec 11, 2012 at 8:33 AM, Russell Jurney 
  russell.jur...@gmail.com
   wrote:
  
Wait. Ack. Do we 

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2012-12-22 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3015:
---

Attachment: PIG-3015-3.patch

I am attaching a new patch that includes the following fixes:
# {{hadoop-core.jar}} is pulled as it's a dependency of {{trevni-core.jar}}. 
This causes compile errors with hadoop-2.0.x. I fixed this as follows:
{code}
+<dependency org="org.apache.avro" name="trevni-avro" rev="${avro.version}"
+  conf="compile->default;checkstyle->master">
+  <exclude org="org.apache.hadoop" module="hadoop-core"/>
+</dependency>
{code}
# {{avro-tools.jar}} contains hadoop classes such as Configuration. This causes 
compile errors with hadoop-2.0.x. I fixed this as follows:
{code}
+<dependency org="org.apache.avro" name="avro-tools" rev="${avro.version}"
+  conf="test->default">
+  <artifact name="nodeps" type="jar"/>
+</dependency>
{code}

 Now I can run {{TestAvroStorage}} with {{-Dhadoopversion=23}}, but all test 
cases currently fail. I haven't investigated yet.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015.patch

