[jira] [Commented] (PIG-2362) Rework Ant build.xml to use macrodef instead of antcall

2013-01-04 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543726#comment-13543726
 ] 

Gianmarco De Francisci Morales commented on PIG-2362:
-

Hi Cheolsoo,

Good catch, thanks. I wasn't familiar with PIG-2748.
I verified that {{ant mvn-jar}} passes.
Also ran the {{eclipse-files}}, {{src-release}}, {{tar-release}} targets and 
verified their output.

+1 to the last patch

 Rework Ant build.xml to use macrodef instead of antcall
 ---

 Key: PIG-2362
 URL: https://issues.apache.org/jira/browse/PIG-2362
 Project: Pig
  Issue Type: Improvement
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
Priority: Minor
 Fix For: 0.12

 Attachments: PIG-2362.10.patch, PIG-2362.1.patch, PIG-2362.2.patch, 
 PIG-2362.3.patch, PIG-2362.4.patch, PIG-2362.5.patch, PIG-2362.6.patch, 
 PIG-2362.7.patch, PIG-2362.8.patch, PIG-2362.9.patch, 
 PIG-2362.9.patch.nowhitespace


 Antcall is evil: http://www.build-doctor.com/2008/03/13/antcall-is-evil/
 We'd better use macrodef and let Ant build a clean dependency graph.
 http://ant.apache.org/manual/Tasks/macrodef.html
 Right now we do it like this:
 {code}
 <target name="buildAllJars">
   <antcall target="buildJar">
     <param name="build.dir" value="jar-A"/>
   </antcall>
   <antcall target="buildJar">
     <param name="build.dir" value="jar-B"/>
   </antcall>
   <antcall target="buildJar">
     <param name="build.dir" value="jar-C"/>
   </antcall>
 </target>
 <target name="buildJar">
   <jar destfile="target/${build.dir}.jar" basedir="${build.dir}/classfiles"/>
 </target>
 {code}
 But it would be better to do it like this:
 {code}
 <target name="buildAllJars">
   <buildJar build.dir="jar-A"/>
   <buildJar build.dir="jar-B"/>
   <buildJar build.dir="jar-C"/>
 </target>
 <macrodef name="buildJar">
   <attribute name="build.dir"/>
   <sequential>
     <jar destfile="target/@{build.dir}.jar" basedir="@{build.dir}/classfiles"/>
   </sequential>
 </macrodef>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2433) Jython import module not working if module path is in classpath

2013-01-04 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543730#comment-13543730
 ] 

Cheolsoo Park commented on PIG-2433:


Hi Rohini,

After applying the patch to trunk, I see the following error in 
TestScriptUDF.testPythonNestedImportClassPath:
{code}
Testcase: testPythonNestedImportClassPath took 0.182 sec
Caused an ERROR
Python Error. Traceback (most recent call last):
  File "/home/cheolsoo/workspace/pig-svn/scriptB.py", line 2, in <module>
    import scriptA
  File "__pyclasspath__/scriptA.py", line 3, in <module>
NameError: name 'outputSchema' is not defined
{code}
Does this test pass for you? 

 Jython import module not working if module path is in classpath
 ---

 Key: PIG-2433
 URL: https://issues.apache.org/jira/browse/PIG-2433
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.10.0
Reporter: Daniel Dai
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-2433.patch


 This is a hole left by PIG-1824. If the path of the python module is in the 
 classpath, the job dies with the message could not instantiate 
 'org.apache.pig.scripting.jython.JythonFunction'.
 Here is my observation:
 If the path of the python module is in the classpath, the fileEntry we get in 
 JythonScriptEngine:236 is __pyclasspath__/script$py.class instead of the 
 script itself. Thus we cannot locate the script, and the script is skipped in 
 job.xml. 
 For example:
 {code}
 register 'scriptB.py' using org.apache.pig.scripting.jython.JythonScriptEngine as pig;
 A = LOAD 'table_testPythonNestedImport' as (a0:long, a1:long);
 B = foreach A generate pig.square(a0);
 dump B;

 scriptB.py:
 #!/usr/bin/python
 import scriptA

 @outputSchema("x:{t:(num:double)}")
 def sqrt(number):
     return (number ** .5)

 @outputSchema("x:{t:(num:long)}")
 def square(number):
     return long(scriptA.square(number))

 scriptA.py:
 #!/usr/bin/python
 def square(number):
     return (number * number)
 {code}
 When we register scriptB.py, we use the jython library to figure out the 
 dependent modules scriptB relies on, in this case scriptA. However, if the 
 current directory is in the classpath, instead of scriptA.py we get 
 __pyclasspath__/scriptA.class. Then, when we try to put 
 __pyclasspath__/script$py.class into job.jar, Pig complains that 
 __pyclasspath__/script$py.class does not exist. 
 This is exactly what TestScriptUDF.testPythonNestedImport is doing. In hadoop 
 20.x, the test still succeeds because MiniCluster picks up the local 
 classpath, so it can still find scriptA.py even if it is not in job.jar. 
 However, the script will fail on a real cluster and on MiniMRYarnCluster of 
 hadoop 23.
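 The failure mode can be sketched in plain Python. This is a toy model of the dependency bookkeeping described above, not Pig's actual JythonScriptEngine code; the function name and the modules-to-paths mapping are hypothetical:

```python
def resolve_dependency_files(modules):
    """Split dependent modules into real files that can be shipped in
    job.jar and classpath-resolved entries that have no real file behind
    them (the '__pyclasspath__' pseudo-paths described above).

    modules: dict mapping module name -> path reported by the resolver.
    """
    shippable, unresolved = [], []
    for name, path in modules.items():
        if path.startswith("__pyclasspath__/"):
            # Found via the classpath: this path is not a real file on
            # disk, so copying it into job.jar fails with "does not exist".
            unresolved.append(name)
        else:
            shippable.append(path)
    return shippable, unresolved

# With the current directory on the classpath, scriptA resolves to a
# pseudo-path and cannot be shipped:
deps = {"scriptA": "__pyclasspath__/scriptA.py"}
print(resolve_dependency_files(deps))  # → ([], ['scriptA'])
```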

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (PIG-2362) Rework Ant build.xml to use macrodef instead of antcall

2013-01-04 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park resolved PIG-2362.


Resolution: Fixed

Committed to trunk. Thanks Gianmarco!

 Rework Ant build.xml to use macrodef instead of antcall
 ---

 Key: PIG-2362
 URL: https://issues.apache.org/jira/browse/PIG-2362
 Project: Pig
  Issue Type: Improvement
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
Priority: Minor
 Fix For: 0.12

 Attachments: PIG-2362.10.patch, PIG-2362.1.patch, PIG-2362.2.patch, 
 PIG-2362.3.patch, PIG-2362.4.patch, PIG-2362.5.patch, PIG-2362.6.patch, 
 PIG-2362.7.patch, PIG-2362.8.patch, PIG-2362.9.patch, 
 PIG-2362.9.patch.nowhitespace


 Antcall is evil: http://www.build-doctor.com/2008/03/13/antcall-is-evil/
 We'd better use macrodef and let Ant build a clean dependency graph.
 http://ant.apache.org/manual/Tasks/macrodef.html
 Right now we do it like this:
 {code}
 <target name="buildAllJars">
   <antcall target="buildJar">
     <param name="build.dir" value="jar-A"/>
   </antcall>
   <antcall target="buildJar">
     <param name="build.dir" value="jar-B"/>
   </antcall>
   <antcall target="buildJar">
     <param name="build.dir" value="jar-C"/>
   </antcall>
 </target>
 <target name="buildJar">
   <jar destfile="target/${build.dir}.jar" basedir="${build.dir}/classfiles"/>
 </target>
 {code}
 But it would be better to do it like this:
 {code}
 <target name="buildAllJars">
   <buildJar build.dir="jar-A"/>
   <buildJar build.dir="jar-B"/>
   <buildJar build.dir="jar-C"/>
 </target>
 <macrodef name="buildJar">
   <attribute name="build.dir"/>
   <sequential>
     <jar destfile="target/@{build.dir}.jar" basedir="@{build.dir}/classfiles"/>
   </sequential>
 </macrodef>
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [VOTE] Release Pig 0.10.1 (candidate 3)

2013-01-04 Thread Cheolsoo Park
+1.

Tested against hadoop 1.0.x and 2.0.x clusters.


On Fri, Jan 4, 2013 at 12:14 AM, Jarek Jarcec Cecho jar...@apache.org wrote:

 +1 (non-binding)

 * Verified checksum
 * Verified signatures
 * Tests seem to be passing
 * Checked top level files (NOTICE, LICENSE)

 Note:
 I personally prefer when all third party jars are explicitly mentioned in
 the LICENSE file, as is recommended in [1] (and used for example in [2], [3],
 [4]), but that is not required.

 Jarcec

 Links:
 1:
 http://incubator.apache.org/guides/releasemanagement.html#best-practice-license
 2: http://incubator.apache.org/guides/examples/LICENSE
 3:
 https://git-wip-us.apache.org/repos/asf?p=sqoop.git;a=blob;f=LICENSE.txt;h=00b52964892971fea9280a6201b7efe11a2527e5;hb=sqoop2
 4:
 https://git-wip-us.apache.org/repos/asf?p=flume.git;a=blob;f=LICENSE;h=04c1baf1e0aedfca63f81cff2d64593d8ab55f09;hb=58173b8983027124a61783b4326dee3347ab7552

 On Thu, Jan 03, 2013 at 06:16:33PM -0800, Thejas Nair wrote:
  +1
  Verified md5 checksums of src and binary tar.gz.
  Built the src tar.gz and ran queries against a hadoop 1.1 cluster,
  ran fs and sh commands.
 
  -Thejas
 
 
  On 1/3/13 12:11 PM, Rohini Palaniswamy wrote:
  +1. Downloaded the tar binary, checked signature, ran unit tests,
 piggybank
  unit tests, checked docs/release notes, ran a simple script locally and
  against a cluster.
  
  
  On Mon, Dec 31, 2012 at 8:41 AM, Alan Gates ga...@hortonworks.com
 wrote:
  
  +1, yet again :).  Checked the key signature and checksum on the source
  package.   Built and ran commit unit tests on src, ran a test job in
 local
  mode.  Downloaded the tar binary and ran a job in local and cluster
 mode.
  
  Alan.
  
  On Dec 28, 2012, at 11:50 PM, Daniel Dai wrote:
  
  Hi,
  
  I have created a candidate build for Pig 0.10.1. This is a maintenance
  release of Pig 0.10.
  
  Keys used to sign the release are available at
  http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup
  
  Please download, test, and try it out:
  
  http://people.apache.org/~daijy/pig-0.10.1-candidate-3/
  
  Should we release this? Vote closes on EOD next Friday, Jan 4th.
  
  Thanks,
  Daniel
  
  
  
 



[jira] [Commented] (PIG-3108) HBaseStorage returns empty maps when mixing wildcard- with other columns

2013-01-04 Thread Christoph Bauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543751#comment-13543751
 ] 

Christoph Bauer commented on PIG-3108:
--

Sorry. I've been with this code too long. I will try to explain.

addFiltersWithColumnPrefix and addFiltersWithoutColumnPrefix actually do
different things:

   - addFiltersWithColumnPrefix creates HBase scan filters
   - addFiltersWithoutColumnPrefix tells the scan object which families and
   columns to retrieve. This is much quicker than adding filters, which is why
   it was changed.

The thing is: the scan object should always be limited to the
families/columns needed, to speed things up. In fact we do this already - see
setLocation (it's basically the same as in addFiltersWithoutColumnPrefix).

So what I did was to replace the code in setLocation with a call to
addFiltersWithoutColumnPrefix.


To make things clear, we could remove addFiltersWithoutColumnPrefix from
the if/else in initScan() (setLocation will be called anyway) and rename it
to setScanColumns or something.



2013/1/3 Bill Graham (JIRA) j...@apache.org



 HBaseStorage returns empty maps when mixing wildcard- with other columns
 

 Key: PIG-3108
 URL: https://issues.apache.org/jira/browse/PIG-3108
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.9.0, 0.9.1, 0.9.2, 0.10.0, 0.11, 0.10.1, 0.12
Reporter: Christoph Bauer
 Fix For: 0.12

 Attachments: PIG-3108.patch


 Consider the following:
 A and B should be the same (with different order, of course).
 {code}
 /*
 in hbase shell:
 create 'pigtest', 'pig'
 put 'pigtest' , '1', 'pig:name', 'A'
 put 'pigtest' , '1', 'pig:has_legs', 'true'
 put 'pigtest' , '1', 'pig:has_ribs', 'true'
 */
 A = LOAD 'hbase://pigtest' USING 
 org.apache.pig.backend.hadoop.hbase.HBaseStorage('pig:name pig:has*') AS 
 (name:chararray,parts);
 B = LOAD 'hbase://pigtest' USING 
 org.apache.pig.backend.hadoop.hbase.HBaseStorage('pig:has* pig:name') AS 
 (parts,name:chararray);
 dump A;
 dump B;
 {code}
 This is due to a bug in setLocation and initScan.
 For _A_:
 # scan.addColumn("pig", "name"); // for 'pig:name'
 # scan.addFamily("pig"); // for the 'pig:has*'
 So that happens to come out right.
 But for _B_:
 # scan.addFamily("pig")
 # scan.addColumn("pig", "name")
 will override the first call to addFamily, because you cannot mix them on the 
 same family.
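
 The override can be sketched with a toy model of the scan's per-family column selection (illustrative Python, not the real org.apache.hadoop.hbase.client.Scan API; the class and field names are made up):

```python
class ToyScan:
    """Toy model of how an HBase scan tracks column selection per family:
    None means 'fetch the whole family'; a set means 'only these qualifiers'."""

    def __init__(self):
        self.family_map = {}

    def add_family(self, family):
        # Select every column in the family, discarding any per-column list.
        self.family_map[family] = None

    def add_column(self, family, qualifier):
        # Replaces an earlier whole-family selection with an explicit set.
        cols = self.family_map.get(family)
        if cols is None:
            cols = set()
        cols.add(qualifier)
        self.family_map[family] = cols

# Case A: addColumn first, then addFamily -> the whole family wins,
# which happens to include pig:name, so the result looks correct.
a = ToyScan()
a.add_column("pig", "name")
a.add_family("pig")
print(a.family_map)  # → {'pig': None}

# Case B: addFamily first, then addColumn -> only 'name' is fetched,
# so the wildcard 'pig:has*' columns come back as empty maps.
b = ToyScan()
b.add_family("pig")
b.add_column("pig", "name")
print(b.family_map)  # → {'pig': {'name'}}
```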

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Build failed in Jenkins: Pig-trunk #1384

2013-01-04 Thread Apache Jenkins Server
See https://builds.apache.org/job/Pig-trunk/1384/changes

Changes:

[cheolsoo] PIG-2362: Rework Ant build.xml to use macrodef instead of antcall 
(azaroth via cheolsoo)

--
[...truncated 6491 lines...]
 [findbugs]   jline.History
 [findbugs]   org.jruby.embed.internal.LocalContextProvider
 [findbugs]   org.apache.hadoop.io.BooleanWritable
 [findbugs]   org.apache.log4j.Logger
 [findbugs]   org.apache.hadoop.hbase.filter.FamilyFilter
 [findbugs]   org.codehaus.jackson.annotate.JsonPropertyOrder
 [findbugs]   groovy.lang.Tuple
 [findbugs]   org.antlr.runtime.IntStream
 [findbugs]   org.apache.hadoop.util.ReflectionUtils
 [findbugs]   org.apache.hadoop.fs.ContentSummary
 [findbugs]   org.jruby.runtime.builtin.IRubyObject
 [findbugs]   org.jruby.RubyInteger
 [findbugs]   org.python.core.PyTuple
 [findbugs]   org.mortbay.log.Log
 [findbugs]   org.apache.hadoop.conf.Configuration
 [findbugs]   com.google.common.base.Joiner
 [findbugs]   org.apache.hadoop.mapreduce.lib.input.FileSplit
 [findbugs]   org.apache.hadoop.mapred.Counters$Counter
 [findbugs]   com.jcraft.jsch.Channel
 [findbugs]   org.apache.hadoop.mapred.JobPriority
 [findbugs]   org.apache.commons.cli.Options
 [findbugs]   org.apache.hadoop.mapred.JobID
 [findbugs]   org.apache.hadoop.util.bloom.BloomFilter
 [findbugs]   org.python.core.PyFrame
 [findbugs]   org.apache.hadoop.hbase.filter.CompareFilter
 [findbugs]   org.apache.hadoop.util.VersionInfo
 [findbugs]   org.python.core.PyString
 [findbugs]   org.apache.hadoop.io.Text$Comparator
 [findbugs]   org.jruby.runtime.Block
 [findbugs]   org.antlr.runtime.MismatchedSetException
 [findbugs]   org.apache.hadoop.io.BytesWritable
 [findbugs]   org.apache.hadoop.fs.FsShell
 [findbugs]   org.joda.time.Months
 [findbugs]   org.mozilla.javascript.ImporterTopLevel
 [findbugs]   org.apache.hadoop.hbase.mapreduce.TableOutputFormat
 [findbugs]   org.apache.hadoop.mapred.TaskReport
 [findbugs]   org.apache.hadoop.security.UserGroupInformation
 [findbugs]   org.antlr.runtime.tree.RewriteRuleSubtreeStream
 [findbugs]   org.apache.commons.cli.HelpFormatter
 [findbugs]   com.google.common.collect.Maps
 [findbugs]   org.joda.time.ReadableInstant
 [findbugs]   org.mozilla.javascript.NativeObject
 [findbugs]   org.apache.hadoop.hbase.HConstants
 [findbugs]   org.apache.hadoop.io.serializer.Deserializer
 [findbugs]   org.antlr.runtime.FailedPredicateException
 [findbugs]   org.apache.hadoop.io.compress.CompressionCodec
 [findbugs]   org.jruby.RubyNil
 [findbugs]   org.apache.hadoop.fs.FileStatus
 [findbugs]   org.apache.hadoop.hbase.client.Result
 [findbugs]   org.apache.hadoop.mapreduce.JobContext
 [findbugs]   org.codehaus.jackson.JsonGenerator
 [findbugs]   org.apache.hadoop.mapreduce.TaskAttemptContext
 [findbugs]   org.apache.hadoop.io.LongWritable$Comparator
 [findbugs]   org.codehaus.jackson.map.util.LRUMap
 [findbugs]   org.apache.hadoop.hbase.util.Bytes
 [findbugs]   org.antlr.runtime.MismatchedTokenException
 [findbugs]   org.codehaus.jackson.JsonParser
 [findbugs]   com.jcraft.jsch.UserInfo
 [findbugs]   org.apache.hadoop.hbase.filter.WhileMatchFilter
 [findbugs]   org.python.core.PyException
 [findbugs]   org.apache.commons.cli.ParseException
 [findbugs]   org.apache.hadoop.io.compress.CompressionOutputStream
 [findbugs]   org.apache.hadoop.hbase.filter.WritableByteArrayComparable
 [findbugs]   org.antlr.runtime.tree.CommonTreeNodeStream
 [findbugs]   org.apache.log4j.Level
 [findbugs]   org.apache.hadoop.hbase.client.Scan
 [findbugs]   org.jruby.anno.JRubyMethod
 [findbugs]   org.apache.hadoop.mapreduce.Job
 [findbugs]   com.google.common.util.concurrent.Futures
 [findbugs]   org.apache.commons.logging.LogFactory
 [findbugs]   org.apache.commons.collections.IteratorUtils
 [findbugs]   org.apache.commons.codec.binary.Base64
 [findbugs]   org.codehaus.jackson.map.ObjectMapper
 [findbugs]   org.apache.hadoop.fs.FileSystem
 [findbugs]   org.jruby.embed.LocalContextScope
 [findbugs]   org.apache.hadoop.hbase.filter.FilterList$Operator
 [findbugs]   org.jruby.RubySymbol
 [findbugs]   org.codehaus.jackson.map.annotate.JacksonStdImpl
 [findbugs]   org.apache.hadoop.hbase.io.ImmutableBytesWritable
 [findbugs]   org.apache.hadoop.io.serializer.SerializationFactory
 [findbugs]   org.antlr.runtime.tree.TreeAdaptor
 [findbugs]   org.apache.hadoop.mapred.RunningJob
 [findbugs]   org.antlr.runtime.CommonTokenStream
 [findbugs]   org.apache.hadoop.io.DataInputBuffer
 [findbugs]   org.apache.hadoop.io.file.tfile.TFile
 [findbugs]   org.apache.commons.cli.GnuParser
 [findbugs]   org.mozilla.javascript.Context
 [findbugs]   org.apache.hadoop.io.FloatWritable
 [findbugs]   org.antlr.runtime.tree.RewriteEarlyExitException
 [findbugs]   org.apache.hadoop.hbase.HBaseConfiguration
 [findbugs]   org.codehaus.jackson.JsonGenerationException
 [findbugs]   org.apache.hadoop.mapreduce.TaskInputOutputContext
 [findbugs]   org.apache.hadoop.io.compress.GzipCodec
 [findbugs]   org.jruby.RubyString
 

[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-5.patch

Added fixes for compression (and other metadata)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch, PIG-3015.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544116#comment-13544116
 ] 

Cheolsoo Park commented on PIG-3015:


Hi Joe, I think you forgot to add new files to the patch. Do you mind uploading 
the patch again? :-)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch, PIG-3015.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015.patch)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: (was: PIG-3015-5.patch)

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Adler updated PIG-3015:
--

Attachment: PIG-3015-5.patch

Oops, this one contains the changes.

 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Review Request: PIG-3015 Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/
---

(Updated Jan. 4, 2013, 7:22 p.m.)


Review request for pig and Cheolsoo Park.


Changes
---

Fixes to make compression work


Description
---

The current AvroStorage implementation has a lot of issues: it requires old 
versions of Avro, it copies data much more than needed, and it's verbose and 
complicated. (One pet peeve of mine is that old versions of Avro don't support 
Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new 
implementation is significantly faster, and the code is a lot simpler. 
Rewriting AvroStorage also enabled me to implement support for Trevni.

This is the latest version of the patch, complete with test cases and 
TrevniStorage. (Test cases for TrevniStorage are still missing).


This addresses bug PIG-3015.
https://issues.apache.org/jira/browse/PIG-3015


Diffs (updated)
-

  .eclipse.templates/.classpath c7b83b8 
  ivy.xml 70e8d50 
  ivy/libraries.properties 7b07c7e 
  src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
  src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroArrayReader.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroBagWrapper.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroMapWrapper.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroRecordReader.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroRecordWriter.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroStorageDataConversionUtilities.java 
PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroStorageSchemaConversionUtilities.java 
PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java PRE-CREATION 
  test/commit-tests 5081fbc 
  test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_blank_first_args.pig 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/namesWithDoubleColons.pig 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/trevni_to_avro.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/trevni_to_trevni.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/arrays.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/arraysAsOutputByPig.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsAsOutputByPig.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsOfArrays.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsOfArraysOfRecords.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsSubSchema.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsSubSchemaNullable.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithDoubleUnderscores.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithEnums.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithFixed.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithMaps.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithMapsOfRecords.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithNullableUnions.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/arrays.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/arraysAsOutputByPig.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc 
PRE-CREATION 
  

Re: Review Request: PIG-3015 Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/
---

(Updated Jan. 4, 2013, 7:22 p.m.)


Review request for pig and Cheolsoo Park.


Description
---

The current AvroStorage implementation has a lot of issues: it requires old 
versions of Avro, it copies data much more than needed, and it's verbose and 
complicated. (One pet peeve of mine is that old versions of Avro don't support 
Snappy compression.)

I rewrote AvroStorage from scratch to fix these issues. In early tests, the new 
implementation is significantly faster, and the code is a lot simpler. 
Rewriting AvroStorage also enabled me to implement support for Trevni.

This is the latest version of the patch, complete with test cases and 
TrevniStorage. (Test cases for TrevniStorage are still missing).


This addresses bug PIG-3015.
https://issues.apache.org/jira/browse/PIG-3015


Diffs
-

  .eclipse.templates/.classpath c7b83b8 
  ivy.xml 70e8d50 
  ivy/libraries.properties 7b07c7e 
  src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION 
  src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroArrayReader.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroBagWrapper.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroMapWrapper.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroRecordReader.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroRecordWriter.java PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroStorageDataConversionUtilities.java 
PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroStorageSchemaConversionUtilities.java 
PRE-CREATION 
  src/org/apache/pig/impl/util/avro/AvroTupleWrapper.java PRE-CREATION 
  test/commit-tests 5081fbc 
  test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_blank_first_args.pig 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/namesWithDoubleColons.pig 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/trevni_to_avro.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/code/pig/trevni_to_trevni.pig PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/arrays.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/arraysAsOutputByPig.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsAsOutputByPig.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsOfArrays.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsOfArraysOfRecords.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsSubSchema.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsSubSchemaNullable.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithDoubleUnderscores.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithEnums.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithFixed.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithMaps.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithMapsOfRecords.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recordsWithNullableUnions.json 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/arrays.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/arraysAsOutputByPig.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc 
PRE-CREATION 
  test/org/apache/pig/builtin/avro/schema/recordsWithDoubleUnderscores.avsc 
PRE-CREATION 
  

[jira] [Commented] (PIG-3059) Global configurable minimum 'bad record' thresholds

2013-01-04 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544219#comment-13544219
 ] 

Russell Jurney commented on PIG-3059:
-

Regarding Avro: reading 
https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java
it looks like you can still sync to the next record after most bad reads. We 
should do so.

You're right that a bad sync halts things, but in that case you might try 
advancing by some amount using seek() and then sync'ing again. I think this 
would work; I could be wrong, but from looking at how seeks work, I think it 
would be OK. Kinda neat, maybe? Worst case, we would only throw out input 
splits on a bad sync(), not on a bad read(). length() should help, as might 
pastSync(), skip(), and available().
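The recovery idea above can be sketched with a self-contained toy model. This is a simulation of the skip-to-next-sync behavior only, not Avro's actual DataFileReader API: on a corrupt record, scan forward to the next sync marker and resume reading there, discarding just the records in between.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of Avro-style sync recovery: on a bad read, scan forward to the
 *  next sync marker and resume there, instead of failing the whole split. */
final class SyncRecoveryDemo {
    static final String SYNC = "<sync>";

    /** Reads records from a token stream; a "BAD" token simulates a corrupt read. */
    static List<String> readWithRecovery(String[] tokens) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < tokens.length) {
            if (SYNC.equals(tokens[pos])) { pos++; continue; } // skip markers
            try {
                out.add(readRecord(tokens, pos));
                pos++;
            } catch (IllegalStateException corrupt) {
                // Bad read: advance to just past the next sync marker,
                // discarding the records up to that point.
                do { pos++; } while (pos < tokens.length && !SYNC.equals(tokens[pos]));
                pos++;
            }
        }
        return out;
    }

    static String readRecord(String[] tokens, int pos) {
        if ("BAD".equals(tokens[pos])) throw new IllegalStateException("corrupt record");
        return tokens[pos];
    }
}
```

With input {"a", "b", "BAD", "c", "&lt;sync&gt;", "d"}, only "c" is lost alongside the corrupt record; reading resumes at "d". That mirrors the worst case described above: a bad read costs records up to the next sync point, not the whole split.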

I agree with Dmitriy's feedback, thanks for taking the time.

 Global configurable minimum 'bad record' thresholds
 ---

 Key: PIG-3059
 URL: https://issues.apache.org/jira/browse/PIG-3059
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.11
Reporter: Russell Jurney
Assignee: Cheolsoo Park
 Fix For: 0.12

 Attachments: avro_test_files-2.tar.gz, PIG-3059-2.patch, 
 PIG-3059.patch


 See PIG-2614. 
 Pig dies when one record in a LOAD of a billion records fails to parse. This 
 is almost certainly not the desired behavior. elephant-bird and some other 
 storage UDFs have minimum thresholds in terms of percent and count that must 
 be exceeded before a job will fail outright.
 We need these limits to be configurable for Pig, globally. I've come to 
 realize what a major problem Pig's crashing on bad records is for new Pig 
 users. I believe this feature can greatly improve Pig.
 An example of a config would look like:
 pig.storage.bad.record.threshold=0.01
 pig.storage.bad.record.min=100
 A thorough discussion of this issue is available here: 
 http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
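The two proposed settings compose into a simple rule: fail the job only once both the absolute count and the fraction of bad records are exceeded. A minimal sketch of that rule (the class and method names are hypothetical, not Pig's actual implementation):

```java
/** Hypothetical counter implementing the proposed thresholds:
 *  fail only when bad >= min AND bad/total > threshold. */
final class BadRecordPolicy {
    private final double threshold; // e.g. 0.01 -> more than 1% bad
    private final long min;         // e.g. 100  -> and at least 100 bad records
    private long bad, total;

    BadRecordPolicy(double threshold, long min) {
        this.threshold = threshold;
        this.min = min;
    }

    /** Count one input record; returns true once the job should fail. */
    boolean onRecord(boolean isBad) {
        total++;
        if (isBad) bad++;
        return bad >= min && (double) bad / total > threshold;
    }
}
```

With threshold=0.01 and min=100, a single malformed record in a billion no longer kills the job, while a systematically broken input still fails quickly.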

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544222#comment-13544222
 ] 

Russell Jurney commented on PIG-3015:
-

Joe, some comments on handling errors in PIG-3059:

Regarding Avro: reading 
https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java
it looks like you can still sync to the next record after most bad reads. We 
should do so.

You're right that a bad sync halts things, but in that case you might try 
advancing by some amount using seek() and then sync'ing again. I think this 
would work; I could be wrong, but from looking at how seeks work, I think it 
would be OK. Kinda neat, maybe? Worst case, we would only throw out input 
splits on a bad sync(), not on a bad read(). length() should help, as might 
pastSync(), skip(), and available().


 Rewrite of AvroStorage
 --

 Key: PIG-3015
 URL: https://issues.apache.org/jira/browse/PIG-3015
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Reporter: Joseph Adler
Assignee: Joseph Adler
 Attachments: PIG-3015-2.patch, PIG-3015-3.patch, PIG-3015-4.patch, 
 PIG-3015-5.patch


 The current AvroStorage implementation has a lot of issues: it requires old 
 versions of Avro, it copies data much more than needed, and it's verbose and 
 complicated. (One pet peeve of mine is that old versions of Avro don't 
 support Snappy compression.)
 I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
 new implementation is significantly faster, and the code is a lot simpler. 
 Rewriting AvroStorage also enabled me to implement support for Trevni (as 
 TrevniStorage).
 I'm opening this ticket to facilitate discussion while I figure out the best 
 way to contribute the changes back to Apache.



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2013-01-04 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544248#comment-13544248
 ] 

Joseph Adler commented on PIG-3015:
---

Hi Russ,

I think you're right... it looks like you could do something like this in 
AvroRecordReader.nextKeyValue:

{code}
  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {

    // Stop once the reader has moved past the end of this input split.
    if (reader.pastSync(end)) {
      return false;
    }

    try {
      currentRecord = reader.next(new GenericData.Record(schema));
    } catch (NoSuchElementException e) {
      // Clean end of file: no more records.
      return false;
    } catch (IOException ioe) {
      // Bad read: skip ahead one byte and re-sync to the next sync
      // marker before propagating the error.
      reader.sync(reader.tell() + 1);
      throw ioe;
    }

    return true;
  }
{code}

Let me test this out to make sure it runs correctly on uncorrupted files. Would 
you mind creating a corrupted test file that I can use for testing?



[jira] [Commented] (PIG-2433) Jython import module not working if module path is in classpath

2013-01-04 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544383#comment-13544383
 ] 

Rohini Palaniswamy commented on PIG-2433:
-

ant clean test -Dtestcase=TestScriptUDF passes for me. 

 Jython import module not working if module path is in classpath
 ---

 Key: PIG-2433
 URL: https://issues.apache.org/jira/browse/PIG-2433
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.10.0
Reporter: Daniel Dai
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-2433.patch


 This is a hole in PIG-1824. If the path of the python module is in the 
 classpath, the job dies with the message could not instantiate 
 'org.apache.pig.scripting.jython.JythonFunction'.
 Here is my observation:
 If the path of the python module is in the classpath, the fileEntry we get in 
 JythonScriptEngine:236 is __pyclasspath__/script$py.class instead of the 
 script itself. Thus we cannot locate the script, and it is skipped in 
 job.xml. 
 For example:
 {code}
 register 'scriptB.py' using 
 org.apache.pig.scripting.jython.JythonScriptEngine as pig
 A = LOAD 'table_testPythonNestedImport' as (a0:long, a1:long);
 B = foreach A generate pig.square(a0);
 dump B;
 scriptB.py:
 #!/usr/bin/python
 import scriptA
 @outputSchema("x:{t:(num:double)}")
 def sqrt(number):
  return (number ** .5)
 @outputSchema("x:{t:(num:long)}")
 def square(number):
  return long(scriptA.square(number))
 scriptA.py:
 #!/usr/bin/python
 def square(number):
  return (number * number)
 {code}
 When we register scriptB.py, we use the jython library to figure out the 
 modules scriptB depends on, in this case scriptA. However, if the 
 current directory is in the classpath, instead of scriptA.py we get 
 __pyclasspath__/scriptA.class. Then, when we try to put 
 __pyclasspath__/script$py.class into job.jar, Pig complains that 
 __pyclasspath__/script$py.class does not exist. 
 This is exactly what TestScriptUDF.testPythonNestedImport does. On hadoop 
 20.x the test still succeeds, because MiniCluster uses the local classpath, 
 so it can still find scriptA.py even if it is not in job.jar. However, the 
 script will fail on a real cluster and on MiniMRYarnCluster of hadoop 23.
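The core of the problem described above is distinguishing a real, shippable .py file from the __pyclasspath__ pseudo-path that Jython reports. A hypothetical helper illustrating the check (this is not Pig's actual code; the class and method names are made up):

```java
/** Hypothetical helper: classify the path Jython reports for a dependent
 *  module, and map a classpath pseudo-path back to its source file name. */
final class ModulePathCheck {
    static final String PYCLASSPATH = "__pyclasspath__/";

    /** A real file path can be shipped in job.jar; a pseudo-path cannot. */
    static boolean isShippable(String fileEntry) {
        return !fileEntry.startsWith(PYCLASSPATH);
    }

    /** Map "__pyclasspath__/scriptA$py.class" back to "scriptA.py". */
    static String toSourceName(String fileEntry) {
        String base = fileEntry.substring(PYCLASSPATH.length());
        return base.replace("$py.class", ".py");
    }
}
```

The point is that once Jython has reported the pseudo-path, the real source file still has to be resolved some other way before it can be packaged.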



[jira] [Commented] (PIG-2433) Jython import module not working if module path is in classpath

2013-01-04 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13544484#comment-13544484
 ] 

Rohini Palaniswamy commented on PIG-2433:
-

One cause for this error could be that your python cache dir is not writable 
and so the pig jar was not processed. Try running with -Dpython.cachedir=/dir 
with write perms if that is the case. Or are you running from eclipse?



[jira] Subscription: PIG patch available

2013-01-04 Thread jira
Issue Subscription
Filter: PIG patch available (36 issues)

Subscriber: pigdaily

Key Summary
PIG-3108  HBaseStorage returns empty maps when mixing wildcard- with other 
columns
https://issues.apache.org/jira/browse/PIG-3108
PIG-3105  Fix TestJobSubmission unit test failure.
https://issues.apache.org/jira/browse/PIG-3105
PIG-3098  Add another test for the self join case
https://issues.apache.org/jira/browse/PIG-3098
PIG-3088  Add a builtin udf which removes prefixes
https://issues.apache.org/jira/browse/PIG-3088
PIG-3086  Allow A Prefix To Be Added To URIs In PigUnit Tests 
https://issues.apache.org/jira/browse/PIG-3086
PIG-3078  Make a UDF that, given a string, returns just the columns prefixed 
by that string
https://issues.apache.org/jira/browse/PIG-3078
PIG-3073  POUserFunc creating log spam for large scripts
https://issues.apache.org/jira/browse/PIG-3073
PIG-3069  Native Windows Compatibility for Pig E2E Tests and Harness
https://issues.apache.org/jira/browse/PIG-3069
PIG-3067  HBaseStorage should be split up to become more managable
https://issues.apache.org/jira/browse/PIG-3067
PIG-3057  make readField protected to be able to override it if we extend 
PigStorage
https://issues.apache.org/jira/browse/PIG-3057
PIG-3029  TestTypeCheckingValidatorNewLP has some path reference issues for 
cross-platform execution
https://issues.apache.org/jira/browse/PIG-3029
PIG-3028  testGrunt dev test needs some command filters to run correctly 
without cygwin
https://issues.apache.org/jira/browse/PIG-3028
PIG-3027  pigTest unit test needs a newline filter for comparisons of golden 
multi-line
https://issues.apache.org/jira/browse/PIG-3027
PIG-3026  Pig checked-in baseline comparisons need a pre-filter to address 
OS-specific newline differences
https://issues.apache.org/jira/browse/PIG-3026
PIG-3025  TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline 
script needs simplification
https://issues.apache.org/jira/browse/PIG-3025
PIG-3024  TestEmptyInputDir unit test - hadoop version detection logic is 
brittle
https://issues.apache.org/jira/browse/PIG-3024
PIG-3015  Rewrite of AvroStorage
https://issues.apache.org/jira/browse/PIG-3015
PIG-3010  Allow UDF's to flatten themselves
https://issues.apache.org/jira/browse/PIG-3010
PIG-2959  Add a pig.cmd for Pig to run under Windows
https://issues.apache.org/jira/browse/PIG-2959
PIG-2957  TetsScriptUDF fail due to volume prefix in jar
https://issues.apache.org/jira/browse/PIG-2957
PIG-2956  Invalid cache specification for some streaming statement
https://issues.apache.org/jira/browse/PIG-2956
PIG-2955  Fix bunch of Pig e2e tests on Windows 
https://issues.apache.org/jira/browse/PIG-2955
PIG-2878  Pig current releases lack a UDF equalIgnoreCase. This function 
returns a Boolean value indicating whether string left is equal to string 
right. This check is case insensitive.
https://issues.apache.org/jira/browse/PIG-2878
PIG-2873  Converting bin/pig shell script to python
https://issues.apache.org/jira/browse/PIG-2873
PIG-2834  MultiStorage requires unused constructor argument
https://issues.apache.org/jira/browse/PIG-2834
PIG-2824  Pushing checking number of fields into LoadFunc
https://issues.apache.org/jira/browse/PIG-2824
PIG-2788  improved string interpolation of variables
https://issues.apache.org/jira/browse/PIG-2788
PIG-2769  a simple logic causes very long compiling time on pig 0.10.0
https://issues.apache.org/jira/browse/PIG-2769
PIG-2661  Pig uses an extra job for loading data in Pigmix L9
https://issues.apache.org/jira/browse/PIG-2661
PIG-2645  PigSplit does not handle the case where SerializationFactory 
returns null
https://issues.apache.org/jira/browse/PIG-2645
PIG-2507  Semicolon in paramenters for UDF results in parsing error
https://issues.apache.org/jira/browse/PIG-2507
PIG-2433  Jython import module not working if module path is in classpath
https://issues.apache.org/jira/browse/PIG-2433
PIG-2417  Streaming UDFs - allow users to easily write UDFs in scripting 
languages with no JVM implementation.
https://issues.apache.org/jira/browse/PIG-2417
PIG-2312  NPE when relation and column share the same name and used in Nested 
Foreach 
https://issues.apache.org/jira/browse/PIG-2312
PIG-1942  script UDF (jython) should utilize the intended output schema to 
more directly convert Py objects to Pig objects
https://issues.apache.org/jira/browse/PIG-1942
PIG-1237  Piggybank MutliStorage - specify field to write in output