Hi Daniel,

PIG-4047 has been created to track this discussion, and a patch is available for review. Could you help review it? The patch file is based on the trunk branch; I assume "site" refers to the trunk branch. Please correct me if I am wrong.
Thanks

At 2014-07-02 08:33:21, "Daniel Dai" <da...@hortonworks.com> wrote:
>I see what's happening. JarManager finds the enclosing jar from the
>classpath and wraps those into job.jar. Originally I wanted to ship
>dependent jars separately through the distributed cache, so we don't
>have to create job.jar every time; those jars would get reused thanks
>to PIG-2672, and for Pig on Tez there is a better chance of
>session/container reuse since we wouldn't rely on a job.jar that
>differs job by job. That would solve the pig-withouthadoop.jar problem
>eventually. But now I realize the pig-withouthadoop.jar issue can be
>solved independently by just putting the dependent jars in lib and
>using pig-core.jar instead of pig-withouthadoop.jar. That's doable,
>and even if I go for the distributed cache later, I would still need
>pig-core.jar in place of pig-withouthadoop.jar and the dependent jars
>in lib, so there is no conflict. So go ahead and do that.
>
>Thanks,
>Daniel
>
>On Mon, Jun 30, 2014 at 11:34 PM, lulynn_2008 <lulynn_2...@163.com> wrote:
>> Hi Daniel,
>> With the newly structured Pig package, the scripts ran successfully with pig-0.12.0.
>> I did the following:
>> 1. divided the withouthadoop jar into pig core and pig core dependencies;
>> 2. saved the jars from step 1 in the lib directory;
>> 3. in the pig script, always add all the jars in the lib directory
>>    plus the pig core jar to the classpath.
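The three steps above amount to a classpath loop in the launcher script. A minimal sketch of step 3, assuming the restructured layout described above (the demo directory, `pig-core.jar` name, and dependency jar names are illustrative, not the actual bin/pig contents):

```shell
# Sketch only, not the real bin/pig: put pig-core.jar plus every jar
# under lib/ on the classpath. All paths and jar names are made up.
PIG_HOME=${PIG_HOME:-/tmp/pig-layout-demo}
mkdir -p "$PIG_HOME/lib"
touch "$PIG_HOME/pig-core.jar" \
      "$PIG_HOME/lib/guava-11.0.jar" \
      "$PIG_HOME/lib/joda-time-2.1.jar"

CLASSPATH="$PIG_HOME/pig-core.jar"
for jar in "$PIG_HOME"/lib/*.jar; do
    CLASSPATH="$CLASSPATH:$jar"   # append each dependency jar one by one
done
echo "$CLASSPATH"
```

The same loop works regardless of which jars land in lib/, which is the maintainability point made later in the thread: swapping a dependency version only means replacing a file in lib/.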
>>
>> Here is the test case and result:
>> [pig@hostname bin]$ ./pig
>> grunt> a = load 'studenttab10k' as (name:chararray, age:int, gpa:double);
>> grunt> b = filter a by name matches '.*or.*';
>> grunt> dump b;
>> INFO [JobControl] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>> INFO [main] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>> (.hor.h,11,8.0)
>> grunt> dump a;
>> INFO [JobControl] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>> INFO [main] org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
>> (lynn,18,4.5)
>> (lulu,12,4.4)
>> (fat,12,5.0)
>> (.hor.h,11,8.0)
>>
>> Please correct me if I am wrong. I think it should look the same to the
>> backend whether we add the withouthadoop jar to the classpath or add
>> pig core and its dependencies one by one. Pig should be fine once all
>> the dependencies are on the classpath, no matter which way they got there.
>>
>> Do you have any suggestions for other tests, or any comments?
>>
>> Thanks
>>
>> At 2014-06-25 11:51:24, "Daniel Dai" <da...@hortonworks.com> wrote:
>>>Pig will not fail every time, but you will hit it depending on your
>>>script. Try the following script:
>>>
>>>a = load 'studenttab10k' as (name:chararray, age:int, gpa:double);
>>>b = filter a by name matches '.*or.*';
>>>dump b;
>>>
>>>Thanks,
>>>Daniel
>>>
>>>On Tue, Jun 24, 2014 at 7:23 PM, lulynn_2008 <lulynn_2...@163.com> wrote:
>>>> Hi Daniel,
>>>> Thanks for the details.
>>>>
>>>> Regarding your reply:
>>>> "Yes, that's a better solution. However, those jars are also needed in the
>>>> backend and we need to ship those jars into distributed cache. There will
>>>> be some code change to make it happen. Just change pig script is not
>>>> sufficient."
>>>> On the distributed cache part, could you provide an example showing
>>>> how those jars are shipped?
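For context on the shipping question above: one common way Hadoop jobs get extra jars to the backend is the distributed cache, driven by a comma-separated list of jar paths (for example via the generic `-libjars` option, which populates the `tmpjars` job property); the framework copies each listed jar to the cache and adds it to the task classpath. A minimal, hedged sketch of building such a list from a lib directory, with made-up directory and jar names:

```shell
# Sketch: assemble the comma-separated jar list that Hadoop's
# distributed-cache machinery expects (e.g. for -libjars).
# Directory and jar names are illustrative only.
LIB_DIR=${LIB_DIR:-/tmp/pig-libjars-demo}
mkdir -p "$LIB_DIR"
touch "$LIB_DIR/guava-11.0.jar" "$LIB_DIR/joda-time-2.1.jar"

LIBJARS=""
for jar in "$LIB_DIR"/*.jar; do
    LIBJARS="${LIBJARS:+$LIBJARS,}$jar"   # comma-separate, no leading comma
done
echo "$LIBJARS"
# A job launcher could then pass something like:
#   hadoop jar app.jar MainClass -libjars "$LIBJARS" ...
```

As Daniel notes, doing this from inside Pig would require code changes (the client must register the jars with the job), not just a launcher-script change.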
>>>> Thanks
>>>>
>>>> Since release 0.10.0, I have been using Pig with the structure I
>>>> mentioned in 2#. That is:
>>>> - dividing the withouthadoop jar into pig core and its dependencies;
>>>> - placing all the dependencies in the lib directory;
>>>> - adding pig core and these dependencies to the classpath in the pig
>>>>   script, instead of the withouthadoop jar.
>>>> I have also used pig-0.11.1 and 0.12.0, always running Pig through the
>>>> grunt shell, and have not encountered issues/errors so far. I run
>>>> avro/hbase/hive/hadoop related things. Did I miss anything in testing
>>>> this part? Please give your suggestions.
>>>> Thanks
>>>>
>>>> At 2014-06-25 02:34:03, "Daniel Dai" <da...@hortonworks.com> wrote:
>>>>>On Mon, Jun 23, 2014 at 8:13 PM, lulynn_2008 <lulynn_2...@163.com> wrote:
>>>>>> Hi All,
>>>>>> In build.xml in branch-0.13:
>>>>>> 1. the following jars are copied into the lib directory during packaging:
>>>>>> <copy todir="${tar.dist.dir}/lib">
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="jython-*.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="jruby-*.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="groovy-*.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="js-*.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="hbase-*.jar" excludes="hbase-*tests.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="protobuf-java-*.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="zookeeper-*.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="accumulo-*.jar" excludes="accumulo-minicluster*.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="avro-*.jar" excludes="avro-*tests.jar"/>
>>>>>>     <fileset dir="${ivy.lib.dir}" includes="json-simple-*.jar"/>
>>>>>> </copy>
>>>>>> 2.
>>>>>> The following jars are included in the withouthadoop jar file:
>>>>>> <fileset dir="${ivy.lib.dir}" id="runtime.dependencies-withouthadoop.jar">
>>>>>>     <patternset id="pattern.runtime.dependencies-withouthadoop.jar">
>>>>>>         <include name="antlr-runtime-${antlr.version}.jar"/>
>>>>>>         <include name="ST4-${stringtemplate.version}.jar"/>
>>>>>>         <include name="jline-${jline.version}.jar"/>
>>>>>>         <include name="jackson-mapper-asl-${jackson.version}.jar"/>
>>>>>>         <include name="jackson-core-asl-${jackson.version}.jar"/>
>>>>>>         <include name="joda-time-${joda-time.version}.jar"/>
>>>>>>         <include name="guava-${guava.version}.jar"/>
>>>>>>         <include name="automaton-${automaton.version}.jar"/>
>>>>>>         <include name="jansi-${jansi.version}.jar"/>
>>>>>>         <include name="avro-${avro.version}.jar"/>
>>>>>>         <include name="avro-mapred-${avro.version}.jar"/>
>>>>>>         <include name="trevni-core-${avro.version}.jar"/>
>>>>>>         <include name="trevni-avro-${avro.version}.jar"/>
>>>>>>         <include name="snappy-java-${snappy.version}.jar"/>
>>>>>>         <include name="asm*.jar"/>
>>>>>>     </patternset>
>>>>>> </fileset>
>>>>>> Questions:
>>>>>> 1. Could you tell what the jars in 1# and 2# are used for? What are
>>>>>> the differences between them?
>>>>>> 2. It seems all the jars in 1# and 2# are necessary when Pig runs on
>>>>>> a hadoop cluster. Correct?
>>>>>
>>>>>#2 are the core dependencies Pig depends on; those are wrapped into
>>>>>pig-withouthadoop.jar. #1 are convenience jars used only when the user
>>>>>uses a specific loader, such as HBaseStorage/AvroStorage; they are not
>>>>>needed if you don't use such a loader, and are not wrapped into
>>>>>pig-withouthadoop.jar.
>>>>>
>>>>>> 3. The withouthadoop jar file is invoked while Pig runs on a hadoop
>>>>>> cluster, and it includes pig core plus some dependencies. Could we
>>>>>> just generate a pig core jar and move all the dependencies into the
>>>>>> lib directory?
>>>>>> In the pig script, we could then always add pig core to the classpath,
>>>>>> and add the dependencies to the classpath in the
>>>>>> "if [ -n "$HADOOP_BIN" ]; then" branch. With this:
>>>>>> - it would be clear for users to check dependency versions;
>>>>>> - it would be easier to maintain dependencies, e.g. version updates.
>>>>>> Sometimes I would like to check whether Pig works with different
>>>>>> versions of its dependencies.
>>>>>
>>>>>Yes, that's a better solution. However, those jars are also needed on
>>>>>the backend, and we need to ship those jars via the distributed cache.
>>>>>There will be some code change to make it happen. Just changing the
>>>>>pig script is not sufficient.
>>>>>
>>>>>> Do you have any concerns?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>--
>>>>>CONFIDENTIALITY NOTICE
>>>>>NOTICE: This message is intended for the use of the individual or entity to
>>>>>which it is addressed and may contain information that is confidential,
>>>>>privileged and exempt from disclosure under applicable law. If the reader
>>>>>of this message is not the intended recipient, you are hereby notified that
>>>>>any printing, copying, dissemination, distribution, disclosure or
>>>>>forwarding of this communication is strictly prohibited. If you have
>>>>>received this communication in error, please contact the sender immediately
>>>>>and delete it from your system. Thank You.