Sounds good. Here is a doc on contributing a patch (for some pointers):
https://cwiki.apache.org/confluence/display/PIG/HowToContribute


On Mon, Apr 15, 2013 at 4:37 PM, Bhooshan Mogal <[email protected]> wrote:

> Hey Prashant,
>
> Yup, I can take a stab at it. This is the first time I am looking at Pig
> code, so I might take some time to get started. Will get back to you if I
> have questions in the meantime. And yes, I will write it so it reads a pig
> property.
>
> -
> Bhooshan.
>
>
> On Mon, Apr 15, 2013 at 11:58 AM, Prashant Kommireddi
> <[email protected]> wrote:
>
>> Hi Bhooshan,
>>
>> This makes more sense now. I think overriding the fs implementation should
>> go into core-site.xml, but it would be useful to be able to add resources
>> if you have a bunch of other properties.
>>
>> Would you like to submit a patch? It should be based on a pig property
>> that specifies the additional resource names (myfs-site.xml in your case).
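As a rough illustration (the property name and value below are hypothetical, not an existing Pig setting), such a property could look like this in pig.properties:

```properties
# Hypothetical property -- the actual name would be decided as part of the patch.
# Comma-separated resource files to add to the JobConf before job submission.
pig.additional.hadoop.resources=myfs-default.xml,myfs-site.xml
```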
>>
>> -Prashant
>>
>>
>> On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <
>> [email protected]> wrote:
>>
>>> Hi Prashant,
>>>
>>>
>>> Yes, I am running in MapReduce mode. Let me give you the steps in the
>>> scenario that I am trying to test -
>>>
>>> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a
>>> filesystem I am trying to implement - Let's call it MyFileSystem.class.
>>> This filesystem uses the scheme myfs:// for its URIs.
>>> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and
>>> made the class available through a jar file that is part of
>>> HADOOP_CLASSPATH (or PIG_CLASSPATH).
>>> 3. In MyFileSystem.class, I have a static block as -
>>> static {
>>>     Configuration.addDefaultResource("myfs-default.xml");
>>>     Configuration.addDefaultResource("myfs-site.xml");
>>> }
>>> Both these files are in the classpath. To be safe, I have also added
>>> myfs-site.xml in the constructor of MyFileSystem as
>>> conf.addResource("myfs-site.xml"), so that it is part of both the default
>>> resources and the non-default resources in the Configuration object.
>>> 4. I am trying to access the filesystem in my pig script as -
>>> A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS
>>> (name:chararray, age:int); -- loading data
>>> B = FOREACH A GENERATE name;
>>> STORE B INTO 'myfs://myhost.com:8999/testoutput';
>>> 5. The execution seems to start correctly, and MyFileSystem.class is
>>> invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml
>>> is loaded and the properties defined in it are available.
>>> 6. However, when Pig tries to submit the job, it cannot find these
>>> properties and the job fails to submit.
>>> 7. If I move all the properties defined in myfs-site.xml to
>>> core-site.xml, the job gets submitted successfully, and it even succeeds.
>>> However, this is not ideal as I do not want to pollute core-site.xml
>>> with all of the properties for a separate filesystem.
>>> 8. As I said earlier, upon taking a closer look at the pig code, I saw
>>> that while creating the JobConf object for a job, pig adds very specific
>>> resources to the job object, and ignores the resources that may have been
>>> added already (e.g. myfs-site.xml) in the Configuration object.
>>> 9. I have tested this with native map-reduce code as well as hive, and
>>> this approach of having a separate config file for MyFileSystem works fine
>>> in both those cases.
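For reference, the wiring from steps 1 and 2 above might look like the following sketch (the package name and the myfs.* property are illustrative, not actual values from my setup):

```xml
<!-- core-site.xml: map the myfs:// scheme to the implementation class -->
<configuration>
  <property>
    <name>fs.myfs.impl</name>
    <!-- fully-qualified class name; the package here is illustrative -->
    <value>com.example.MyFileSystem</value>
  </property>
</configuration>

<!-- myfs-site.xml: filesystem-specific settings the jobs need -->
<configuration>
  <property>
    <!-- illustrative property; stands in for whatever MyFileSystem reads -->
    <name>myfs.connection.endpoint</name>
    <value>myhost.com:8999</value>
  </property>
</configuration>
```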
>>>
>>> So, to summarize, I am looking for a way to ask Pig to load parameters
>>> from my own config file before submitting a job.
>>>
>>> Thanks,
>>> -
>>> Bhooshan.
>>>
>>>
>>>
>>> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <
>>> [email protected]> wrote:
>>>
>>>> +User group
>>>>
>>>> Hi Bhooshan,
>>>>
>>>> By default you should be running in MapReduce mode unless specified
>>>> otherwise. Are you creating a PigServer object to run your jobs? Can you
>>>> provide your code here?
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]>
>>>> wrote:
>>>>
>>>> Apologies for the premature send. I may have some more information.
>>>> After I applied the patch and set "pig.use.overriden.hadoop.configs=true",
>>>> I saw an NPE (stacktrace below) and a message saying pig was running in
>>>> exectype local -
>>>>
>>>> 2013-04-13 07:37:13,758 [main] INFO
>>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>>>> to hadoop file system at: local
>>>> 2013-04-13 07:37:13,760 [main] WARN
>>>> org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is
>>>> deprecated. Instead, use mapreduce.client.genericoptionsparser.used
>>>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>>> ERROR 1200: Pig script failed to parse:
>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>> java.lang.NullPointerException
>>>>
>>>>
>>>> Here is the stack trace -
>>>>
>>>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>>>> during parsing. Pig script failed to parse:
>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>> java.lang.NullPointerException
>>>>         at
>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>>>         at
>>>> org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>>>         at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>>>         at
>>>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>>>         at
>>>> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>>>         at
>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>>>         at
>>>> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>>>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>>>         at org.apache.pig.Main.run(Main.java:555)
>>>>         at org.apache.pig.Main.main(Main.java:111)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>         at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>         at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>> Caused by: Failed to parse: Pig script failed to parse:
>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>> java.lang.NullPointerException
>>>>         at
>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>>>         at
>>>> org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>>>         ... 14 more
>>>> Caused by:
>>>> <file test.pig, line 1, column 4> pig script failed to validate:
>>>> java.lang.NullPointerException
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>>>         at
>>>> org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>>>         at
>>>> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>>>         ... 15 more
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <
>>>> [email protected]> wrote:
>>>>
>>>>> Yes, however I did not add core-site.xml, hdfs-site.xml,
>>>>> yarn-site.xml. Only my-filesystem-site.xml using both
>>>>> Configuration.addDefaultResource and Configuration.addResource.
>>>>>
>>>>> I see what you are saying though. The patch might require users to
>>>>> take care of adding the default config resources in addition to their
>>>>> own resources?
>>>>>
>>>>>
>>>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your
>>>>>> configuration resources?
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Prashant,
>>>>>>>
>>>>>>> Thanks for your response to my question, and sorry for the delayed
>>>>>>> reply. I was not subscribed to the dev mailing list and hence did not
>>>>>>> get a notification about your reply. I have copied our thread below so
>>>>>>> you can get some context.
>>>>>>>
>>>>>>> I tried the patch that you pointed to; however, with that patch it
>>>>>>> looks like pig is unable to find core-site.xml. It indicates that it
>>>>>>> is running the script in local mode in spite of having fs.default.name
>>>>>>> defined as the location of the HDFS namenode.
>>>>>>>
>>>>>>> Here is what I am trying to do - I have developed my own
>>>>>>> org.apache.hadoop.fs.FileSystem implementation and am trying to use it
>>>>>>> in my pig script. This implementation requires its own *-default.xml
>>>>>>> and *-site.xml files. I have added the path to these files in
>>>>>>> PIG_CLASSPATH as well as HADOOP_CLASSPATH and can confirm that hadoop
>>>>>>> can find these files, as I am able to read these configurations in my
>>>>>>> code. However, pig code cannot find these configuration parameters.
>>>>>>> Upon doing some debugging in the pig code, it seems to me that pig
>>>>>>> does not use all the resources added in the Configuration object, but
>>>>>>> only seems to use certain specific ones like hadoop-site.xml,
>>>>>>> core-site.xml, pig-cluster-hadoop-site.xml, yarn-site.xml, and
>>>>>>> hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible
>>>>>>> to have pig load user-defined resources, say foo-default.xml and
>>>>>>> foo-site.xml, while creating the JobConf object? I am narrowing in on
>>>>>>> this as the problem, because pig can find my config parameters if I
>>>>>>> define them in core-site.xml instead of my-filesystem-site.xml.
>>>>>>>
>>>>>>> Let me know if you need more details about the issue.
>>>>>>>
>>>>>>>
>>>>>>> Here is our previous conversation -
>>>>>>>
>>>>>>> Hi Bhooshan,
>>>>>>>
>>>>>>> There is a patch that addresses what you need, and it is part of 0.12
>>>>>>> (unreleased). Take a look and see if you can apply the patch to the
>>>>>>> version you are using: https://issues.apache.org/jira/browse/PIG-3135
>>>>>>>
>>>>>>> With this patch, the following property will allow you to override the
>>>>>>> default and pass in your own configuration.
>>>>>>> pig.use.overriden.hadoop.configs=true
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal 
>>>>>>> <[email protected]>wrote:
>>>>>>>
>>>>>>> > Hi Folks,
>>>>>>> >
>>>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage
>>>>>>> > system at work. This implementation uses some config files that are
>>>>>>> > similar in structure to hadoop config files. They have a
>>>>>>> > *-default.xml and a *-site.xml for users to override default
>>>>>>> > properties. In the class that implemented the Hadoop FileSystem, I
>>>>>>> > had added these configuration files as default resources in a static
>>>>>>> > block using Configuration.addDefaultResource("my-default.xml") and
>>>>>>> > Configuration.addDefaultResource("my-site.xml"). This was working
>>>>>>> > fine and we were able to run the Hadoop Filesystem CLI and
>>>>>>> > map-reduce jobs just fine for our storage system. However, when we
>>>>>>> > tried using this storage system in pig scripts, we saw errors
>>>>>>> > indicating that our configuration parameters were not available.
>>>>>>> > Upon further debugging, we saw that the config files were added to
>>>>>>> > the Configuration object as resources, but were part of
>>>>>>> > defaultResources. However, in Main.java in the pig source, we saw
>>>>>>> > that the Configuration object was created as "Configuration conf =
>>>>>>> > new Configuration(false);", thereby setting loadDefaults to false in
>>>>>>> > the conf object. As a result, properties from the default resources
>>>>>>> > (including my config files) were not loaded and hence unavailable.
>>>>>>> >
>>>>>>> > We solved the problem by using Configuration.addResource instead of
>>>>>>> > Configuration.addDefaultResource, but still could not figure out why
>>>>>>> > Pig does not use default resources.
>>>>>>> >
>>>>>>> > Could someone on the list explain why this is the case?
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > --
>>>>>>> > Bhooshan
>>>>>>> >
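The loadDefaults behavior described above can be modeled with a small, stdlib-only sketch. MiniConf below is a hypothetical stand-in for Hadoop's Configuration, not the real class: resources registered as defaults are visible only when the object is constructed with loadDefaults set to true, while addResource applies either way.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for org.apache.hadoop.conf.Configuration, illustrating
// why "new Configuration(false)" hides properties from default resources.
// (Hadoop's real API takes resource file names; this model takes maps.)
class MiniConf {
    // Shared registry, like Configuration.addDefaultResource(String)
    private static final List<Map<String, String>> DEFAULTS = new ArrayList<>();
    private final Map<String, String> props = new HashMap<>();

    MiniConf(boolean loadDefaults) {
        if (loadDefaults) {
            // Default resources are folded in only when loadDefaults is true
            for (Map<String, String> res : DEFAULTS) {
                props.putAll(res);
            }
        }
    }

    static void addDefaultResource(Map<String, String> resource) {
        DEFAULTS.add(resource);
    }

    void addResource(Map<String, String> resource) {
        props.putAll(resource); // non-default resources always apply
    }

    String get(String key) {
        return props.get(key);
    }
}
```

Under this model, a conf created with loadDefaults set to false never sees anything registered via addDefaultResource, which matches the behavior observed with Pig's Main.java, while properties added via addResource survive either way.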
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Bhooshan
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Bhooshan
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Bhooshan
>>>>
>>>>
>>>
>>>
>>> --
>>> Bhooshan
>>>
>>
>>
>
>
> --
> Bhooshan
>
