Hi Bhooshan,

This makes more sense now. I think overriding the fs implementation should go into core-site.xml, but it would be useful to be able to add resources if you have a bunch of other properties.
Would you like to submit a patch? It should be based on a Pig property that suggests the additional resource names (myfs-site.xml in your case).

-Prashant

On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <[email protected]> wrote:

> Hi Prashant,
>
> Yes, I am running in MapReduce mode. Let me give you the steps in the scenario that I am trying to test:
>
> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a filesystem I am trying to implement - let's call it MyFileSystem.class. This filesystem uses the scheme myfs:// for its URIs.
> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and made the class available through a jar file that is part of HADOOP_CLASSPATH (or PIG_CLASSPATH).
> 3. In MyFileSystem.class, I have a static block:
>
>        static {
>            Configuration.addDefaultResource("myfs-default.xml");
>            Configuration.addDefaultResource("myfs-site.xml");
>        }
>
>    Both these files are in the classpath. To be safe, I have also added myfs-site.xml in the constructor of MyFileSystem as conf.addResource("myfs-site.xml"), so that it is part of both the default resources and the non-default resources in the Configuration object.
> 4. I am trying to access the filesystem in my Pig script as:
>
>        A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS (name:chararray, age:int); -- loading data
>        B = FOREACH A GENERATE name;
>        STORE B INTO 'myfs://myhost.com:8999/testoutput';
>
> 5. The execution seems to start correctly, and MyFileSystem.class is invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml is loaded and the properties defined in it are available.
> 6. However, when Pig tries to submit the job, it cannot find these properties and the job fails to submit successfully.
> 7. If I move all the properties defined in myfs-site.xml to core-site.xml, the job gets submitted successfully, and it even succeeds.
> However, this is not ideal, as I do not want to clutter core-site.xml with all of the properties for a separate filesystem.
> 8. As I said earlier, upon taking a closer look at the Pig code, I saw that while creating the JobConf object for a job, Pig adds very specific resources to the job object, and ignores the resources that may have already been added (e.g. myfs-site.xml) in the Configuration object.
> 9. I have tested this with native map-reduce code as well as Hive, and this approach of having a separate config file for MyFileSystem works fine in both those cases.
>
> So, to summarize, I am looking for a way to ask Pig to load parameters from my own config file before submitting a job.
>
> Thanks,
> --
> Bhooshan.
>
> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> +User group
>>
>> Hi Bhooshan,
>>
>> By default you should be running in MapReduce mode unless specified otherwise. Are you creating a PigServer object to run your jobs? Can you provide your code here?
>>
>> Sent from my iPhone
>>
>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]> wrote:
>>
>> Apologies for the premature send. I may have some more information. After I applied the patch and set "pig.use.overriden.hadoop.configs=true", I saw an NPE (stack trace below) and a message saying Pig was running in exectype local:
>>
>> 2013-04-13 07:37:13,758 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: local
>> 2013-04-13 07:37:13,760 [main] WARN  org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated.
>> Instead, use mapreduce.client.genericoptionsparser.used
>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>
>> Here is the stack trace:
>>
>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>     at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>     at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>     at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>     at org.apache.pig.Main.run(Main.java:555)
>>     at org.apache.pig.Main.main(Main.java:111)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:616)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>> Caused by: Failed to parse: Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>     at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>     ... 14 more
>> Caused by:
>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>     at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>     at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>     at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>     at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>     at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>     at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>     at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>     ... 15 more
>>
>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <[email protected]> wrote:
>>
>>> Yes; however, I did not add core-site.xml, hdfs-site.xml, or yarn-site.xml, only my-filesystem-site.xml, using both Configuration.addDefaultResource and Configuration.addResource.
>>>
>>> I see what you are saying though. The patch might require users to take care of adding the default config resources as well, apart from their own resources?
>>>
>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <[email protected]> wrote:
>>>
>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your configuration resources?
>>>>
>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>
>>>>> Hi Prashant,
>>>>>
>>>>> Thanks for your response to my question, and sorry for the delayed reply. I was not subscribed to the dev mailing list and hence did not get a notification about your reply. I have copied our thread below so you can get some context.
>>>>>
>>>>> I tried the patch that you pointed to; however, with that patch it looks like Pig is unable to find core-site.xml.
>>>>> It indicates that it is running the script in local mode in spite of having fs.default.name defined as the location of the HDFS namenode.
>>>>>
>>>>> Here is what I am trying to do: I have developed my own org.apache.hadoop.fs.FileSystem implementation and am trying to use it in my Pig script. This implementation requires its own *-default.xml and *-site.xml files. I have added the path to these files in PIG_CLASSPATH as well as HADOOP_CLASSPATH and can confirm that Hadoop can find these files, as I am able to read these configurations in my code. However, the Pig code cannot find these configuration parameters. Upon doing some debugging in the Pig code, it seems to me that Pig does not use all the resources added in the Configuration object, but only certain specific ones like hadoop-site.xml, core-site.xml, pig-cluster-hadoop-site.xml, yarn-site.xml, and hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to have Pig load user-defined resources like, say, foo-default.xml and foo-site.xml while creating the JobConf object? I am narrowing in on this as the problem, because Pig can find my config parameters if I define them in core-site.xml instead of my-filesystem-site.xml.
>>>>>
>>>>> Let me know if you need more details about the issue.
>>>>>
>>>>> Here is our previous conversation:
>>>>>
>>>>> Hi Bhooshan,
>>>>>
>>>>> There is a patch that addresses what you need, and it is part of 0.12 (unreleased). Take a look and see if you can apply the patch to the version you are using: https://issues.apache.org/jira/browse/PIG-3135
>>>>>
>>>>> With this patch, the following property will allow you to override the default and pass in your own configuration.
>>>>> pig.use.overriden.hadoop.configs=true
>>>>>
>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>
>>>>> > Hi Folks,
>>>>> >
>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage system at work. This implementation uses some config files that are similar in structure to Hadoop config files. They have a *-default.xml and a *-site.xml for users to override default properties. In the class that implemented the Hadoop FileSystem, I had added these configuration files as default resources in a static block, using Configuration.addDefaultResource("my-default.xml") and Configuration.addDefaultResource("my-site.xml"). This was working fine, and we were able to run the Hadoop filesystem CLI and map-reduce jobs just fine for our storage system. However, when we tried using this storage system in Pig scripts, we saw errors indicating that our configuration parameters were not available. Upon further debugging, we saw that the config files were added to the Configuration object as resources, but were part of defaultResources. However, in Main.java in the Pig source, we saw that the Configuration object was created as "Configuration conf = new Configuration(false);", thereby setting loadDefaults to false in the conf object. As a result, properties from the default resources (including my config files) were not loaded and hence were unavailable.
>>>>> >
>>>>> > We solved the problem by using Configuration.addResource instead of Configuration.addDefaultResource, but still could not figure out why Pig does not use default resources.
>>>>> >
>>>>> > Could someone on the list explain why this is the case?
>>>>> >
>>>>> > Thanks,
>>>>> > --
>>>>> > Bhooshan
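The loadDefaults behavior discussed in the thread can be sketched with a small, self-contained mock. This is not Hadoop source: the class below only mimics the interaction between the static defaultResources list and the loadDefaults constructor flag of org.apache.hadoop.conf.Configuration, to show why resources registered via addDefaultResource are invisible to a Configuration created with "new Configuration(false)" as in Pig's Main.java.

```java
// Simplified mock (NOT Hadoop source) of Configuration's default-resource handling.
// Resources registered through the static addDefaultResource() are only consulted
// by instances created with loadDefaults=true.
import java.util.ArrayList;
import java.util.List;

public class ConfDemo {
    // Shared across all instances, like Hadoop's static defaultResources list.
    static final List<String> defaultResources = new ArrayList<>();

    static void addDefaultResource(String name) {
        defaultResources.add(name);
    }

    final boolean loadDefaults;
    final List<String> resources = new ArrayList<>();

    ConfDemo(boolean loadDefaults) {
        this.loadDefaults = loadDefaults;
    }

    void addResource(String name) {
        resources.add(name);
    }

    // The files whose properties would actually be loaded by this instance.
    List<String> effectiveResources() {
        List<String> out = new ArrayList<>();
        if (loadDefaults) {
            out.addAll(defaultResources); // skipped entirely when loadDefaults=false
        }
        out.addAll(resources);
        return out;
    }

    public static void main(String[] args) {
        // What MyFileSystem's static block does:
        addDefaultResource("myfs-default.xml");
        addDefaultResource("myfs-site.xml");

        ConfDemo fsShell = new ConfDemo(true);   // Hadoop CLI / MR path: sees the defaults
        ConfDemo pigMain = new ConfDemo(false);  // Pig's Main.java path: ignores them
        pigMain.addResource("core-site.xml");    // Pig then adds only specific resources

        System.out.println("loadDefaults=true:  " + fsShell.effectiveResources());
        System.out.println("loadDefaults=false: " + pigMain.effectiveResources());
    }
}
```

Run as written, the first line lists the myfs resources and the second lists only core-site.xml, which matches the thread's observation that properties become visible to Pig only when moved into core-site.xml (or added with addResource on the conf Pig actually submits).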

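For reference, the scheme registration described in step 2 of the thread would look roughly like the following core-site.xml fragment. The package "com.example" is a hypothetical placeholder; the thread only gives the class name MyFileSystem.

```xml
<!-- core-site.xml: map myfs:// URIs to the custom FileSystem implementation.
     The package "com.example" is a hypothetical placeholder. -->
<property>
  <name>fs.myfs.impl</name>
  <value>com.example.MyFileSystem</value>
</property>
```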