Hi Bhooshan,

This makes more sense now. I think overriding the fs implementation should go into core-site.xml, but it would be useful to be able to add resources if you have a bunch of other properties.
Would you like to submit a patch? It should be based on a Pig property that suggests the additional resource names (myfs-site.xml in your case).

-Prashant

On Mon, Apr 15, 2013 at 10:35 AM, Bhooshan Mogal <[email protected]> wrote:

> Hi Prashant,
>
> Yes, I am running in MapReduce mode. Let me give you the steps in the scenario that I am trying to test:
>
> 1. I have my own implementation of org.apache.hadoop.fs.FileSystem for a filesystem I am trying to implement - let's call it MyFileSystem.class. This filesystem uses the scheme myfs:// for its URIs.
> 2. I have set fs.myfs.impl to MyFileSystem.class in core-site.xml and made the class available through a jar file that is part of HADOOP_CLASSPATH (or PIG_CLASSPATH).
> 3. In MyFileSystem.class, I have a static block:
>
>        static {
>            Configuration.addDefaultResource("myfs-default.xml");
>            Configuration.addDefaultResource("myfs-site.xml");
>        }
>
>    Both these files are in the classpath. To be safe, I have also added myfs-site.xml in the constructor of MyFileSystem as conf.addResource("myfs-site.xml"), so that it is part of both the default resources and the non-default resources in the Configuration object.
> 4. I am trying to access the filesystem in my Pig script as:
>
>        A = LOAD 'myfs://myhost.com:8999/testdata' USING PigStorage(':') AS (name:chararray, age:int); -- loading data
>        B = FOREACH A GENERATE name;
>        STORE B INTO 'myfs://myhost.com:8999/testoutput';
>
> 5. The execution seems to start correctly, and MyFileSystem.class is invoked correctly. In MyFileSystem.class, I can also see that myfs-site.xml is loaded and the properties defined in it are available.
> 6. However, when Pig tries to submit the job, it cannot find these properties and the job fails to submit successfully.
> 7. If I move all the properties defined in myfs-site.xml to core-site.xml, the job gets submitted successfully, and it even succeeds.
> However, this is not ideal, as I do not want to clutter core-site.xml with all of the properties for a separate filesystem.
> 8. As I said earlier, upon taking a closer look at the Pig code, I saw that while creating the JobConf object for a job, Pig adds very specific resources to the job object, and ignores the resources that may have already been added (e.g. myfs-site.xml) in the Configuration object.
> 9. I have tested this with native map-reduce code as well as Hive, and this approach of having a separate config file for MyFileSystem works fine in both those cases.
>
> So, to summarize, I am looking for a way to ask Pig to load parameters from my own config file before submitting a job.
>
> Thanks,
> --
> Bhooshan.
>
> On Fri, Apr 12, 2013 at 9:57 PM, Prashant Kommireddi <[email protected]> wrote:
>
>> +User group
>>
>> Hi Bhooshan,
>>
>> By default you should be running in MapReduce mode unless specified otherwise. Are you creating a PigServer object to run your jobs? Can you provide your code here?
>>
>> Sent from my iPhone
>>
>> On Apr 12, 2013, at 6:23 PM, Bhooshan Mogal <[email protected]> wrote:
>>
>> Apologies for the premature send. I may have some more information. After I applied the patch and set "pig.use.overriden.hadoop.configs=true", I saw an NPE (stack trace below) and a message saying Pig was running in exectype local:
>>
>> 2013-04-13 07:37:13,758 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: local
>> 2013-04-13 07:37:13,760 [main] WARN  org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated.
>> Instead, use mapreduce.client.genericoptionsparser.used
>> 2013-04-13 07:37:14,162 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>
>> Here is the stack trace:
>>
>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1606)
>>     at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1549)
>>     at org.apache.pig.PigServer.registerQuery(PigServer.java:549)
>>     at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:971)
>>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>>     at org.apache.pig.Main.run(Main.java:555)
>>     at org.apache.pig.Main.main(Main.java:111)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:616)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>> Caused by: Failed to parse: Pig script failed to parse:
>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>     at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
>>     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1598)
>>     ... 14 more
>> Caused by:
>> <file test.pig, line 1, column 4> pig script failed to validate: java.lang.NullPointerException
>>     at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:438)
>>     at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3168)
>>     at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
>>     at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
>>     at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
>>     at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
>>     at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
>>     ... 15 more
>>
>> On Fri, Apr 12, 2013 at 6:16 PM, Bhooshan Mogal <[email protected]> wrote:
>>
>>> Yes; however, I did not add core-site.xml, hdfs-site.xml, or yarn-site.xml, only my-filesystem-site.xml, using both Configuration.addDefaultResource and Configuration.addResource.
>>>
>>> I see what you are saying though. The patch might require users to take care of adding the default config resources as well, apart from their own resources?
>>>
>>> On Fri, Apr 12, 2013 at 6:06 PM, Prashant Kommireddi <[email protected]> wrote:
>>>
>>>> Did you set "pig.use.overriden.hadoop.configs=true" and then add your configuration resources?
>>>>
>>>> On Fri, Apr 12, 2013 at 5:32 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>
>>>>> Hi Prashant,
>>>>>
>>>>> Thanks for your response to my question, and sorry for the delayed reply. I was not subscribed to the dev mailing list and hence did not get a notification about your reply. I have copied our thread below so you can get some context.
>>>>>
>>>>> I tried the patch that you pointed to; however, with that patch it looks like Pig is unable to find core-site.xml.
>>>>> It indicates that it is running the script in local mode in spite of having fs.default.name defined as the location of the HDFS namenode.
>>>>>
>>>>> Here is what I am trying to do: I have developed my own org.apache.hadoop.fs.FileSystem implementation and am trying to use it in my Pig script. This implementation requires its own *-default.xml and *-site.xml files. I have added the path to these files in PIG_CLASSPATH as well as HADOOP_CLASSPATH and can confirm that Hadoop can find these files, as I am able to read these configurations in my code. However, the Pig code cannot find these configuration parameters. Upon doing some debugging in the Pig code, it seems to me that Pig does not use all the resources added in the Configuration object, but only certain specific ones like hadoop-site.xml, core-site.xml, pig-cluster-hadoop-site.xml, yarn-site.xml, and hdfs-site.xml (I am looking at HExecutionEngine.java). Is it possible to have Pig load user-defined resources like, say, foo-default.xml and foo-site.xml while creating the JobConf object? I am narrowing in on this as the problem, because Pig can find my config parameters if I define them in core-site.xml instead of my-filesystem-site.xml.
>>>>>
>>>>> Let me know if you need more details about the issue.
>>>>>
>>>>> Here is our previous conversation:
>>>>>
>>>>> Hi Bhooshan,
>>>>>
>>>>> There is a patch that addresses what you need, and it is part of 0.12 (unreleased). Take a look and see if you can apply the patch to the version you are using: https://issues.apache.org/jira/browse/PIG-3135
>>>>>
>>>>> With this patch, the following property will allow you to override the default and pass in your own configuration.
>>>>> pig.use.overriden.hadoop.configs=true
>>>>>
>>>>> On Thu, Mar 28, 2013 at 6:10 PM, Bhooshan Mogal <[email protected]> wrote:
>>>>>
>>>>> > Hi Folks,
>>>>> >
>>>>> > I had implemented the Hadoop FileSystem abstract class for a storage system at work. This implementation uses some config files that are similar in structure to Hadoop config files. They have a *-default.xml and a *-site.xml for users to override default properties. In the class that implemented the Hadoop FileSystem, I had added these configuration files as default resources in a static block, using Configuration.addDefaultResource("my-default.xml") and Configuration.addDefaultResource("my-site.xml"). This was working fine, and we were able to run the Hadoop filesystem CLI and map-reduce jobs just fine for our storage system. However, when we tried using this storage system in Pig scripts, we saw errors indicating that our configuration parameters were not available. Upon further debugging, we saw that the config files were added to the Configuration object as resources, but were part of defaultResources. However, in Main.java in the Pig source, we saw that the Configuration object was created as "Configuration conf = new Configuration(false);", thereby setting loadDefaults to false in the conf object. As a result, properties from the default resources (including my config files) were not loaded and hence were unavailable.
>>>>> >
>>>>> > We solved the problem by using Configuration.addResource instead of Configuration.addDefaultResource, but still could not figure out why Pig does not use default resources.
>>>>> >
>>>>> > Could someone on the list explain why this is the case?
>>>>> >
>>>>> > Thanks,
>>>>> > --
>>>>> > Bhooshan
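The loadDefaults behavior discussed in the thread can be sketched with a small, self-contained mock. This is not Hadoop source: the class below only mimics the interaction between the static defaultResources list and the loadDefaults constructor flag of org.apache.hadoop.conf.Configuration, to show why resources registered via addDefaultResource are invisible to a Configuration created with "new Configuration(false)" as in Pig's Main.java.

```java
// Simplified mock (NOT Hadoop source) of Configuration's default-resource handling.
// Resources registered through the static addDefaultResource() are only consulted
// by instances created with loadDefaults=true.
import java.util.ArrayList;
import java.util.List;

public class ConfDemo {
    // Shared across all instances, like Hadoop's static defaultResources list.
    static final List<String> defaultResources = new ArrayList<>();

    static void addDefaultResource(String name) {
        defaultResources.add(name);
    }

    final boolean loadDefaults;
    final List<String> resources = new ArrayList<>();

    ConfDemo(boolean loadDefaults) {
        this.loadDefaults = loadDefaults;
    }

    void addResource(String name) {
        resources.add(name);
    }

    // The files whose properties would actually be loaded by this instance.
    List<String> effectiveResources() {
        List<String> out = new ArrayList<>();
        if (loadDefaults) {
            out.addAll(defaultResources); // skipped entirely when loadDefaults=false
        }
        out.addAll(resources);
        return out;
    }

    public static void main(String[] args) {
        // What MyFileSystem's static block does:
        addDefaultResource("myfs-default.xml");
        addDefaultResource("myfs-site.xml");

        ConfDemo fsShell = new ConfDemo(true);   // Hadoop CLI / MR path: sees the defaults
        ConfDemo pigMain = new ConfDemo(false);  // Pig's Main.java path: ignores them
        pigMain.addResource("core-site.xml");    // Pig then adds only specific resources

        System.out.println("loadDefaults=true:  " + fsShell.effectiveResources());
        System.out.println("loadDefaults=false: " + pigMain.effectiveResources());
    }
}
```

Run as written, the first line lists the myfs resources and the second lists only core-site.xml, which matches the thread's observation that properties become visible to Pig only when moved into core-site.xml (or added with addResource on the conf Pig actually submits).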

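For reference, the scheme registration described in step 2 of the thread would look roughly like the following core-site.xml fragment. The package "com.example" is a hypothetical placeholder; the thread only gives the class name MyFileSystem.

```xml
<!-- core-site.xml: map myfs:// URIs to the custom FileSystem implementation.
     The package "com.example" is a hypothetical placeholder. -->
<property>
  <name>fs.myfs.impl</name>
  <value>com.example.MyFileSystem</value>
</property>
```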