[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-09-21 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913154#action_12913154
 ] 

Alan Gates commented on PIG-1337:
-

The problem with allowing load and store functions access to the config file is 
that the config file they see is not the config file that goes to Hadoop. This 
is not entirely Pig's fault (see the comments above on this). The other problem 
is that multiple instances of the same load or store function may be operating 
in a given script, so there are namespace issues to resolve.

The proposal for Hadoop 0.22 is that, rather than providing access to the 
config file at all, Hadoop will serialize objects such as InputFormat and 
OutputFormat and pass those to the backend. It will make sense for Pig to 
follow suit and serialize all UDFs on the front end. This will remove the need 
for the UDFContext black magic that we do at the moment and should allow all 
UDFs to easily transfer information from the front end to the backend.

So, hopefully this can get resolved when Pig migrates to Hadoop 0.22, whenever 
that is.
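The front-end serialization idea can be illustrated without Hadoop at all. The 
sketch below uses entirely hypothetical names: it serializes a configured 
loader instance to a base64 string, stashes it under a string key in a 
java.util.Properties object standing in for the job Configuration, and rebuilds 
the instance on the "backend" from that string alone:

```java
import java.io.*;
import java.util.Base64;
import java.util.Properties;

public class UdfSerializationSketch {

    // Hypothetical stand-in for a UDF configured on the front end.
    static class MyLoadFunc implements Serializable {
        private static final long serialVersionUID = 1L;
        final String cacheFile;
        MyLoadFunc(String cacheFile) { this.cacheFile = cacheFile; }
    }

    // Serialize the instance to a base64 string so it can ride in a
    // string-valued configuration entry.
    static String serialize(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            return Base64.getEncoder().encodeToString(bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuild the object from the string stored in the configuration.
    static Object deserialize(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();  // stand-in for the job Configuration
        MyLoadFunc frontEnd = new MyLoadFunc("/cache/lookup.dat");
        conf.setProperty("pig.udf.serialized", serialize(frontEnd));

        // "Backend": reconstruct the instance from the configuration alone.
        MyLoadFunc backEnd =
            (MyLoadFunc) deserialize(conf.getProperty("pig.udf.serialized"));
        System.out.println(backEnd.cacheFile);  // /cache/lookup.dat
    }
}
```

Anything the loader captured on the front end survives the round trip, which is 
what would make per-instance front-end-to-backend transfer work without a 
shared mutable context.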

 Need a way to pass distributed cache configuration information to hadoop 
 backend in Pig's LoadFunc
 --

 Key: PIG-1337
 URL: https://issues.apache.org/jira/browse/PIG-1337
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Chao Wang
 Fix For: 0.8.0

 The Zebra storage layer needs to use the distributed cache to reduce name node 
 load during job runs.
 To do this, Zebra needs to set up distributed cache related configuration 
 information in TableLoader (which extends Pig's LoadFunc).
 It is doing this within getSchema(conf). The problem is that the conf object 
 here is not the one that is serialized to the map/reduce backend. As such, 
 the distributed cache is not set up properly.
 To work around this problem, Pig's LoadFunc needs to provide a way to set up 
 distributed cache information in a conf object that is the one used by the 
 map/reduce backend.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-04-01 Thread Chao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852445#action_12852445
 ] 

Chao Wang commented on PIG-1337:


It's OK for us not to use getSchema() for this purpose, since it's a pure 
getter method.

What we need is simply a setter method in LoadFunc through which we can set up 
the distributed cache. Pig needs to ensure that this information actually ends 
up in the job configuration that is passed to the hadoop backend.
Also, this setter method should only be invoked on Pig's frontend. In the case 
of one m/r job containing multiple LoadFunc instances, Pig may need to combine 
the distributed cache configuration information from all instances.

Also, we note that using the UDFContext to convey information from frontend to 
backend does not work for this. We need the job configuration to already 
contain all the distributed cache related information when it is passed to the 
hadoop backend.
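The "combine from all instances" step could look roughly like the following, a 
Hadoop-free sketch that merges each loader's requested cache files into the one 
comma-separated configuration value; the map, the key name, and the method 
names are all hypothetical stand-ins:

```java
import java.util.*;

public class CacheFileCombiner {

    // All cache files share a single comma-separated configuration key,
    // so entries from every loader must be merged into one value.
    static final String CACHE_FILES_KEY = "mapred.cache.files";  // hypothetical stand-in

    // Merge the cache files requested by each LoadFunc instance into the
    // shared configuration, preserving order and dropping duplicates.
    static void combine(Map<String, String> conf, List<List<String>> perLoaderFiles) {
        LinkedHashSet<String> merged = new LinkedHashSet<>();
        for (List<String> files : perLoaderFiles) {
            merged.addAll(files);
        }
        conf.put(CACHE_FILES_KEY, String.join(",", merged));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();  // stand-in for the job conf
        combine(conf, List.of(
                List.of("/cache/a.dat", "/cache/b.dat"),
                List.of("/cache/b.dat", "/cache/c.dat")));  // overlaps with loader 1
        System.out.println(conf.get(CACHE_FILES_KEY));
        // /cache/a.dat,/cache/b.dat,/cache/c.dat
    }
}
```

Running the merge once on the frontend, before the configuration is handed to 
the backend, is exactly the ordering constraint described above.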







[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-04-01 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852485#action_12852485
 ] 

Pradeep Kamath commented on PIG-1337:
-

We may need to add a new method, addToDistributedCache(), on LoadFunc. Notice 
this is an adder, not a setter, since there is only one key for the distributed 
cache in hadoop's Job (the Configuration in the Job). So implementations of 
LoadFunc will have to use the DistributedCache.add*() methods.
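The adder semantics matter because everything lives under one comma-separated 
key: a setter called by one loader would clobber the entries added by another, 
whereas an adder appends. A Hadoop-free sketch of the distinction, with a plain 
Map standing in for the Configuration and all names hypothetical:

```java
import java.util.*;

public class DistributedCacheAdderSketch {

    static final String KEY = "mapred.cache.files";  // the single cache-files key (stand-in)

    // Adder semantics: append to the existing comma-separated value
    // instead of replacing it, so earlier loaders' entries survive.
    static void addCacheFile(Map<String, String> conf, String file) {
        String current = conf.get(KEY);
        conf.put(KEY, current == null ? file : current + "," + file);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        addCacheFile(conf, "/cache/zebra-schema.dat");  // first loader
        addCacheFile(conf, "/cache/lookup.dat");        // second loader appends
        System.out.println(conf.get(KEY));
        // /cache/zebra-schema.dat,/cache/lookup.dat
    }
}
```

A setter-style `conf.put(KEY, file)` in the second call would have silently 
dropped the first loader's file, which is the failure mode the adder avoids.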




[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-03-30 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851479#action_12851479
 ] 

Pradeep Kamath commented on PIG-1337:
-

My worry about doing these kinds of job-related updates in the Job in 
getSchema() is that getSchema() has been designed to be a pure getter without 
any indirect set side effects - this is noted in the javadoc:

{noformat}
/**
 * Get a schema for the data to be loaded.
 * @param location Location as returned by
 * {@link LoadFunc#relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)}
 * @param job The {@link Job} object - this should be used only to obtain
 * cluster properties through {@link Job#getConfiguration()} and not to
 * set/query any runtime job information.
...
{noformat}

We should be careful in opening this up to allow set capability - something to 
consider before designing a fix for this issue.




[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-03-29 Thread Chao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851105#action_12851105
 ] 

Chao Wang commented on PIG-1337:


This may also relate to https://issues.apache.org/jira/browse/MAPREDUCE-1620: 
Hadoop should serialize the Configuration to the backend after the call to 
getSplits(), so that any changes made to the Configuration in getSplits() are 
serialized to the backend.

But a cleaner solution from Pig's side is still worthwhile - then we can rely 
on Pig's frontend-only calls, like getSchema(), to do the setup job.

