[jira] [Commented] (PIG-5360) Pig sets working directory of input file systems causes exception thrown

2019-04-03 Thread Xuzhou Yin (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809237#comment-16809237
 ] 

Xuzhou Yin commented on PIG-5360:
-

This issue has been open for a while. It would be greatly appreciated if 
anyone could take a look. Thanks a lot!

> Pig sets working directory of input file systems causes exception thrown
> 
>
> Key: PIG-5360
> URL: https://issues.apache.org/jira/browse/PIG-5360
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.17.0
> Reporter: Xuzhou Yin
> Priority: Minor
> Labels: patch
> Fix For: 0.18.0
>
> Attachments: PIG-5360.diff
>
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> In the getSplits() method of PigInputFormat, Pig sets the working directory 
> of the input file system to jobContext.getWorkingDirectory(). That value is 
> always the default working directory of the default file system (e.g. 
> hdfs://host:port/user/userId in the case of HDFS) unless 
> "mapreduce.job.working.dir" is explicitly set to a non-default value. If the 
> input path uses a non-default file system, the call fails because it tries 
> to set the working directory of that file system to an HDFS path.
>
> The proposed change is to remove this working-directory logic completely. 
> There are several reasons for doing so.
>
> Firstly, getSplits() is only supposed to return a list of input splits. It 
> should not have side effects, especially ones that can change the output 
> path. Having an InputFormat change the behavior of an OutputFormat does not 
> make much sense here.
>
> Secondly, the working directories of the input and output file systems are 
> handled inconsistently. If "mapreduce.job.working.dir" is set to a 
> non-default value, it affects only the output path (and only if that path is 
> relative), because the input path has already been made fully qualified 
> before this logic runs.
>
> Thirdly, there is already a "CD" functionality that allows customers to 
> change the working directory. However, this logic overrides the effect of 
> "CD" when the input and output paths both use the default file system.
>
> Lastly, if a customer runs a sequence of jobs, changing the working 
> directory may change the input paths of downstream jobs when those paths are 
> specified as relative.
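
The two effects described above can be illustrated with a small, self-contained 
Java sketch. It does not use Hadoop or Pig classes: ToyFileSystem is a 
hypothetical stand-in that only models (a) a file system rejecting a working 
directory that belongs to a different file system and (b) relative paths being 
resolved against the current working directory. The host names, the bucket name 
and the "Wrong FS" message text are assumptions chosen for illustration, and the 
toy expects fully qualified URIs for working directories.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical stand-in for a file system client; NOT Hadoop's FileSystem.
final class ToyFileSystem {
    private final URI root;   // identifies the file system, e.g. hdfs://namenode:8020
    private URI workingDir;   // fully qualified directory on this file system

    ToyFileSystem(String root, String initialWorkingDir) {
        this.root = URI.create(root);
        this.workingDir = URI.create(initialWorkingDir);
    }

    // Rejects a directory whose scheme or authority differs from this file system.
    void setWorkingDirectory(String dir) {
        URI candidate = URI.create(dir);
        boolean sameFs = Objects.equals(root.getScheme(), candidate.getScheme())
                && Objects.equals(root.getAuthority(), candidate.getAuthority());
        if (!sameFs) {
            throw new IllegalArgumentException(
                    "Wrong FS: " + candidate + ", expected: " + root);
        }
        workingDir = candidate;
    }

    // Relative paths resolve against the working directory;
    // paths that already carry a scheme are returned unchanged.
    String qualify(String path) {
        String base = workingDir.toString();
        if (!base.endsWith("/")) {
            base += "/";
        }
        return URI.create(base).resolve(path).toString();
    }
}

final class WorkingDirDemo {
    // Mimics the step described in the issue: pushing the job's working
    // directory onto the file system of an input path.
    static String trySetWorkingDir(ToyFileSystem inputFs, String jobWorkingDir) {
        try {
            inputFs.setWorkingDirectory(jobWorkingDir);
            return "ok";
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String jobWorkingDir = "hdfs://namenode:8020/tmp/job"; // assumed value
        ToyFileSystem hdfs = new ToyFileSystem(
                "hdfs://namenode:8020", "hdfs://namenode:8020/user/alice");
        ToyFileSystem other = new ToyFileSystem("s3a://bucket", "s3a://bucket/");

        // Input on a non-default file system: the call is rejected.
        System.out.println(trySetWorkingDir(other, jobWorkingDir));

        // Input on the default file system: the call succeeds, and a relative
        // output path now resolves somewhere else as a side effect.
        System.out.println(hdfs.qualify("out"));
        System.out.println(trySetWorkingDir(hdfs, jobWorkingDir));
        System.out.println(hdfs.qualify("out"));

        // An already qualified input path is not affected.
        System.out.println(hdfs.qualify("hdfs://namenode:8020/data/in"));
    }
}
```

In this model, removing the setWorkingDirectory call from the input-side code 
path avoids both the rejection and the silent change to relative output paths, 
which is the direction the attached patch proposes.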



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5360) Pig sets working directory of input file systems causes exception thrown

2018-10-25 Thread Xuzhou Yin (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664129#comment-16664129
 ] 

Xuzhou Yin commented on PIG-5360:
-

Can someone review this patch? Thanks a lot!



