Sebastian Nagel created NUTCH-2281:
--------------------------------------

             Summary: Support non-default FileSystem
                 Key: NUTCH-2281
                 URL: https://issues.apache.org/jira/browse/NUTCH-2281
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.12
            Reporter: Sebastian Nagel
             Fix For: 1.13


If a path (input or output) does not belong to the configured default 
FileSystem, various Nutch tools may raise an exception like
{noformat}
  Exception in ... java.lang.IllegalArgumentException: Wrong FS: s3a://..., expected: hdfs://...
{noformat}

This is fixed by getting a reference to the FileSystem from the Path object
{noformat}
  FileSystem fs = path.getFileSystem(getConf());
{noformat}
instead of
{noformat}
  FileSystem fs = FileSystem.get(getConf());
{noformat}
A given path (e.g., {{s3a://...}}) may not belong to the default file system 
({{hdfs://}} or {{file://}} in local mode), and then even simple checks such as 
{{fs.exists(path)}} fail on a FileSystem obtained via {{FileSystem.get(conf)}}. Cf. 
[FileSystem.checkPath(path)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#checkPath(org.apache.hadoop.fs.Path)],
 and 
[FileSystem.get(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(org.apache.hadoop.conf.Configuration)]
 vs. 
[FileSystem.get(URI,conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,%20org.apache.hadoop.conf.Configuration)]
 which is called by 
[Path.getFileSystem(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/Path.html#getFileSystem%28org.apache.hadoop.conf.Configuration%29].
  
Note that the FileSystem for input and output may be different, e.g., read from 
HDFS and write to S3.
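
To illustrate why the two lookups behave differently, here is a small sketch in plain Java that models the check. It is not Hadoop code: {{ModelFs}}, {{fromDefault}} and {{fromPath}} are hypothetical stand-ins for a FileSystem, {{FileSystem.get(conf)}} and {{path.getFileSystem(conf)}}, and the host, bucket and paths are made up. The comparison of scheme and authority is a simplification of what Hadoop's {{checkPath}} does.

```java
import java.net.URI;
import java.util.Objects;

public class WrongFsModel {

  /** Hypothetical stand-in for a file system, identified by scheme and authority only. */
  static final class ModelFs {
    final String scheme;
    final String authority; // may be null, e.g. for file:///

    ModelFs(URI uri) {
      this.scheme = uri.getScheme();
      this.authority = uri.getAuthority();
    }

    /** Simplified check: throws if the path names a different scheme or authority. */
    void checkPath(URI path) {
      if (path.getScheme() == null) {
        return; // a path without a scheme is taken to belong to this file system
      }
      if (!path.getScheme().equalsIgnoreCase(scheme)
          || !Objects.equals(path.getAuthority(), authority)) {
        throw new IllegalArgumentException("Wrong FS: " + path + ", expected: "
            + scheme + "://" + (authority == null ? "" : authority));
      }
    }
  }

  /** Models FileSystem.get(conf): always the configured default file system. */
  static ModelFs fromDefault(URI defaultFs) {
    return new ModelFs(defaultFs);
  }

  /** Models path.getFileSystem(conf): chosen from the path, default as fallback. */
  static ModelFs fromPath(URI path, URI defaultFs) {
    return new ModelFs(path.getScheme() == null ? defaultFs : path);
  }

  public static void main(String[] args) {
    // All URIs below are invented for the example.
    URI defaultFs = URI.create("hdfs://namenode:8020");
    URI input = URI.create("hdfs://namenode:8020/crawl/segments");
    URI output = URI.create("s3a://bucket/crawl/dump");

    // Resolving the file system per path passes for both input and output,
    // even though they live on different file systems.
    fromPath(input, defaultFs).checkPath(input);
    fromPath(output, defaultFs).checkPath(output);

    // Using the default file system for the S3 path is rejected.
    try {
      fromDefault(defaultFs).checkPath(output);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In the model, as in the proposed fix, each path gets its own file system object, so input and output never have to share one.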



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
