input file from tar.gz

2015-09-29 Thread Peter Rudenko
Hi, I have a huge tar.gz file on HDFS. The archive contains several files, but I want to use only one of them as input. Is it possible to filter the contents of a tar.gz somehow, with something like this: sc.textFile("hdfs:///data/huge.tar.gz#input.txt") Thanks, Peter Rudenko

Re: input file from tar.gz

2015-09-29 Thread Ted Yu
The '#' syntax is not supported natively by HDFS. YARN resource localization supports such a notion; see http://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html I'm not sure about Spark.
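
One possible workaround, offered as a sketch and not as something confirmed in this thread, is to read the archive as raw bytes and pick out the wanted member with Python's standard tarfile module. The helper names read_member_lines and make_tar_gz below are hypothetical; make_tar_gz exists only to build a small in-memory archive for the demonstration. The commented-out Spark call assumes PySpark's sc.binaryFiles, which yields (path, bytes) pairs. Note that a gzip stream is not splittable, so the whole archive would be handled by a single task and must fit in that task's memory, which may rule this out for a truly huge file.

```python
import io
import tarfile


def read_member_lines(archive_bytes, member_name, encoding="utf-8"):
    """Return the text lines of one member of a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        # extractfile raises KeyError if the member does not exist and
        # returns None if the member is not a regular file or link.
        member = archive.extractfile(member_name)
        if member is None:
            raise ValueError(member_name + " is not a regular file")
        return member.read().decode(encoding).splitlines()


def make_tar_gz(files):
    """Hypothetical demo helper: build a tar.gz in memory from {name: text}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Local demonstration with a tiny archive holding two files.
demo = make_tar_gz({"input.txt": "first\nsecond\n", "other.txt": "ignored\n"})
print(read_member_lines(demo, "input.txt"))

# Assumed Spark usage (not run here; paths taken from the question):
# lines = (sc.binaryFiles("hdfs:///data/huge.tar.gz")
#            .flatMap(lambda kv: read_member_lines(kv[1], "input.txt")))
```

If the archive is too large for this approach, extracting the single file outside Spark and writing it back to HDFS uncompressed would let sc.textFile split it normally.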