Re: File list read into single RDD

2014-05-21 Thread Pat Ferrel
Thanks, this really helps. As long as I stick to HDFS paths and files, I’m good. I do know that code a bit, but have never used it to, say, take input from one cluster via “hdfs://server:port/path” and output to another via “hdfs://another-server:another-port/path”. This seems to be supported by
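
For illustration, a minimal sketch of what that cross-cluster read and write might look like, assuming both namenodes are reachable from the job (the hostnames here echo the placeholders above; the port 8020 is an assumed default, not from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: fully qualified hdfs:// URIs let a single job read from one
    // cluster and write to another. Hostnames and ports are placeholders.
    val sc = new SparkContext(new SparkConf().setAppName("cross-cluster-copy"))
    val lines = sc.textFile("hdfs://server:8020/path")
    lines.saveAsTextFile("hdfs://another-server:8020/path")
    sc.stop()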

Re: File list read into single RDD

2014-05-18 Thread Pat Ferrel
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since Spark supports several FS schemes, I’m unclear about how much to assume about using the Hadoop FileSystem APIs and conventions. Concretely, if I pass a pattern in with an HTTPS file system, will the pattern work?

Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456) method delegates to sc.hadoopFile(), which uses Hadoop's
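
Since the path handling ultimately comes from Hadoop's FileInputFormat, both comma-separated file lists and glob patterns work with sc.textFile(), which is exactly the "file list into a single RDD" case from the subject line. A short sketch (the paths are hypothetical):

    // One RDD from an explicit comma-separated list of files:
    val listed = sc.textFile("hdfs://nn:8020/data/a.txt,hdfs://nn:8020/data/b.txt")

    // One RDD from a glob pattern expanded by FileInputFormat:
    val globbed = sc.textFile("hdfs://nn:8020/logs/2014-05-*/part-*")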

Re: File list read into single RDD

2014-04-28 Thread Nicholas Chammas
Not that I know of. We were discussing it on another thread and it came up. I think if you look up the Hadoop FileInputFormat API (which Spark uses), you'll see it mentioned there in the docs: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html But that's not
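
For reference, FileInputFormat's path expansion goes through Hadoop's FileSystem.globStatus(), which supports the usual glob characters, so patterns like the following should work (the paths are hypothetical):

    // *      any sequence of characters     ?      any single character
    // [a-b]  character range                {x,y}  alternation
    val rdd = sc.textFile("hdfs://nn:8020/data/2014-0[1-5]/{events,clicks}/part-*")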