Re: File list read into single RDD

2014-05-21 Thread Pat Ferrel
Thanks, this really helps.

As long as I stick to HDFS paths and files I’m good. I know that code a bit, but I have 
never used it to, say, take input from one cluster via 
“hdfs://server:port/path” and output to another via 
“hdfs://another-server:another-port/path”. This seems to be supported by Spark, 
so I’ll have to go back and look at how to do this with the HDFS API.

Specifically, I’ll need to examine the directory/file structure on one cluster 
and then check some things on what is potentially another cluster before writing 
output. I have usually assumed a single HDFS instance, so it may just be a matter 
of being more careful and preserving full URIs. In the past I may have assumed 
that output goes to the same directory tree as the input; maybe it’s a matter of 
being more scrupulous about that assumption.
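
Roughly what I have in mind, just as a sketch (the host names, ports, and paths 
below are made up), is to keep the full URI on both ends of the job:

  // read from cluster A and write to cluster B by preserving full hdfs:// URIs
  val events  = sc.textFile("hdfs://cluster-a-nn:8020/data/events/*.tsv")
  val cleaned = events.filter(_.nonEmpty)   // whatever processing happens in between
  cleaned.saveAsTextFile("hdfs://cluster-b-nn:8020/output/events-cleaned")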

It’s a bit hard to test this case since I have never really had access to two 
clusters, so I’ll have to develop some new habits at least.

On May 18, 2014, at 11:13 AM, Andrew Ash and...@andrewash.com wrote:

Spark's sc.textFile() method delegates to sc.hadoopFile(), which uses Hadoop's 
FileInputFormat.setInputPaths() call. There is no alternate storage system; 
Spark just delegates to Hadoop for the .textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so you 
can use Spark on data in S3 using s3:/// just the same as you would with HDFS.  
See Apache's documentation on S3 for more details.
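
For example (the bucket and paths are made up, and the exact S3 scheme, s3, s3n, 
or s3a, depends on your Hadoop version and configuration):

  // the same call works against different schemes
  val fromHdfs = sc.textFile("hdfs://namenode:8020/logs/2014/05/*.log")
  val fromS3   = sc.textFile("s3n://some-bucket/logs/2014/05/*.log")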

As far as interacting with a FileSystem (HDFS or other) to list files, delete 
files, navigate paths, etc. from your driver program, you should be able to 
just instantiate a FileSystem object and use the normal Hadoop APIs from there. 
The Apache getting-started docs on reading/writing from Hadoop DFS should work 
the same for non-HDFS file systems too.
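
A rough sketch of that (the URI and paths are made up, and this is only one way 
to do it), using nothing beyond the standard Hadoop FileSystem API from the 
driver:

  import java.net.URI
  import org.apache.hadoop.fs.{FileSystem, Path}

  // bind a FileSystem to a specific cluster by passing its URI explicitly
  val fs = FileSystem.get(new URI("hdfs://namenode:8020"), sc.hadoopConfiguration)

  // list the contents of a directory
  fs.listStatus(new Path("/data/incoming")).foreach(s => println(s.getPath))

  // check for, and remove, an old output directory before writing
  val out = new Path("/data/output/run-42")
  if (fs.exists(out)) fs.delete(out, true)   // true = recursive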

I do think we could use a little recipe in our documentation to make 
interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind sharing, we 
can format it for inclusion in future Spark docs.

Cheers!
Andrew


On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel pat.fer...@gmail.com wrote:
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since 
Spark supports several FS schemes, I’m unclear how much to assume about using 
the Hadoop file system APIs and conventions. Concretely, if I pass in a pattern 
with an HTTPS file system, will the pattern work?

How does Spark implement its storage system? This seems to be an abstraction 
level beyond what is available in HDFS. To preserve that flexibility, what APIs 
should I be using? It would be easy to say HDFS only and use the HDFS APIs, but 
that would seem to limit things, especially where you would like to read from 
one cluster and write to another. That is not so easy to do inside the HDFS 
APIs, or is advanced beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I’m OK, but if I need to examine 
the structure of the file system, I’m unclear how to do it without sacrificing 
Spark’s flexibility.
 
On Apr 29, 2014, at 12:55 AM, Christophe Préaud christophe.pre...@kelkoo.com 
wrote:

Hi,

You can also use any path pattern as defined here: 
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:
sc.textFile('{/path/to/file1,/path/to/file2}')
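
A few more patterns that globStatus (and therefore sc.textFile) accepts, with 
made-up paths:

  sc.textFile("/logs/2014-0[1-6]-*/part-*")        // character ranges and wildcards
  sc.textFile("/data/{2013,2014}/events-??.tsv")   // alternation and single-character wildcards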
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:
 Not that I know of. We were discussing it on another thread and it came up. 
 
 I think if you look up the Hadoop FileInputFormat API (which Spark uses) 
 you'll see it mentioned there in the docs. 
 
 http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
 
 But that's not obvious.
 
 Nick
 
 On Monday, April 28, 2014, Pat Ferrel pat.fer...@gmail.com wrote:
 Perfect. 
 
 BTW just so I know where to look next time, was that in some docs?
 
 On Apr 28, 2014, at 7:04 PM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 
 Yep, as I just found out, you can also provide 
 sc.textFile() with a comma-delimited string of all the files you want to load.
 
 For example:
 
 sc.textFile('/path/to/file1,/path/to/file2')
 So once you have your list of files, concatenate their paths like that and 
 pass the single string to 
 textFile().
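 
 So a sketch of the whole recipe for the original question (the root directory, 
 regex, and URI below are made up, and the little walk helper is just 
 illustrative) is to list the tree with the Hadoop FileSystem API, filter with 
 your regex, and hand textFile() one comma-joined string:

   import java.net.URI
   import org.apache.hadoop.fs.{FileSystem, Path}

   val fs = FileSystem.get(new URI("hdfs://namenode:8020"), sc.hadoopConfiguration)

   // recursively list files under a root dir, keeping only paths that match a regex
   val pattern = ".*part-\\d+\\.tsv$".r
   def walk(dir: Path): Seq[Path] = fs.listStatus(dir).toSeq.flatMap { s =>
     if (s.isDirectory) walk(s.getPath) else Seq(s.getPath)
   }
   val matching = walk(new Path("/data/root")).map(_.toString)
     .filter(p => pattern.findFirstIn(p).isDefined)

   // one comma-delimited string -> one RDD over all of the matched files
   val rdd = sc.textFile(matching.mkString(","))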
 
 Nick
 
 
 
 On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 sc.textFile(URI) supports reading multiple files in parallel, but only with a 
 wildcard. I need to walk a dir tree, match a regex to create a list of files, 
 and then I’d like to read them into a single RDD in parallel. I understand these 
 could go into separate RDDs and then a union RDD could be created. Is there a 
 way to create a single RDD from a URI list?
 
 




Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() method
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456)
delegates to sc.hadoopFile(), which uses Hadoop's
FileInputFormat.setInputPaths() call
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L546).
There is no alternate storage system; Spark just delegates to Hadoop for the
.textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so you
can use Spark on data in S3 using s3:/// just the same as you would with HDFS.
See Apache's documentation on S3 (https://wiki.apache.org/hadoop/AmazonS3) for
more details.

As far as interacting with a FileSystem (HDFS or other) to list files, delete
files, navigate paths, etc. from your driver program, you should be able to
just instantiate a FileSystem object and use the normal Hadoop APIs from there.
The Apache getting-started docs on reading/writing from Hadoop DFS
(https://wiki.apache.org/hadoop/HadoopDfsReadWriteExample) should work the same
for non-HDFS file systems too.

I do think we could use a little recipe in our documentation to make
interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind
sharing, we can format it for including in future Spark docs.

Cheers!
Andrew


