textFile does read all the files in a directory.
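
For example, a minimal sketch (the bucket and prefix here are made-up placeholders): a single call picks up every file directly under a directory, and glob patterns work as well:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("read-s3"))

// every file directly under the prefix
val allLogs = sc.textFile("s3n://my-bucket/logs/")

// or only the matching files, via a glob
val dayLogs = sc.textFile("s3n://my-bucket/logs/2015-05-21-*.log")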

We have modified the Spark Streaming code base to read nested files from S3;
you can check this function
<https://github.com/sigmoidanalytics/spark-modified/blob/8074620414df6bbed81ac855067600573a7b22ca/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L206>
which does that, and implement something similar for your use case.
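
For reference, a rough sketch of that idea in plain Scala, listing every key
under a prefix with the AWS SDK (listKeys is a hypothetical helper; S3
listings are flat, so "nested" files are just keys with more slashes in them):

import com.amazonaws.services.s3.AmazonS3Client
import scala.collection.JavaConverters._

def listKeys(s3: AmazonS3Client, bucket: String, prefix: String): Seq[String] = {
  val keys = scala.collection.mutable.Buffer[String]()
  var listing = s3.listObjects(bucket, prefix)
  keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
  // page through truncated listings until every key has been seen
  while (listing.isTruncated) {
    listing = s3.listNextBatchOfObjects(listing)
    keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
  }
  keys.toSeq
}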

Or, if your job is just a batch job and you don't mind processing file by
file, then maybe you can iterate over your list, create an sc.textFile for
each file entry, and do the computing there too. Something like:

import org.apache.spark.{SparkConf, SparkContext}

// create the SparkContext once, outside the loop; only one active
// context is allowed per JVM
val sc = new SparkContext(new SparkConf().setAppName("s3-batch"))

for (file <- fileNames) {
  val lines = sc.textFile(file)
  // do your computing on `lines` here
}

sc.stop()
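
Alternatively, textFile itself accepts a comma-separated list of paths, so if
you don't need per-file handling you can read the entire list into one RDD
(assuming fileNames holds fully qualified s3n:// paths):

val all = sc.textFile(fileNames.mkString(","))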



Thanks
Best Regards

On Thu, May 21, 2015 at 1:45 AM, lovelylavs <lxn130...@utdallas.edu> wrote:

> Hi,
>
> I am trying to get a collection of files according to LastModifiedDate from
> S3
>
>     // collect keys whose LastModified falls in the window and that
>     // look like log files, paging through the listing as needed
>     List<String> FileNames = new ArrayList<String>();
>
>     ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
>             .withBucketName(s3_bucket)
>             .withPrefix(logs_dir);
>
>     ObjectListing objectListing;
>
>     do {
>         objectListing = s3Client.listObjects(listObjectsRequest);
>         for (S3ObjectSummary objectSummary :
>                 objectListing.getObjectSummaries()) {
>             if (objectSummary.getLastModified().compareTo(dayBefore) > 0
>                     && objectSummary.getLastModified().compareTo(dayAfter) < 1
>                     && objectSummary.getKey().contains(".log")) {
>                 FileNames.add(objectSummary.getKey());
>             }
>         }
>         listObjectsRequest.setMarker(objectListing.getNextMarker());
>     } while (objectListing.isTruncated());
>
> I would like to process these files using Spark.
>
> I understand that textFile reads a single text file. Is there any way to
> read all these files that are part of the List?
>
> Thanks for your help.