textFile can also read all the files in a directory if you pass it a directory path, and it accepts comma-separated lists of paths and glob patterns as well.
We have modified the Spark Streaming code base to read nested files from S3;
you can check this function
<https://github.com/sigmoidanalytics/spark-modified/blob/8074620414df6bbed81ac855067600573a7b22ca/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L206>
which does that, and implement something similar for your use case.
Or, if your job is just a batch job and you don't mind processing the data
file by file, then you can iterate over your list, create an sc.textFile
for each file entry, and do the computation there. Something like:
// create the SparkContext once, outside the loop
val sc = new SparkContext(conf)
for (file <- fileNames) {
  val rdd = sc.textFile(file)
  // do your computation on rdd here
}
// stop the context only after all files are processed
sc.stop()
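Alternatively, since sc.textFile accepts a comma-separated list of paths, you could join your keys into a single string and read everything in one call. A minimal sketch (the bucket name "my-bucket" and the helper toTextFilePath are placeholders, not part of your code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class JoinPaths {

    // Build the comma-separated path string that sc.textFile accepts,
    // turning each S3 key into a full s3n:// URL for the given bucket.
    static String toTextFilePath(List<String> keys, String bucket) {
        return keys.stream()
                   .map(k -> "s3n://" + bucket + "/" + k)
                   .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("logs/a.log", "logs/b.log");
        String path = toTextFilePath(keys, "my-bucket");
        System.out.println(path);
        // prints: s3n://my-bucket/logs/a.log,s3n://my-bucket/logs/b.log
        // then, on the Spark side:
        //   JavaRDD<String> lines = sc.textFile(path);
    }
}
```

This gives you a single RDD over all the selected files instead of one RDD per file.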
Thanks
Best Regards
On Thu, May 21, 2015 at 1:45 AM, lovelylavs wrote:
> Hi,
>
> I am trying to get a collection of files according to LastModifiedDate from
> S3
>
> List<String> fileNames = new ArrayList<>();
>
> ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
>         .withBucketName(s3_bucket)
>         .withPrefix(logs_dir);
>
> ObjectListing objectListing;
>
> do {
>     objectListing = s3Client.listObjects(listObjectsRequest);
>     for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
>         if (objectSummary.getLastModified().compareTo(dayBefore) > 0
>                 && objectSummary.getLastModified().compareTo(dayAfter) < 1
>                 && objectSummary.getKey().contains(".log")) {
>             fileNames.add(objectSummary.getKey());
>         }
>     }
>     listObjectsRequest.setMarker(objectListing.getNextMarker());
> } while (objectListing.isTruncated());
>
> I would like to process these files using Spark
>
> I understand that textFile reads a single text file. Is there any way to
> read all these files that are part of the List?
>
> Thanks for your help.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Read-multiple-files-from-S3-tp22965.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.