Re: Read multiple files from S3

2015-05-21 Thread Akhil Das
textFile does read all the files in a directory.
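
For example (a quick sketch; the bucket and prefix below are placeholders),
you can point it at a whole directory or at a glob pattern:

// read everything under a prefix, or only the matching files via a glob
val all  = sc.textFile("s3n://my-bucket/logs/")
val logs = sc.textFile("s3n://my-bucket/logs/*.log")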

We have modified the Spark Streaming code base to read nested files from S3;
you can check this function
<https://github.com/sigmoidanalytics/spark-modified/blob/8074620414df6bbed81ac855067600573a7b22ca/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L206>
which does that, and implement something similar for your use case.
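
Something like the following would do it for a batch job (an untested sketch
using the plain Hadoop FileSystem API, not the code from the link; the path
is a placeholder):

import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively collect every file under a root, then read them all at once.
def listRecursive(fs: FileSystem, path: Path): Seq[Path] =
  fs.listStatus(path).toSeq.flatMap { status =>
    if (status.isDirectory) listRecursive(fs, status.getPath)
    else Seq(status.getPath)
  }

val root = new Path("s3n://my-bucket/logs")           // placeholder path
val fs = root.getFileSystem(sc.hadoopConfiguration)
val rdd = sc.textFile(listRecursive(fs, root).mkString(","))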

Or, if your job is just a batch job and you don't mind processing the files
one by one, then maybe you can iterate over your list, call sc.textFile for
each file entry, and do the computing there too. Something like:

for (file <- fileNames) {
  // create the context (conf is your SparkConf)
  val sc = new SparkContext(conf)
  // read this file and run your computation on it
  val lines = sc.textFile(file)
  // ... do your computing here ...
  sc.stop()
}
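
Note that textFile also accepts a comma-separated list of paths, so if you
don't need the per-file handling you can keep a single SparkContext and read
the whole list into one RDD. A rough sketch, assuming fileNames holds full
s3n:// paths rather than bare keys:

val sc = new SparkContext(conf)
// one RDD covering all the files at once
val allFiles = sc.textFile(fileNames.mkString(","))
// ... do your computing ...
sc.stop()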



Thanks
Best Regards

On Thu, May 21, 2015 at 1:45 AM, lovelylavs wrote:

> Hi,
>
> I am trying to get a collection of files from S3 according to
> LastModifiedDate:
>
> List<String> FileNames = new ArrayList<String>();
>
> ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
>         .withBucketName(s3_bucket)
>         .withPrefix(logs_dir);
>
> ObjectListing objectListing;
>
> // page through the listing until all objects have been seen
> do {
>     objectListing = s3Client.listObjects(listObjectsRequest);
>     for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
>         // keep .log keys modified within (dayBefore, dayAfter]
>         if (objectSummary.getLastModified().compareTo(dayBefore) > 0
>                 && objectSummary.getLastModified().compareTo(dayAfter) < 1
>                 && objectSummary.getKey().contains(".log")) {
>             FileNames.add(objectSummary.getKey());
>         }
>     }
>     listObjectsRequest.setMarker(objectListing.getNextMarker());
> } while (objectListing.isTruncated());
>
> I would like to process these files using Spark.
>
> I understand that textFile reads a single text file. Is there any way to
> read all these files that are part of the List?
>
> Thanks for your help.


Read multiple files from S3

2015-05-20 Thread lovelylavs
Hi,

I am trying to get a collection of files from S3 according to
LastModifiedDate:

List<String> FileNames = new ArrayList<String>();

ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName(s3_bucket)
        .withPrefix(logs_dir);

ObjectListing objectListing;

// page through the listing until all objects have been seen
do {
    objectListing = s3Client.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        // keep .log keys modified within (dayBefore, dayAfter]
        if (objectSummary.getLastModified().compareTo(dayBefore) > 0
                && objectSummary.getLastModified().compareTo(dayAfter) < 1
                && objectSummary.getKey().contains(".log")) {
            FileNames.add(objectSummary.getKey());
        }
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());

I would like to process these files using Spark.

I understand that textFile reads a single text file. Is there any way to
read all these files that are part of the List?

Thanks for your help.




