Re: S3 Zip File Loading Advice

Xinh Huynh Wed, 09 Mar 2016 10:11:37 -0800

Could you wrap the ZipInputStream in a List, since a subtype of
TraversableOnce[?] is required?


case (name, content) => List(new ZipInputStream(content.open))
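
Note that this only satisfies the type checker: you would end up with an RDD of unread ZipInputStreams. To get at the CSV lines, each entry still has to be read. Here is a minimal sketch, assuming the entries are UTF-8 text. zipLines is a hypothetical helper name of mine, not something from Spark or the linked posts, and the in-memory sample archive is only there so the snippet runs without S3:

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Hypothetical helper (not part of Spark): returns the text lines of every
// non-directory entry in a zip stream. Assumes UTF-8 content.
def zipLines(in: InputStream): List[String] = {
  val zis = new ZipInputStream(in)
  try {
    Iterator
      .continually(zis.getNextEntry)   // advance to the next entry
      .takeWhile(_ != null)            // null means no more entries
      .filterNot(_.isDirectory)
      .flatMap { _ =>
        // ZipInputStream signals end-of-stream at the end of the current
        // entry, so readLine returns null once the entry is exhausted.
        val reader = new BufferedReader(new InputStreamReader(zis, StandardCharsets.UTF_8))
        Iterator.continually(reader.readLine()).takeWhile(_ != null)
      }
      .toList                          // materialize before the stream is closed
  } finally zis.close()
}

// Build a small single-entry archive in memory to stand in for the S3 object.
val buffer = new ByteArrayOutputStream()
val zos = new ZipOutputStream(buffer)
zos.putNextEntry(new ZipEntry("events.csv"))
zos.write("id,name\n1,a\n".getBytes(StandardCharsets.UTF_8))
zos.closeEntry()
zos.close()

val sampleLines = zipLines(new ByteArrayInputStream(buffer.toByteArray))
```

With that in place, something like sc.binaryFiles(zipFile).flatMap { case (_, content) => zipLines(content.open) } should give you an RDD[String] of lines, though I have not run it against your bucket. Keep in mind that it reads each archive fully into memory on a single executor.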

Xinh

On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Hi Sabarish,
>
> I found a similar posting online suggesting that I use the S3 listKeys API:
> http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd
> Is this what you were thinking?
>
> And, your assumption is correct. The zipped CSV file contains only a
> single file. I found this posting.
> http://stackoverflow.com/questions/28969757/zip-support-in-apache-spark.
> I see how to do the unzipping, but I cannot get it to work when running the
> code directly.
>
> ...
> import java.io.{ IOException, FileOutputStream, FileInputStream, File }
> import java.util.zip.{ ZipEntry, ZipInputStream }
> import org.apache.spark.input.PortableDataStream
>
>
> sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
>
> val zipFile = "s3n://events/2016/03/01/00/event-20160301.000000-4877ff81-928f-4da4-89b6-6d40a28d61c7.csv.zip"
> val zipFileRDD = sc.binaryFiles(zipFile).flatMap {
>   case (name: String, content: PortableDataStream) => new ZipInputStream(content.open)
> }
>
> <console>:95: error: type mismatch;
>  found   : java.util.zip.ZipInputStream
>  required: TraversableOnce[?]
>          val zipFileRDD = sc.binaryFiles(zipFile).flatMap { case (name,
> content) => new ZipInputStream(content.open) }
>
>                                                       ^
>
> Thanks,
> Ben
>
> On Mar 9, 2016, at 12:03 AM, Sabarish Sasidharan <sabarish....@gmail.com>
> wrote:
>
> You can use S3's listKeys API and do a diff between consecutive listKeys
> to identify what's new.
>
> Are there multiple files in each zip? Single-file archives are processed
> just like text, as long as they use one of the supported compression formats.
>
> Regards
> Sab
>
> On Wed, Mar 9, 2016 at 10:33 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> I am wondering if anyone can help.
>>
>> Our company stores zipped CSV files in S3, which has been a big headache
>> from the start. I was wondering if anyone has created a way to iterate
>> through several subdirectories (s3n://events/2016/03/01/00,
>> s3n://2016/03/01/01, etc.) in S3 to find the newest files and load them.
>> It would be a big bonus to include the unzipping of the file in the process
>> so that the CSV can be loaded directly into a dataframe for further
>> processing. I’m pretty sure that the S3 part of this request is not
>> uncommon. I would think the file being zipped is uncommon. If anyone can
>> help, I would truly be grateful for I am new to Scala and Spark. This would
>> be a great help in learning.
>>
>> Thanks,
>> Ben
