Could you wrap the ZipInputStream in a List? flatMap requires its function to return a subtype of TraversableOnce[?]:
case (name, content) => List(new ZipInputStream(content.open))

Xinh

On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Hi Sabarish,
>
> I found a similar posting online where I should use the S3 listKeys.
> http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd.
> Is this what you were thinking?
>
> And, your assumption is correct. The zipped CSV file contains only a
> single file. I found this posting.
> http://stackoverflow.com/questions/28969757/zip-support-in-apache-spark.
> I see how to do the unzipping, but I cannot get it to work when running the
> code directly.
>
> ...
> import java.io.{ IOException, FileOutputStream, FileInputStream, File }
> import java.util.zip.{ ZipEntry, ZipInputStream }
> import org.apache.spark.input.PortableDataStream
>
> sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
>
> val zipFile = "s3n://events/2016/03/01/00/event-20160301.000000-4877ff81-928f-4da4-89b6-6d40a28d61c7.csv.zip"
> val zipFileRDD = sc.binaryFiles(zipFile).flatMap { case (name: String,
>   content: PortableDataStream) => new ZipInputStream(content.open) }
>
> <console>:95: error: type mismatch;
>  found   : java.util.zip.ZipInputStream
>  required: TraversableOnce[?]
> val zipFileRDD = sc.binaryFiles(zipFile).flatMap { case (name,
>   content) => new ZipInputStream(content.open) }
>
> ^
>
> Thanks,
> Ben
>
> On Mar 9, 2016, at 12:03 AM, Sabarish Sasidharan <sabarish....@gmail.com>
> wrote:
>
> You can use S3's listKeys API and do a diff between consecutive listKeys
> to identify what's new.
>
> Are there multiple files in each zip? Single file archives are processed
> just like text as long as it is one of the supported compression formats.
>
> Regards
> Sab
>
> On Wed, Mar 9, 2016 at 10:33 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> I am wondering if anyone can help.
>>
>> Our company stores zipped CSV files in S3, which has been a big headache
>> from the start. I was wondering if anyone has created a way to iterate
>> through several subdirectories (s3n://events/2016/03/01/00,
>> s3n://2016/03/01/01, etc.) in S3 to find the newest files and load them.
>> It would be a big bonus to include the unzipping of the file in the process
>> so that the CSV can be loaded directly into a dataframe for further
>> processing. I’m pretty sure that the S3 part of this request is not
>> uncommon. I would think the file being zipped is uncommon. If anyone can
>> help, I would truly be grateful for I am new to Scala and Spark. This would
>> be a great help in learning.
>>
>> Thanks,
>> Ben
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
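
One caveat on the List suggestion at the top of this reply: it satisfies the type checker, but the result is an RDD of ZipInputStream objects rather than CSV lines. Since the archive holds a single file, it may be simpler to read that entry inside the flatMap and return its lines. Below is a minimal sketch of that idea, not a tested Spark job. The names ZipLines, firstEntryLines and zipOf are made up for illustration, UTF-8 is an assumed encoding, and the in-memory archive only stands in for the S3 object so the helper can be run without a cluster.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.io.Source

object ZipLines {
  // Hypothetical helper (not from the thread): returns the lines of the first
  // entry in a zip stream, or an empty list if the stream contains no entry.
  // Assumes the entry is UTF-8 text.
  def firstEntryLines(in: InputStream): List[String] = {
    val zis = new ZipInputStream(in)
    try {
      if (zis.getNextEntry == null) Nil
      // ZipInputStream reports end-of-stream at the end of the current entry,
      // so only the first entry is read. toList forces the read before close().
      else Source.fromInputStream(zis, "UTF-8").getLines().toList
    } finally zis.close()
  }

  // Builds a one-entry zip archive in memory, standing in for the file on S3.
  def zipOf(entryName: String, text: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(bytes)
    zos.putNextEntry(new ZipEntry(entryName))
    zos.write(text.getBytes(StandardCharsets.UTF_8))
    zos.closeEntry()
    zos.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val archive = zipOf("events.csv", "id,name\n1,a\n2,b\n")
    firstEntryLines(new ByteArrayInputStream(archive)).foreach(println)
  }
}
```

In the shell, the helper would presumably be used as sc.binaryFiles(zipFile).flatMap { case (_, content) => ZipLines.firstEntryLines(content.open) }, which should yield an RDD[String] of CSV lines. Note that this reads each archive fully into memory on one executor, so it is only a reasonable fit when the individual zip files are small.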