Spark doesn't support reading zip files directly, since a zip archive is
not a splittable (distributable) file.
Read it with the java.util.zip.ZipInputStream API and build the RDD from
that (4 GB limit):
import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.input.PortableDataStream

val zipPath = "s3://.... ABC.zip"
// binaryFiles yields one (path, stream) pair per archive
val rdd = sc.binaryFiles(zipPath).flatMap { case (_, stream: PortableDataStream) =>
  val zipStream = new ZipInputStream(stream.open())
  // Position the stream at the first entry; only that entry is read
  val entry = zipStream.getNextEntry
  if (entry == null) Iterator.empty
  else Source.fromInputStream(zipStream, "ISO_8859_1").getLines()
}
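
The snippet above reads only the first entry of the archive. If the zip
holds several files, something along these lines might work. It is a
minimal sketch, not tested on a cluster: ZipLines and allLines are
hypothetical names, and it assumes every entry is text in the same
charset. It uses only the JDK, so it can be tried without Spark.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.io.Source

// Hypothetical helper, not part of Spark or the JDK.
object ZipLines {
  // Lazily yields the lines of every non-directory entry, in archive order.
  def allLines(in: InputStream, charset: String = "ISO-8859-1"): Iterator[String] = {
    val zipStream = new ZipInputStream(in)
    Iterator
      .continually(zipStream.getNextEntry) // advance to the next entry
      .takeWhile(_ != null)                // null marks the end of the archive
      .filterNot(_.isDirectory)
      // read() returns -1 at the end of each entry, so getLines() stops there
      .flatMap(_ => Source.fromInputStream(zipStream, charset).getLines())
  }
}
```

Inside Spark it would presumably be called as
sc.binaryFiles(zipPath).flatMap { case (_, s) => ZipLines.allLines(s.open()) }.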
If the zip file is larger than 4 GB, use ZipArchiveInputStream from
Apache Commons Compress in place of ZipInputStream:
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream