I've tried uploading a zip file that contains a CSV to HDFS and then
reading it into Spark using spark-shell, and the first line comes back
as garbled binary. However, when I upload a gzip file to HDFS and read
it into Spark, it works just fine. See the output below.

Is there a way to read a zip file as is from HDFS in Spark?


scala> val data = sc.textFile("hdfs://alexander1:9000/user/root/daily_42602_2014.csv.zip").cache
data: org.apache.spark.rdd.RDD[String] = MappedRDD[7] at textFile at <console>:12

scala> data.first
res6: String = PK????????� ?E����)�??���??? ?daily_42602_2014.csvUT
??�a�S�a�Sux ???�????�???��[� Ǖ� ���H�6ۻ�,�?~w�~?IH�.?�?
���V��J�?�t?Hg�}��
��̕1"3R*��d]DR�?��p��1��_��o�}���_�����_?~�{�����_y��_�����ݯ�_�?�����o��}y����x���?�����������'���?������?
�����_��}�)�����}������??�}(����|�<�������D��?�/û����������7��m����~����=�����s������Y����/�
����w �z����?�

scala> val data = sc.textFile("hdfs://alexander1:9000/user/root/daily_42602_2014.csv.gz").cache
data: org.apache.spark.rdd.RDD[String] = MappedRDD[9] at textFile at <console>:12

scala> data.first
res7: String = "State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude","Datum","Parameter Name","Sample Duration","Pollutant Standard","Date Local","Units of Measure","Event Type","Observation Count","Observation Percent","Arithmetic Mean","1st Max Value","1st Max Hour","AQI","Method Name","Local Site Name","Address","State Name","County Name","City Name","CBSA Name","Date of Last Change"
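
My guess at the cause: textFile picks a Hadoop compression codec from the
file extension, and Hadoop ships a gzip codec but, as far as I know, no zip
codec, so the .zip bytes are passed through undecoded (the "PK" at the start
of the first line is the zip signature). One possible workaround is to unpack
the archive myself with java.util.zip. The following is only a minimal
sketch, not a tested solution: ZipLines and readZipLines are hypothetical
names of my own, and the Spark wiring in the trailing comment assumes a
Spark version that has sc.binaryFiles (1.2 or later), which is newer than
the shell shown above.

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream
import scala.collection.mutable.ListBuffer

object ZipLines {
  // Hypothetical helper: returns the text lines of every file entry
  // in a zip stream, in archive order. Assumes UTF-8 encoded text.
  def readZipLines(in: InputStream): List[String] = {
    val zis = new ZipInputStream(in)
    val lines = ListBuffer.empty[String]
    try {
      var entry = zis.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory) {
          // The reader is deliberately not closed here: closing it would
          // also close the underlying zip stream before the next entry.
          val reader = new BufferedReader(
            new InputStreamReader(zis, StandardCharsets.UTF_8))
          var line = reader.readLine()
          while (line != null) {
            lines += line
            line = reader.readLine()
          }
        }
        entry = zis.getNextEntry
      }
    } finally {
      zis.close()
    }
    lines.toList
  }
}

// Assumed Spark usage (requires sc.binaryFiles, i.e. Spark 1.2+):
//   val data = sc.binaryFiles("hdfs://alexander1:9000/user/root/daily_42602_2014.csv.zip")
//     .flatMap { case (_, stream) => ZipLines.readZipLines(stream.open()) }
//   data.first
```

Note that with this approach each archive is read by a single task and its
lines are held in memory at once, so it probably only suits archives that
fit comfortably on one executor. A zip file cannot be split the way a plain
text file can.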
