I tried uploading a zip file that contains a CSV to HDFS and then reading it into Spark using spark-shell, and the first line comes back garbled. However, when I upload a gzip file to HDFS and read it into Spark, it works fine. See the output below:
Is there a way to read a zip file as is from HDFS in Spark?

scala> val data = sc.textFile("hdfs://alexander1:9000/user/root/daily_42602_2014.csv.zip").cache
data: org.apache.spark.rdd.RDD[String] = MappedRDD[7] at textFile at <console>:12

scala> data.first
res6: String = PK????????� ?E����)�??���??? ?daily_42602_2014.csvUT ??�a�S�a�Sux ???�????�???��[� Ǖ� ���H�6ۻ�,�?~w�~?IH�.?�? ���V��J�?�t?Hg�}�� ��̕1"3R*��d]DR�?��p��1��_��o�}���_�����_?~�{�����_y��_�����ݯ�_�?�����o��}y����x���?�����������'���?������? �����_��}�)�����}������??�}(����|�<�������D��?�/û����������7��m����~����=�����s������Y����/� ����w �z����?�

scala> val data = sc.textFile("hdfs://alexander1:9000/user/root/daily_42602_2014.csv.gz").cache
data: org.apache.spark.rdd.RDD[String] = MappedRDD[9] at textFile at <console>:12

scala> data.first
res7: String = "State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude","Datum","Parameter Name","Sample Duration","Pollutant Standard","Date Local","Units of Measure","Event Type","Observation Count","Observation Percent","Arithmetic Mean","1st Max Value","1st Max Hour","AQI","Method Name","Local Site Name","Address","State Name","County Name","City Name","CBSA Name","Date of Last Change"
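
For context, the "PK" at the start of the first result is the zip file signature, so textFile appears to be returning the raw archive bytes: Hadoop selects a decompression codec by file extension and ships one for .gz but none for .zip. One possible workaround, sketched here under assumptions, is to unzip the stream manually with java.util.zip. ZipText is a hypothetical helper name, not a Spark API, and the sketch assumes UTF-8 text and an archive small enough to hold in memory in a single task. The commented usage additionally assumes a Spark version that provides sc.binaryFiles (1.2 or later, as far as I know).

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream

// Hypothetical helper (not part of Spark): returns the text lines of every
// entry in a zip stream, in archive order. Assumes UTF-8 encoded content.
object ZipText {
  def lines(in: InputStream): Vector[String] = {
    val zis = new ZipInputStream(in)
    val out = Vector.newBuilder[String]
    try {
      var entry = zis.getNextEntry
      while (entry != null) {
        // The reader is deliberately not closed here: closing it would
        // also close the underlying zip stream before later entries are read.
        val reader = new BufferedReader(
          new InputStreamReader(zis, StandardCharsets.UTF_8))
        var line = reader.readLine()
        while (line != null) {
          out += line
          line = reader.readLine()
        }
        entry = zis.getNextEntry
      }
    } finally {
      zis.close()
    }
    out.result()
  }
}

// Possible use in spark-shell (assumption: sc.binaryFiles is available):
// val data = sc.binaryFiles("hdfs://alexander1:9000/user/root/daily_42602_2014.csv.zip")
//   .flatMap { case (_, stream) => ZipText.lines(stream.open()) }
// data.first
```

Because a zip archive is not splittable, each archive would be processed by one task in this approach. For large files, converting to gzip or an uncompressed CSV before uploading may be the simpler route.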