Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
SPARK-8458 is in the 1.4.1 release.

You can upgrade to 1.4.1 or wait for the upcoming 1.5.0 release.
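
If it helps, a quick check from spark-shell shows which Spark version the shell
is actually running (just a suggestion, in case more than one install is
around):

  // prints the running Spark version, e.g. 1.4.0 vs. 1.4.1
  println(sc.version)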

On Sun, Aug 23, 2015 at 2:05 PM, lostrain A 
wrote:

> Hi Zhan,
>   Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0, and it
> looks like this is most likely the reason. I'll verify this again once we
> make the upgrade.
>
> Best,
> los
>
> On Sun, Aug 23, 2015 at 1:25 PM, Zhan Zhang 
> wrote:
>
>> If you are using spark-1.4.0, it is probably caused by SPARK-8458
>> 
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Aug 23, 2015, at 12:49 PM, lostrain A 
>> wrote:
>>
>> Ted,
>>   Thanks for the suggestions. Actually I tried both s3n and s3 and the
>> result remains the same.
>>
>>
>> On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu  wrote:
>>
>>> In your case, I would specify "fs.s3.awsAccessKeyId" /
>>> "fs.s3.awsSecretAccessKey" since you use s3 protocol.
>>>
>>> On Sun, Aug 23, 2015 at 11:03 AM, lostrain A <
>>> donotlikeworkingh...@gmail.com> wrote:
>>>
 Hi Ted,
   Thanks for the reply. I tried setting both the access key ID and the
 secret access key via

 sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")


 However, the error still occurs for ORC format.

 If I change the format to JSON, although the error does not go away, the
 JSON files can be saved successfully.




 On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu  wrote:

> You may have seen this:
> http://search-hadoop.com/m/q3RTtdSyM52urAyI
>
>
>
> On Aug 23, 2015, at 1:01 AM, lostrain A <
> donotlikeworkingh...@gmail.com> wrote:
>
> Hi,
>   I'm trying to save a simple dataframe to S3 in ORC format. The code
> is as follows:
>
>
>  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   import sqlContext.implicits._
>>   val df=sc.parallelize(1 to 1000).toDF()
>>   df.write.format("orc").save("s3://logs/dummy")
>
>
> I ran the above code in spark-shell and only the _SUCCESS file was
> saved under the directory.
> The last part of the spark-shell log said:
>
> 15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished
>> task 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal
>> (100/100)
>>
>
>
>> 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler:
>> ResultStage 2 (save at :29) finished in 0.834 s
>>
>
>
>> 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed
>> TaskSet 2.0, whose tasks have all completed, from pool
>>
>
>
>> 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at
>> :29, took 0.895912 s
>>
>
>
>> 15/08/23 07:38:24 main INFO
>> LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory:
>> /media/ephemeral0/s3/output-
>>
>
>
>> 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for
>> dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78,
>>  4, -23, -103, 9, -104, -20, -8, 66, 126]
>>
>
>
>> 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__
>> committed.
>
>
> Has anyone experienced this before?
> Thanks!
>
>
>

>>>
>>
>>
>


Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi Zhan,
  Thanks for the pointer. Yes, I'm using a cluster with spark-1.4.0, and it
looks like this is most likely the reason. I'll verify this again once we
make the upgrade.

Best,
los

On Sun, Aug 23, 2015 at 1:25 PM, Zhan Zhang  wrote:

> If you are using spark-1.4.0, it is probably caused by SPARK-8458
> 
>
> Thanks.
>
> Zhan Zhang
>
> On Aug 23, 2015, at 12:49 PM, lostrain A 
> wrote:
>
> Ted,
>   Thanks for the suggestions. Actually I tried both s3n and s3 and the
> result remains the same.
>
>
> On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu  wrote:
>
>> In your case, I would specify "fs.s3.awsAccessKeyId" /
>> "fs.s3.awsSecretAccessKey" since you use s3 protocol.
>>
>> On Sun, Aug 23, 2015 at 11:03 AM, lostrain A <
>> donotlikeworkingh...@gmail.com> wrote:
>>
>>> Hi Ted,
>>>   Thanks for the reply. I tried setting both the access key ID and the
>>> secret access key via
>>>
>>> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
 sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")
>>>
>>>
>>> However, the error still occurs for ORC format.
>>>
>>> If I change the format to JSON, although the error does not go away, the
>>> JSON files can be saved successfully.
>>>
>>>
>>>
>>>
>>> On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu  wrote:
>>>
 You may have seen this:
 http://search-hadoop.com/m/q3RTtdSyM52urAyI



 On Aug 23, 2015, at 1:01 AM, lostrain A 
 wrote:

 Hi,
   I'm trying to save a simple dataframe to S3 in ORC format. The code
 is as follows:


  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   import sqlContext.implicits._
>   val df=sc.parallelize(1 to 1000).toDF()
>   df.write.format("orc").save("s3://logs/dummy")


 I ran the above code in spark-shell and only the _SUCCESS file was
 saved under the directory.
 The last part of the spark-shell log said:

 15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished
> task 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal
> (100/100)
>


> 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler:
> ResultStage 2 (save at :29) finished in 0.834 s
>


> 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed
> TaskSet 2.0, whose tasks have all completed, from pool
>


> 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at
> :29, took 0.895912 s
>


> 15/08/23 07:38:24 main INFO
> LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory:
> /media/ephemeral0/s3/output-
>


> 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for
> dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78,
>  4, -23, -103, 9, -104, -20, -8, 66, 126]
>


> 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__
> committed.


 Has anyone experienced this before?
 Thanks!



>>>
>>
>
>


Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Zhan Zhang
If you are using spark-1.4.0, it is probably caused by
SPARK-8458

Thanks.

Zhan Zhang

On Aug 23, 2015, at 12:49 PM, lostrain A
<donotlikeworkingh...@gmail.com> wrote:

Ted,
  Thanks for the suggestions. Actually I tried both s3n and s3 and the result 
remains the same.


On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu
<yuzhih...@gmail.com> wrote:
In your case, I would specify "fs.s3.awsAccessKeyId" / 
"fs.s3.awsSecretAccessKey" since you use s3 protocol.

On Sun, Aug 23, 2015 at 11:03 AM, lostrain A
<donotlikeworkingh...@gmail.com> wrote:
Hi Ted,
  Thanks for the reply. I tried setting both the access key ID and the secret access key via

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")

However, the error still occurs for ORC format.

If I change the format to JSON, although the error does not go away, the JSON
files can be saved successfully.




On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu
<yuzhih...@gmail.com> wrote:
You may have seen this:
http://search-hadoop.com/m/q3RTtdSyM52urAyI



On Aug 23, 2015, at 1:01 AM, lostrain A
<donotlikeworkingh...@gmail.com> wrote:

Hi,
  I'm trying to save a simple dataframe to S3 in ORC format. The code is as 
follows:


 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
  import sqlContext.implicits._
  val df=sc.parallelize(1 to 1000).toDF()
  df.write.format("orc").save("s3://logs/dummy")

I ran the above code in spark-shell and only the _SUCCESS file was saved under 
the directory.
The last part of the spark-shell log said:

15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task 95.0 
in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)

15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 2 
(save at :29) finished in 0.834 s

15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed TaskSet 2.0, 
whose tasks have all completed, from pool

15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at :29, 
took 0.895912 s

15/08/23 07:38:24 main INFO LocalDirAllocator$AllocatorPerContext$DirSelector: 
Returning directory: /media/ephemeral0/s3/output-

15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for dummy/_SUCCESS is 
[-44, 29, -128, -39, -113, 0, -78,
 4, -23, -103, 9, -104, -20, -8, 66, 126]

15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__ committed.

Has anyone experienced this before?
Thanks!







Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Ted,
  Thanks for the suggestions. Actually I tried both s3n and s3 and the
result remains the same.
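
I.e. both of the following behave the same way for me (same example path as in
my earlier mail):

  // identical outcome with either URI scheme
  df.write.format("orc").save("s3://logs/dummy")
  df.write.format("orc").save("s3n://logs/dummy")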


On Sun, Aug 23, 2015 at 12:27 PM, Ted Yu  wrote:

> In your case, I would specify "fs.s3.awsAccessKeyId" /
> "fs.s3.awsSecretAccessKey" since you use s3 protocol.
>
> On Sun, Aug 23, 2015 at 11:03 AM, lostrain A <
> donotlikeworkingh...@gmail.com> wrote:
>
>> Hi Ted,
>>   Thanks for the reply. I tried setting both the access key ID and the
>> secret access key via
>>
>> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
>>> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")
>>
>>
>> However, the error still occurs for ORC format.
>>
>> If I change the format to JSON, although the error does not go away, the
>> JSON files can be saved successfully.
>>
>>
>>
>>
>> On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu  wrote:
>>
>>> You may have seen this:
>>> http://search-hadoop.com/m/q3RTtdSyM52urAyI
>>>
>>>
>>>
>>> On Aug 23, 2015, at 1:01 AM, lostrain A 
>>> wrote:
>>>
>>> Hi,
>>>   I'm trying to save a simple dataframe to S3 in ORC format. The code is
>>> as follows:
>>>
>>>
>>>  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
   import sqlContext.implicits._
   val df=sc.parallelize(1 to 1000).toDF()
   df.write.format("orc").save("s3://logs/dummy")
>>>
>>>
>>> I ran the above code in spark-shell and only the _SUCCESS file was saved
>>> under the directory.
>>> The last part of the spark-shell log said:
>>>
>>> 15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished
 task 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal
 (100/100)

>>>
>>>
 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler:
 ResultStage 2 (save at :29) finished in 0.834 s

>>>
>>>
 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed
 TaskSet 2.0, whose tasks have all completed, from pool

>>>
>>>
 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at
 :29, took 0.895912 s

>>>
>>>
 15/08/23 07:38:24 main INFO
 LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory:
 /media/ephemeral0/s3/output-

>>>
>>>
 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for
 dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78,
  4, -23, -103, 9, -104, -20, -8, 66, 126]

>>>
>>>
 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__
 committed.
>>>
>>>
>>> Has anyone experienced this before?
>>> Thanks!
>>>
>>>
>>>
>>
>


Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
In your case, I would specify "fs.s3.awsAccessKeyId" /
"fs.s3.awsSecretAccessKey" since you use s3 protocol.

On Sun, Aug 23, 2015 at 11:03 AM, lostrain A  wrote:

> Hi Ted,
>   Thanks for the reply. I tried setting both the access key ID and the secret access key via
>
> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
>> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")
>
>
> However, the error still occurs for ORC format.
>
> If I change the format to JSON, although the error does not go away, the
> JSON files can be saved successfully.
>
>
>
>
> On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu  wrote:
>
>> You may have seen this:
>> http://search-hadoop.com/m/q3RTtdSyM52urAyI
>>
>>
>>
>> On Aug 23, 2015, at 1:01 AM, lostrain A 
>> wrote:
>>
>> Hi,
>>   I'm trying to save a simple dataframe to S3 in ORC format. The code is
>> as follows:
>>
>>
>>  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>   import sqlContext.implicits._
>>>   val df=sc.parallelize(1 to 1000).toDF()
>>>   df.write.format("orc").save("s3://logs/dummy")
>>
>>
>> I ran the above code in spark-shell and only the _SUCCESS file was saved
>> under the directory.
>> The last part of the spark-shell log said:
>>
>> 15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task
>>> 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)
>>>
>>
>>
>>> 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler:
>>> ResultStage 2 (save at :29) finished in 0.834 s
>>>
>>
>>
>>> 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed
>>> TaskSet 2.0, whose tasks have all completed, from pool
>>>
>>
>>
>>> 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at
>>> :29, took 0.895912 s
>>>
>>
>>
>>> 15/08/23 07:38:24 main INFO
>>> LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory:
>>> /media/ephemeral0/s3/output-
>>>
>>
>>
>>> 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for
>>> dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78,
>>>  4, -23, -103, 9, -104, -20, -8, 66, 126]
>>>
>>
>>
>>> 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__
>>> committed.
>>
>>
>> Has anyone experienced this before?
>> Thanks!
>>
>>
>>
>


Re: Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi Ted,
  Thanks for the reply. I tried setting both the access key ID and the secret access key via

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "***")
> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "**")


However, the error still occurs for ORC format.

If I change the format to JSON, although the error does not go away, the
JSON files can be saved successfully.
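
For completeness, the two writes only differ in the format string (the JSON
path below is just a placeholder):

  // produces only a _SUCCESS marker under the target path
  df.write.format("orc").save("s3://logs/dummy")
  // the same call with "json" writes the output files as expected
  df.write.format("json").save("s3://logs/dummy_json")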




On Sun, Aug 23, 2015 at 5:51 AM, Ted Yu  wrote:

> You may have seen this:
> http://search-hadoop.com/m/q3RTtdSyM52urAyI
>
>
>
> On Aug 23, 2015, at 1:01 AM, lostrain A 
> wrote:
>
> Hi,
>   I'm trying to save a simple dataframe to S3 in ORC format. The code is
> as follows:
>
>
>  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   import sqlContext.implicits._
>>   val df=sc.parallelize(1 to 1000).toDF()
>>   df.write.format("orc").save("s3://logs/dummy")
>
>
> I ran the above code in spark-shell and only the _SUCCESS file was saved
> under the directory.
> The last part of the spark-shell log said:
>
> 15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task
>> 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)
>>
>
>
>> 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler: ResultStage
>> 2 (save at :29) finished in 0.834 s
>>
>
>
>> 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed
>> TaskSet 2.0, whose tasks have all completed, from pool
>>
>
>
>> 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at
>> :29, took 0.895912 s
>>
>
>
>> 15/08/23 07:38:24 main INFO
>> LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory:
>> /media/ephemeral0/s3/output-
>>
>
>
>> 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for
>> dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78,
>>  4, -23, -103, 9, -104, -20, -8, 66, 126]
>>
>
>
>> 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__
>> committed.
>
>
> Has anyone experienced this before?
> Thanks!
>
>
>


Re: Error when saving a dataframe as ORC file

2015-08-23 Thread Ted Yu
You may have seen this:
http://search-hadoop.com/m/q3RTtdSyM52urAyI



> On Aug 23, 2015, at 1:01 AM, lostrain A  
> wrote:
> 
> Hi,
>   I'm trying to save a simple dataframe to S3 in ORC format. The code is as 
> follows:
> 
> 
>>  val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   import sqlContext.implicits._
>>   val df=sc.parallelize(1 to 1000).toDF()
>>   df.write.format("orc").save("s3://logs/dummy")
> 
> I ran the above code in spark-shell and only the _SUCCESS file was saved 
> under the directory.
> The last part of the spark-shell log said:
> 
>> 15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task 
>> 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)
>  
>> 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler: ResultStage 2 
>> (save at :29) finished in 0.834 s
>  
>> 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed TaskSet 
>> 2.0, whose tasks have all completed, from pool
>  
>> 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at 
>> :29, took 0.895912 s
>  
>> 15/08/23 07:38:24 main INFO 
>> LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory: 
>> /media/ephemeral0/s3/output-
>  
>> 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for dummy/_SUCCESS 
>> is [-44, 29, -128, -39, -113, 0, -78,
>>  4, -23, -103, 9, -104, -20, -8, 66, 126]
>  
>> 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__ 
>> committed.
> 
> Has anyone experienced this before?
> Thanks!
>  


Error when saving a dataframe as ORC file

2015-08-23 Thread lostrain A
Hi,
  I'm trying to save a simple dataframe to S3 in ORC format. The code is as
follows:


 val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   import sqlContext.implicits._
>   val df=sc.parallelize(1 to 1000).toDF()
>   df.write.format("orc").save("s3://logs/dummy")


I ran the above code in spark-shell and only the _SUCCESS file was saved
under the directory.
The last part of the spark-shell log said:

15/08/23 07:38:23 task-result-getter-1 INFO TaskSetManager: Finished task
> 95.0 in stage 2.0 (TID 295) in 801 ms on ip-*-*-*-*.ec2.internal (100/100)
>


> 15/08/23 07:38:23 dag-scheduler-event-loop INFO DAGScheduler: ResultStage
> 2 (save at :29) finished in 0.834 s
>


> 15/08/23 07:38:23 task-result-getter-1 INFO YarnScheduler: Removed TaskSet
> 2.0, whose tasks have all completed, from pool
>


> 15/08/23 07:38:23 main INFO DAGScheduler: Job 2 finished: save at
> :29, took 0.895912 s
>


> 15/08/23 07:38:24 main INFO
> LocalDirAllocator$AllocatorPerContext$DirSelector: Returning directory:
> /media/ephemeral0/s3/output-
>


> 15/08/23 07:38:24 main ERROR NativeS3FileSystem: md5Hash for
> dummy/_SUCCESS is [-44, 29, -128, -39, -113, 0, -78,
>  4, -23, -103, 9, -104, -20, -8, 66, 126]
>


> 15/08/23 07:38:24 main INFO DefaultWriterContainer: Job job__
> committed.


Has anyone experienced this before?
Thanks!
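
A read-back check along these lines (same spark-shell session, hence the same
sqlContext) should show whether any rows actually landed:

  // with only a _SUCCESS marker present, this is expected to either fail to
  // find data/infer a schema or come back empty
  val readBack = sqlContext.read.format("orc").load("s3://logs/dummy")
  println(readBack.count())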