Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Cheolsoo Park
Alex, see this jira-
https://issues.apache.org/jira/browse/SPARK-9926
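
For anyone who finds this thread later: the usual way to hand that property to
the Hadoop input format from Spark is through the Hadoop configuration, e.g.
(a rough sketch; 25 is just an illustrative value, and whether Spark's file
listing path actually honors it is exactly what this thread is about):

  // On the driver, before creating the HadoopRDD (textFile, hadoopFile, ...).
  // Equivalent to passing
  //   --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=25
  // to spark-submit, since spark.hadoop.* keys are copied into the Hadoop conf.
  sc.hadoopConfiguration.setInt(
    "mapreduce.input.fileinputformat.list-status.num-threads", 25)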

On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:

> Ran into this need myself. Does Spark have an equivalent of
> "mapreduce.input.fileinputformat.list-status.num-threads"?
>
> Thanks.
>
> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <piaozhe...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am wondering if anyone has successfully enabled
>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>> and Pig). But for some reason, this property does not take effect in Spark's
>> HadoopRDD, resulting in a serious delay in file listing.
>>
>> I verified that the property is indeed set in HadoopRDD by logging the
>> value of the property in the getPartitions() function. I also tried to
>> attach VisualVM to Spark and Pig clients, which look as follows-
>>
>> In Pig, I can see 25 threads running in parallel for file listing-
>> [image: Inline image 1]
>>
>> In Spark, I only see 2 threads running in parallel for file listing-
>> [image: Inline image 2]
>>
>> What's strange is that the # of concurrent threads in Spark is throttled
>> no matter how high I
>> set "mapreduce.input.fileinputformat.list-status.num-threads".
>>
>> Is anyone using Spark with this property enabled? If so, can you please
>> share how you do it?
>>
>> Thanks!
>> Cheolsoo
>>
>
>


Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2015-07-23 Thread Cheolsoo Park
Hi,

I am wondering if anyone has successfully enabled
mapreduce.input.fileinputformat.list-status.num-threads in Spark jobs. I
usually set this property to 25 to speed up file listing in MR jobs (Hive
and Pig). But for some reason, this property does not take effect in Spark's
HadoopRDD, resulting in a serious delay in file listing.

I verified that the property is indeed set in HadoopRDD by logging the
value of the property in the getPartitions() function. I also tried to
attach VisualVM to Spark and Pig clients, which look as follows-

In Pig, I can see 25 threads running in parallel for file listing-
[image: Inline image 1]

In Spark, I only see 2 threads running in parallel for file listing-
[image: Inline image 2]

What's strange is that the # of concurrent threads in Spark is throttled no
matter how high I
set mapreduce.input.fileinputformat.list-status.num-threads.
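
(A roughly equivalent driver-side check, in case anyone wants to reproduce
this, would be something like the sketch below; the fallback of 1 is an
assumption about Hadoop's default when the property is unset.)

  val listThreads = sc.hadoopConfiguration.getInt(
    "mapreduce.input.fileinputformat.list-status.num-threads", 1)
  println(s"list-status num-threads = $listThreads")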

Is anyone using Spark with this property enabled? If so, can you please
share how you do it?

Thanks!
Cheolsoo


Re: SparkSQL failing while writing into S3 for 'insert into table'

2015-05-23 Thread Cheolsoo Park
 It seems it generated the query results into a tmp dir first and then tried to
rename them into the final folder, but the rename failed.

This problem exists not only in SparkSQL but also in any Hadoop tool (e.g.
Hive, Pig, etc.) when used with S3. It is usually better to write task
outputs to local disk and copy them to the final S3 location in the task
commit phase. In fact, this is how EMR Hive does insert overwrite, and
that's why EMR Hive works well with S3 while Apache Hive doesn't.

If you look at SparkHiveWriterContainer, you will see how it mimics a Hadoop
task. Basically, you can modify that code to make it write to local disk
first and then commit to the final S3 location. Actually, I am doing the
same at work in the 1.4 branch.
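
A minimal sketch of that idea (not the actual SparkHiveWriterContainer code;
the names localTmpDir and finalS3Path are made up, and attempt-id handling and
error handling are omitted):

  import java.io.File
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileUtil, Path}

  def commitTaskOutput(localTmpDir: File, finalS3Path: Path, conf: Configuration): Unit = {
    // The task writes all of its output files under localTmpDir on local disk.
    // At commit time, copy them to the final S3 location in one pass instead of
    // relying on a directory rename, which S3 implements as copy + delete per object.
    val destFs = finalS3Path.getFileSystem(conf)
    FileUtil.copy(localTmpDir, destFs, finalS3Path, false /* deleteSource */, conf)
  }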


On Fri, May 22, 2015 at 5:50 PM, ogoh oke...@gmail.com wrote:


 Hello,
 I am using Spark 1.3 and Hive 0.13.1 in AWS.
 From Spark-SQL, when running a Hive query to export the query result into
 AWS S3, it failed with the following message:
 ==
 org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:
 s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1
 has nested directory
 s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1/_temporary
 at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)
 at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)
 at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)
 at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)
 at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)
 at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
 at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)
 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
 at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
 ==

 The query tested is

 spark-sql> create external table s3_dwserver_sql_t1 (q string) location
 's3://test-dev/s3_dwserver_sql_t1';

 spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search
 where pdate='2015-05-12' limit 100;
 ==

 It seems it generated the query results into a tmp dir first and then tried
 to rename them into the final folder, but the rename failed.

 I appreciate any advice.
 Thanks,
 Okehee





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-failing-while-writing-into-S3-for-insert-into-table-tp23000.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: dynamicAllocation spark-shell

2015-04-23 Thread Cheolsoo Park
Hi,

 Attempted to request a negative number of executor(s) -663 from the
cluster manager. Please specify a positive number!

This is a bug in dynamic allocation. Here is the jira-
https://issues.apache.org/jira/browse/SPARK-6954
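
For what it's worth, the usual spark-shell setup for dynamic allocation on
YARN looks like the following (illustrative values; it assumes the external
shuffle service is running on the NodeManagers, and it does not work around
the negative-request bug above):

  spark-shell \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=2 \
    --conf spark.dynamicAllocation.executorIdleTimeout=600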

Thanks!
Cheolsoo

On Thu, Apr 23, 2015 at 7:57 AM, Michael Stone mst...@mathom.us wrote:

 If I enable dynamicAllocation and then use spark-shell or pyspark, things
 start out working as expected: running simple commands causes new executors
 to start and complete tasks. If the shell is left idle for a while,
 executors start getting killed off:

 15/04/23 10:52:43 INFO cluster.YarnClientSchedulerBackend: Requesting to
 kill executor(s) 368
 15/04/23 10:52:43 INFO spark.ExecutorAllocationManager: Removing executor
 368 because it has been idle for 600 seconds (new desired total will be 665)

 That makes sense. But the action also results in error messages:

 15/04/23 10:52:47 ERROR cluster.YarnScheduler: Lost executor 368 on
 hostname: remote Akka client disassociated
 15/04/23 10:52:47 INFO scheduler.DAGScheduler: Executor lost: 368 (epoch 0)
 15/04/23 10:52:47 INFO spark.ExecutorAllocationManager: Existing executor
 368 has been removed (new total is 665)
 15/04/23 10:52:47 INFO storage.BlockManagerMasterActor: Trying to remove
 executor 368 from BlockManagerMaster.
 15/04/23 10:52:47 INFO storage.BlockManagerMasterActor: Removing block
 manager BlockManagerId(368, hostname, 35877)
 15/04/23 10:52:47 INFO storage.BlockManagerMaster: Removed 368
 successfully in removeExecutor

 After that, trying to run a simple command results in:

 15/04/23 10:13:30 ERROR util.Utils: Uncaught exception in thread
 spark-dynamic-executor-allocation-0
 java.lang.IllegalArgumentException: Attempted to request a negative number
 of executor(s) -663 from the cluster manager. Please specify a positive
 number!

 And then only the single remaining executor attempts to complete the new
 tasks. Am I missing some kind of simple configuration item, are other
 people seeing the same behavior as a bug, or is this actually expected?

 Mike Stone
