Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?
Alex, see this jira:
https://issues.apache.org/jira/browse/SPARK-9926

On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <alex.nastet...@vervemobile.com> wrote:

> Ran into this need myself. Does Spark have an equivalent of
> "mapreduce.input.fileinputformat.list-status.num-threads"?
>
> Thanks.
>
> On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park <piaozhe...@gmail.com> wrote:
>
>> Hi,
>>
>> I am wondering if anyone has successfully enabled
>> "mapreduce.input.fileinputformat.list-status.num-threads" in Spark jobs. I
>> usually set this property to 25 to speed up file listing in MR jobs (Hive
>> and Pig). But for some reason, this property does not take effect in Spark's
>> HadoopRDD, resulting in serious delays in file listing.
>>
>> I verified that the property is indeed set in HadoopRDD by logging its
>> value in the getPartitions() function. I also attached VisualVM to the
>> Spark and Pig clients, with the following results:
>>
>> In Pig, I can see 25 threads running in parallel for file listing.
>> [image: Inline image 1]
>>
>> In Spark, I only see 2 threads running in parallel for file listing.
>> [image: Inline image 2]
>>
>> What's strange is that the number of concurrent threads in Spark stays
>> throttled no matter how high I set
>> "mapreduce.input.fileinputformat.list-status.num-threads".
>>
>> Is anyone using Spark with this property enabled? If so, can you please
>> share how you do it?
>>
>> Thanks!
>> Cheolsoo
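For reference, Spark forwards any configuration key with the `spark.hadoop.` prefix into the underlying Hadoop Configuration, which is one way to attempt setting this property; whether HadoopRDD actually honors it depends on the Spark version (see SPARK-9926). A sketch, with `my_job.py` as a placeholder application:

```
spark-submit \
  --conf spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads=25 \
  my_job.py
```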
Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?
Hi,

I am wondering if anyone has successfully enabled
mapreduce.input.fileinputformat.list-status.num-threads in Spark jobs. I
usually set this property to 25 to speed up file listing in MR jobs (Hive
and Pig). But for some reason, this property does not take effect in Spark's
HadoopRDD, resulting in serious delays in file listing.

I verified that the property is indeed set in HadoopRDD by logging its value
in the getPartitions() function. I also attached VisualVM to the Spark and
Pig clients, with the following results:

In Pig, I can see 25 threads running in parallel for file listing.
[image: Inline image 1]

In Spark, I only see 2 threads running in parallel for file listing.
[image: Inline image 2]

What's strange is that the number of concurrent threads in Spark stays
throttled no matter how high I set
mapreduce.input.fileinputformat.list-status.num-threads.

Is anyone using Spark with this property enabled? If so, can you please
share how you do it?

Thanks!
Cheolsoo
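As background, this property controls how many threads FileInputFormat uses to stat input directories in parallel when computing splits. The same idea can be illustrated outside Hadoop with a minimal, self-contained Python sketch (the function and directory names here are hypothetical, purely for illustration):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def list_status(dirs, num_threads):
    """List all files under each directory, stat-ing directories in
    parallel -- analogous to what list-status.num-threads enables in
    FileInputFormat's input listing."""
    def list_one(d):
        return [os.path.join(d, f) for f in sorted(os.listdir(d))]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = pool.map(list_one, dirs)  # preserves input order
    return [path for files in results for path in files]

# Demo: a temporary layout of 4 directories with 2 files each.
root = tempfile.mkdtemp()
dirs = []
for i in range(4):
    d = os.path.join(root, f"part{i}")
    os.makedirs(d)
    for j in range(2):
        open(os.path.join(d, f"f{j}.txt"), "w").close()
    dirs.append(d)

files = list_status(dirs, num_threads=4)
print(len(files))  # 8
```

With many input partitions on a slow filesystem (or an object store), raising the thread count can cut listing time roughly proportionally, which is why the property matters for jobs with thousands of input directories.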
Re: SparkSQL failing while writing into S3 for 'insert into table'
It seems it generated the query results into a tmp dir first, and then tried
to rename them into the final folder, but failed while renaming. This problem
exists not only in SparkSQL but also in other Hadoop tools (e.g. Hive, Pig,
etc.) when used with S3.

Usually, it is better to write task outputs to local disk and copy them to
the final S3 location in the task commit phase. In fact, this is how EMR Hive
does insert overwrite, and that's why EMR Hive works well with S3 while
Apache Hive doesn't.

If you look at SparkHiveWriterContainer, you will see how it mimics a Hadoop
task. Basically, you can modify that code to make it write to local disk
first and then commit to the final S3 location. Actually, I am doing the same
at work in the 1.4 branch.

On Fri, May 22, 2015 at 5:50 PM, ogoh <oke...@gmail.com> wrote:

> Hello,
>
> I am using Spark 1.3 and Hive 0.13.1 in AWS. From Spark-SQL, when running a
> Hive query to export the query result into AWS S3, it failed with the
> following message:
>
> ==
> org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:
> s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1
> has nested directory
> s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1/_temporary
> at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)
> at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
> at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
> ==
>
> The queries tested are:
>
> spark-sql> create external table s3_dwserver_sql_t1 (q string) location 's3://test-dev/s3_dwserver_sql_t1';
> spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search where pdate='2015-05-12' limit 100;
>
> ==
> It seems it generated the query results into a tmp dir first, and then
> tried to rename them into the final folder, but failed while renaming.
>
> I appreciate any advice.
>
> Thanks,
> Okehee
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-failing-while-writing-into-S3-for-insert-into-table-tp23000.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
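The write-local-then-commit pattern described in the reply can be illustrated outside Spark. This is a minimal Python sketch of the idea, not Spark's actual writer code (`run_task` and its arguments are hypothetical names; a real implementation would upload to S3 in the commit step rather than copy on a local filesystem):

```python
import os
import shutil
import tempfile

def run_task(final_dir, task_id, rows):
    """Write task output to a local scratch directory first, then
    'commit' the finished file to the final location in one step,
    instead of writing into final_dir/_temporary and renaming there
    (rename is slow and non-atomic on S3-like object stores)."""
    local_dir = tempfile.mkdtemp(prefix=f"task-{task_id}-")
    local_file = os.path.join(local_dir, f"part-{task_id:05d}")
    with open(local_file, "w") as f:
        for row in rows:
            f.write(row + "\n")
    # Commit phase: move the completed output to its final location.
    os.makedirs(final_dir, exist_ok=True)
    shutil.copy(local_file, final_dir)
    shutil.rmtree(local_dir)
    return os.path.join(final_dir, f"part-{task_id:05d}")

out_dir = tempfile.mkdtemp()
committed = run_task(out_dir, 0, ["a", "b", "c"])
print(open(committed).read())
```

Because the final location only ever sees complete files, a failed or speculative task leaves no `_temporary` debris behind, which is exactly the nested-directory problem in the stack trace above.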
Re: dynamicAllocation spark-shell
Hi,

> Attempted to request a negative number of executor(s) -663 from the
> cluster manager. Please specify a positive number!

This is a bug in dynamic allocation. Here is the jira:
https://issues.apache.org/jira/browse/SPARK-6954

Thanks!
Cheolsoo

On Thu, Apr 23, 2015 at 7:57 AM, Michael Stone <mst...@mathom.us> wrote:

> If I enable dynamicAllocation and then use spark-shell or pyspark, things
> start out working as expected: running simple commands causes new executors
> to start and complete tasks. If the shell is left idle for a while,
> executors start getting killed off:
>
> 15/04/23 10:52:43 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 368
> 15/04/23 10:52:43 INFO spark.ExecutorAllocationManager: Removing executor 368 because it has been idle for 600 seconds (new desired total will be 665)
>
> That makes sense. But the action also results in error messages:
>
> 15/04/23 10:52:47 ERROR cluster.YarnScheduler: Lost executor 368 on hostname: remote Akka client disassociated
> 15/04/23 10:52:47 INFO scheduler.DAGScheduler: Executor lost: 368 (epoch 0)
> 15/04/23 10:52:47 INFO spark.ExecutorAllocationManager: Existing executor 368 has been removed (new total is 665)
> 15/04/23 10:52:47 INFO storage.BlockManagerMasterActor: Trying to remove executor 368 from BlockManagerMaster.
> 15/04/23 10:52:47 INFO storage.BlockManagerMasterActor: Removing block manager BlockManagerId(368, hostname, 35877)
> 15/04/23 10:52:47 INFO storage.BlockManagerMaster: Removed 368 successfully in removeExecutor
>
> After that, trying to run a simple command results in:
>
> 15/04/23 10:13:30 ERROR util.Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
> java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -663 from the cluster manager. Please specify a positive number!
>
> And then only the single remaining executor attempts to complete the new
> tasks. Am I missing some kind of simple configuration item, are other
> people seeing the same behavior as a bug, or is this actually expected?
>
> Mike Stone
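For context, these are the dynamic-allocation settings in play here; the "idle for 600 seconds" kills in the log correspond to the executor idle timeout. A sketch of a `spark-defaults.conf`, with min/max values chosen purely for illustration:

```
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.executorIdleTimeout  600s
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         700
```

The external shuffle service is required for dynamic allocation on YARN, since executors holding shuffle data would otherwise be unsafe to remove.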