Creating an RDD from a wildcard like this:
val data = sc.textFile("/user/foo/myfiles/*")

will create one partition per file found: 1,000 files = 1,000 partitions.
A task applies a stage (a pipelined sequence of transformations) to a
single partition, so 1,000 partitions = 1,000 tasks per stage.

You can reduce the number of partitions at any point with rdd.coalesce:
val coalescedRDD = data.coalesce(10)  // 10 partitions
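
That also explains the cartesian blow-up: cartesian pairs every partition
of one RDD with every partition of the other, one task per pair, so
coalescing first shrinks the task count quadratically. A rough sketch of
the arithmetic (plain Scala, no SparkContext needed; the numbers just
mirror your example):

```scala
object TaskCountSketch {
  // textFile on many small files: one partition per file.
  val filePartitions = 1000

  // count runs one task per partition in its stage.
  val countTasks = filePartitions

  // cartesian pairs every left partition with every right partition.
  val cartesianTasks = filePartitions * filePartitions  // 1,000,000

  // Coalescing both sides to 10 partitions first cuts that to 10 x 10.
  val coalescedPartitions = 10
  val coalescedCartesianTasks = coalescedPartitions * coalescedPartitions

  def main(args: Array[String]): Unit =
    println(s"count: $countTasks tasks, cartesian: $cartesianTasks tasks, " +
      s"cartesian after coalesce: $coalescedCartesianTasks tasks")
}
```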

-kr, Gerard.
@maasg




On Tue, Dec 16, 2014 at 1:51 PM, bethesda <swearinge...@mac.com> wrote:
>
> Our job is creating what appears to be an inordinate number of very small
> tasks, which blow out our os inode and file limits.  Rather than
> continually
> upping those limits, we are seeking to understand whether our real problem
> is that too many tasks are running, perhaps because we are mis-configured
> or
> we are coding incorrectly.
>
> Rather than posting our actual code I have re-created the essence of the
> matter in the shell with a directory of files simulating the data we deal
> with.  We have three servers, each with 8G RAM.
>
> Given 1,000 files, each containing a string of 100 characters, in the
> myfiles directory:
>
> val data = sc.textFile("/user/foo/myfiles/*")
>
> val c = data.count
>
> The count operation produces 1,000 tasks.  Is this normal?
>
> val cart = data.cartesian(data)
> cart.count
>
> The cartesian operation produces 1M tasks.  I understand that the cartesian
> product of 1,000 items against itself is 1M, however, it seems the overhead
> of all this task creation and file I/O of all these tiny files outweighs
> the
> gains of distributed computing.  What am I missing here?
>
> Below is the truncated output of the count operation, if this helps
> indicate
> a configuration problem.
>
> Thank you.
>
> scala> data.count
> 14/12/16 07:40:46 INFO FileInputFormat: Total input paths to process : 1000
> 14/12/16 07:40:47 INFO SparkContext: Starting job: count at <console>:15
> 14/12/16 07:40:47 INFO DAGScheduler: Got job 0 (count at <console>:15) with
> 1000 output partitions (allowLocal=false)
> 14/12/16 07:40:47 INFO DAGScheduler: Final stage: Stage 0(count at
> <console>:15)
> 14/12/16 07:40:47 INFO DAGScheduler: Parents of final stage: List()
> 14/12/16 07:40:47 INFO DAGScheduler: Missing parents: List()
> 14/12/16 07:40:47 INFO DAGScheduler: Submitting Stage 0
> (/user/ds/randomfiles/* MappedRDD[3] at textFile at <console>:12), which
> has
> no missing parents
> 14/12/16 07:40:47 INFO MemoryStore: ensureFreeSpace(2400) called with
> curMem=507154, maxMem=278019440
> 14/12/16 07:40:47 INFO MemoryStore: Block broadcast_2 stored as values in
> memory (estimated size 2.3 KB, free 264.7 MB)
> 14/12/16 07:40:47 INFO MemoryStore: ensureFreeSpace(1813) called with
> curMem=509554, maxMem=278019440
> 14/12/16 07:40:47 INFO MemoryStore: Block broadcast_2_piece0 stored as
> bytes
> in memory (estimated size 1813.0 B, free 264.7 MB)
> 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
> on dev-myserver-1.abc.cloud:44041 (size: 1813.0 B, free: 265.1 MB)
> 14/12/16 07:40:47 INFO BlockManagerMaster: Updated info of block
> broadcast_2_piece0
> 14/12/16 07:40:47 INFO DAGScheduler: Submitting 1000 missing tasks from
> Stage 0 (/user/ds/randomfiles/* MappedRDD[3] at textFile at <console>:12)
> 14/12/16 07:40:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000
> tasks
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
> 0, dev-myserver-2.abc.cloud, NODE_LOCAL, 1202 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 1, dev-myserver-3.abc.cloud, NODE_LOCAL, 1201 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID
> 2, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID
> 3, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
> 4, dev-myserver-3.abc.cloud, NODE_LOCAL, 1204 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID
> 5, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from
> [dev-myserver-3.abc.cloud/10.40.13.192:36133]
> 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from
> [dev-myserver-2.abc.cloud/10.40.13.195:35716]
> 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from
> [dev-myserver-1.abc.cloud/10.40.13.194:33728]
> 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to
> [dev-myserver-1.abc.cloud/10.40.13.194:49458]
> 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to
> [dev-myserver-3.abc.cloud/10.40.13.192:58579]
> 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to
> [dev-myserver-2.abc.cloud/10.40.13.195:52502]
> 14/12/16 07:40:47 INFO SendingConnection: Connected to
> [dev-myserver-3.abc.cloud/10.40.13.192:58579], 1 messages pending
> 14/12/16 07:40:47 INFO SendingConnection: Connected to
> [dev-myserver-1.abc.cloud/10.40.13.194:49458], 1 messages pending
> 14/12/16 07:40:47 INFO SendingConnection: Connected to
> [dev-myserver-2.abc.cloud/10.40.13.195:52502], 1 messages pending
> 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
> on dev-myserver-1.abc.cloud:49458 (size: 1813.0 B, free: 1060.0 MB)
> 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
> on dev-myserver-3.abc.cloud:58579 (size: 1813.0 B, free: 1060.0 MB)
> 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
> on dev-myserver-2.abc.cloud:52502 (size: 1813.0 B, free: 1060.0 MB)
> 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory
> on dev-myserver-3.abc.cloud:58579 (size: 32.5 KB, free: 1060.0 MB)
> 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory
> on dev-myserver-2.abc.cloud:52502 (size: 32.5 KB, free: 1060.0 MB)
> 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory
> on dev-myserver-1.abc.cloud:49458 (size: 32.5 KB, free: 1060.0 MB)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID
> 6, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID
> 7, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 19.0 in stage 0.0 (TID
> 8, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 23.0 in stage 0.0 (TID
> 9, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID
> 10, dev-myserver-3.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID
> 11, dev-myserver-3.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID
> 2) in 1964 ms on dev-myserver-1.abc.cloud (1/1000)
> 14/12/16 07:40:49 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID
> 0) in 2000 ms on dev-myserver-2.abc.cloud (2/1000)
> 14/12/16 07:40:49 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID
> 5) in 1975 ms on dev-myserver-1.abc.cloud (3/1000)
> ....
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-so-many-tasks-tp20712.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
