Try repartitioning the data, e.g.:

    val data = sc.textFile("/user/foo/myfiles/*").repartition(100)

Since the files are small, it shouldn't be a problem.

Thanks
Best Regards

On Tue, Dec 16, 2014 at 6:21 PM, bethesda <swearinge...@mac.com> wrote:

>
> Our job is creating what appears to be an inordinate number of very small
> tasks, which blow out our OS inode and file limits. Rather than
> continually upping those limits, we are seeking to understand whether our
> real problem is that too many tasks are running, perhaps because we are
> mis-configured or we are coding incorrectly.
>
> Rather than posting our actual code, I have re-created the essence of the
> matter in the shell with a directory of files simulating the data we deal
> with. We have three servers, each with 8G RAM.
>
> Given 1,000 files, each containing a string of 100 characters, in the
> myfiles directory:
>
>     val data = sc.textFile("/user/foo/myfiles/*")
>     val c = data.count
>
> The count operation produces 1,000 tasks. Is this normal?
>
>     val cart = data.cartesian(data)
>     cart.count
>
> The cartesian operation produces 1M tasks. I understand that the
> cartesian product of 1,000 items against itself is 1M; however, it seems
> the overhead of all this task creation, and the file I/O on all these
> tiny files, outweighs the gains of distributed computing. What am I
> missing here?
>
> Below is the truncated output of the count operation, in case it helps
> indicate a configuration problem.
>
> Thank you.
>
> scala> data.count
> 14/12/16 07:40:46 INFO FileInputFormat: Total input paths to process : 1000
> 14/12/16 07:40:47 INFO SparkContext: Starting job: count at <console>:15
> 14/12/16 07:40:47 INFO DAGScheduler: Got job 0 (count at <console>:15) with 1000 output partitions (allowLocal=false)
> 14/12/16 07:40:47 INFO DAGScheduler: Final stage: Stage 0(count at <console>:15)
> 14/12/16 07:40:47 INFO DAGScheduler: Parents of final stage: List()
> 14/12/16 07:40:47 INFO DAGScheduler: Missing parents: List()
> 14/12/16 07:40:47 INFO DAGScheduler: Submitting Stage 0 (/user/ds/randomfiles/* MappedRDD[3] at textFile at <console>:12), which has no missing parents
> 14/12/16 07:40:47 INFO MemoryStore: ensureFreeSpace(2400) called with curMem=507154, maxMem=278019440
> 14/12/16 07:40:47 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.3 KB, free 264.7 MB)
> 14/12/16 07:40:47 INFO MemoryStore: ensureFreeSpace(1813) called with curMem=509554, maxMem=278019440
> 14/12/16 07:40:47 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1813.0 B, free 264.7 MB)
> 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dev-myserver-1.abc.cloud:44041 (size: 1813.0 B, free: 265.1 MB)
> 14/12/16 07:40:47 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
> 14/12/16 07:40:47 INFO DAGScheduler: Submitting 1000 missing tasks from Stage 0 (/user/ds/randomfiles/* MappedRDD[3] at textFile at <console>:12)
> 14/12/16 07:40:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000 tasks
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 0, dev-myserver-2.abc.cloud, NODE_LOCAL, 1202 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 1, dev-myserver-3.abc.cloud, NODE_LOCAL, 1201 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 3, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 4, dev-myserver-3.abc.cloud, NODE_LOCAL, 1204 bytes)
> 14/12/16 07:40:47 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from [dev-myserver-3.abc.cloud/10.40.13.192:36133]
> 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from [dev-myserver-2.abc.cloud/10.40.13.195:35716]
> 14/12/16 07:40:47 INFO ConnectionManager: Accepted connection from [dev-myserver-1.abc.cloud/10.40.13.194:33728]
> 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to [dev-myserver-1.abc.cloud/10.40.13.194:49458]
> 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to [dev-myserver-3.abc.cloud/10.40.13.192:58579]
> 14/12/16 07:40:47 INFO SendingConnection: Initiating connection to [dev-myserver-2.abc.cloud/10.40.13.195:52502]
> 14/12/16 07:40:47 INFO SendingConnection: Connected to [dev-myserver-3.abc.cloud/10.40.13.192:58579], 1 messages pending
> 14/12/16 07:40:47 INFO SendingConnection: Connected to [dev-myserver-1.abc.cloud/10.40.13.194:49458], 1 messages pending
> 14/12/16 07:40:47 INFO SendingConnection: Connected to [dev-myserver-2.abc.cloud/10.40.13.195:52502], 1 messages pending
> 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dev-myserver-1.abc.cloud:49458 (size: 1813.0 B, free: 1060.0 MB)
> 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dev-myserver-3.abc.cloud:58579 (size: 1813.0 B, free: 1060.0 MB)
> 14/12/16 07:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dev-myserver-2.abc.cloud:52502 (size: 1813.0 B, free: 1060.0 MB)
> 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on dev-myserver-3.abc.cloud:58579 (size: 32.5 KB, free: 1060.0 MB)
> 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on dev-myserver-2.abc.cloud:52502 (size: 32.5 KB, free: 1060.0 MB)
> 14/12/16 07:40:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on dev-myserver-1.abc.cloud:49458 (size: 32.5 KB, free: 1060.0 MB)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, dev-myserver-1.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 19.0 in stage 0.0 (TID 8, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 23.0 in stage 0.0 (TID 9, dev-myserver-2.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 10, dev-myserver-3.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 11, dev-myserver-3.abc.cloud, NODE_LOCAL, 1203 bytes)
> 14/12/16 07:40:49 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 1964 ms on dev-myserver-1.abc.cloud (1/1000)
> 14/12/16 07:40:49 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 0) in 2000 ms on dev-myserver-2.abc.cloud (2/1000)
> 14/12/16 07:40:49 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 1975 ms on dev-myserver-1.abc.cloud (3/1000)
> ....
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-so-many-tasks-tp20712.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
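
The numbers in the thread follow directly from partition arithmetic: textFile over many small HDFS files yields roughly one partition per file, count runs one task per partition, and cartesian creates one partition for every pair of parent partitions. A minimal sketch of that arithmetic (plain Scala, no Spark required; the object and method names are illustrative, not Spark API):

```scala
// Back-of-the-envelope model of the task counts reported in this thread.
object TaskCountSketch {
  // count schedules one task per partition of the RDD.
  def countTasks(partitions: Int): Int = partitions

  // cartesian builds one partition per pair of parent partitions.
  def cartesianTasks(left: Int, right: Int): Int = left * right

  def main(args: Array[String]): Unit = {
    val filePartitions = 1000 // 1,000 tiny input files -> ~1,000 partitions

    println(countTasks(filePartitions))                     // 1000
    println(cartesianTasks(filePartitions, filePartitions)) // 1000000

    // After .repartition(100), the same cartesian is far cheaper:
    val repartitioned = 100
    println(cartesianTasks(repartitioned, repartitioned))   // 10000
  }
}
```

This is why repartitioning to 100 partitions before the cartesian cuts the job from 1,000,000 tasks to 10,000, at the cost of one shuffle of the (tiny) input data.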