Hi guys,

It seems my problem is related to his question as well. I am running
standalone Spark 1.4.1 on my local machine.

I have 10 partitions with data skew on partitions 1 and 4:
partition: [(0, 0), (1, 15593259), (2, 0), (3, 0), (4, 20695601), (5, 0),
(6, 0), (7, 0), (8, 0), (9, 0)] and elements: >>

Now when I try rdd.repartition(10), I get errors like
ERROR Executor: Exception in task 1.0 in stage 10.0 (TID 61)
java.lang.OutOfMemoryError: Java heap space
and an ERROR DiskBlockObjectWriter in the stack trace.

I also tried increasing "spark.executor.memory" to "4g", but I get the same
errors again:
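
For reference, here is a minimal sketch of the kind of thing I am doing (not
the actual record_linker_spark.py; the skewed data and the per-partition
counter are just stand-ins for my data and my index_no_elements helper). If I
understand correctly, in local mode the tasks run inside the driver JVM, so
the heap probably has to be raised with --driver-memory (e.g. spark-submit
--driver-memory 4g record_linker_spark.py) rather than spark.executor.memory:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "repartition-skew-sketch")

    # Stand-in for my skewed RDD: partitions 1 and 4 hold all the elements.
    skewed = sc.parallelize(range(10), 10).flatMap(
        lambda i: xrange(1500000) if i in (1, 4) else [])

    def count_per_partition(idx, it):
        # same idea as my index_no_elements helper
        yield (idx, sum(1 for _ in it))

    print skewed.mapPartitionsWithIndex(count_per_partition).collect()

    # repartition(10) shuffles everything; the tasks that read the two heavy
    # partitions still have to serialize all of their data, which is where
    # the OutOfMemoryError shows up for me.
    balanced = skewed.repartition(10)
    print balanced.mapPartitionsWithIndex(count_per_partition).collect()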

15/09/02 21:12:43 ERROR Executor: Exception in task 1.0 in stage 10.0 (TID
61)
java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:113)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
at
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:105)
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/09/02 21:12:43 WARN TaskSetManager: Lost task 1.0 in stage 10.0 (TID 61,
localhost): java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:113)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
at
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:105)
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

15/09/02 21:12:43 ERROR TaskSetManager: Task 1 in stage 10.0 failed 1
times; aborting job
15/09/02 21:12:43 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:113)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
at
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:105)
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/09/02 21:12:43 INFO SparkContext: Invoking stop() from shutdown hook
15/09/02 21:12:43 INFO TaskSchedulerImpl: Cancelling stage 10
15/09/02 21:12:43 INFO Executor: Executor is trying to kill task 4.0 in
stage 10.0 (TID 64)
15/09/02 21:12:43 INFO TaskSchedulerImpl: Stage 10 was cancelled
15/09/02 21:12:43 INFO DAGScheduler: ShuffleMapStage 10 (repartition at
NativeMethodAccessorImpl.java:-2) failed in 102.132 s
15/09/02 21:12:43 INFO DAGScheduler: Job 4 failed: collect at
/Users/shahid/projects/spark_rl/record_linker_spark.py:74, took 102.154710 s
Traceback (most recent call last):
  File "/Users/shahid/projects/spark_rl/record_linker_spark.py", line 121,
in <module>
15/09/02 21:12:43 INFO SparkUI: Stopped Spark web UI at
http://192.168.1.2:4040
15/09/02 21:12:43 INFO DAGScheduler: Stopping DAGScheduler
15/09/02 21:12:43 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
15/09/02 21:12:43 INFO Utils: path =
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljc0000gn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e/blockmgr-e8e82c2e-ca87-4667-a330-b1edb83aa81f,
already present as root for deletion.
15/09/02 21:12:43 INFO MemoryStore: MemoryStore cleared
15/09/02 21:12:43 INFO BlockManager: BlockManager stopped
15/09/02 21:12:43 INFO BlockManagerMaster: BlockManagerMaster stopped
15/09/02 21:12:43 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Shutting
down remote daemon.
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Remote
daemon shut down; proceeding with flushing remote transports.
15/09/02 21:12:43 WARN PythonRDD: Incomplete task interrupted: Attempting
to kill Python Worker
15/09/02 21:12:43 INFO RemoteActorRefProvider$RemotingTerminator: Remoting
shut down.
    R.print_no_elements(mathes_grp)
  File "/Users/shahid/projects/spark_rl/record_linker_spark.py", line 74,
in print_no_elements
    print "\n* * * * * * * * * * * * * * *\n<< partition: %s and elements:
>> \n* * * * * * * * * * * * * * *\n" %
rdd.mapPartitionsWithIndex(self.index_no_elements).collect()
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
line 757, in collect
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File
"/Users/shahid/projects/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage
10.0 (TID 61, localhost): java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:113)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:97)
at
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:105)
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:375)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:208)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

15/09/02 21:12:43 INFO SparkContext: Successfully stopped SparkContext
15/09/02 21:12:43 INFO Utils: Shutdown hook called
15/09/02 21:12:43 INFO Utils: Deleting directory
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljc0000gn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e/pyspark-090350a4-e0ab-4692-8360-b6be8d64d1fe
15/09/02 21:12:43 INFO Utils: Deleting directory
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljc0000gn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e
15/09/02 21:12:45 ERROR DiskBlockObjectWriter: Uncaught exception while
reverting partial writes to file
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljc0000gn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e/blockmgr-e8e82c2e-ca87-4667-a330-b1edb83aa81f/05/temp_shuffle_f80f61c5-4fb6-4d46-a81b-239095ebf20d
java.io.FileNotFoundException:
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljc0000gn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e/blockmgr-e8e82c2e-ca87-4667-a330-b1edb83aa81f/05/temp_shuffle_f80f61c5-4fb6-4d46-a81b-239095ebf20d
(No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at
org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(BlockObjectWriter.scala:189)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$stop$2.apply(ExternalSorter.scala:807)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$stop$2.apply(ExternalSorter.scala:806)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.util.collection.ExternalSorter.stop(ExternalSorter.scala:806)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:94)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:76)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/09/02 21:12:46 ERROR Utils: Exception while deleting Spark temp dir:
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljc0000gn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e
java.io.IOException: Failed to delete:
/private/var/folders/5f/3v7mdvrn02lcfczhr9my0ljc0000gn/T/spark-79bb5b4f-9be4-47ed-a722-bfc79880622e
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:963)
at
org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:204)
at
org.apache.spark.util.Utils$$anonfun$1$$anonfun$apply$mcV$sp$5.apply(Utils.scala:201)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.util.Utils$$anonfun$1.apply$mcV$sp(Utils.scala:201)
at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2278)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2278)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2278)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2278)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2278)
at
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2278)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2278)
at
org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2260)
at
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

On Wed, Sep 2, 2015 at 8:41 PM, alexis GILLAIN <ila...@hotmail.com> wrote:

> Just made some tests on my laptop.
>
> Deletion of the files is not immediate, but a System.gc() call does the
> job on the shuffle files of a checkpointed RDD.
> It should solve your problem (`sc._jvm.System.gc()` in Python as pointed
> in the databricks link in my previous message).
>
>
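
A rough PySpark sketch of this suggestion (sc._jvm is an internal hook rather
than a public API, and the data, checkpoint directory and loop below are
placeholders, not anyone's actual job):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "gc-shuffle-cleanup-sketch")
    sc.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder directory

    data = sc.parallelize(range(100000), 50)
    for i in range(100):
        data2 = data.repartition(50).cache()
        if (i + 1) % 10 == 0:
            data2.checkpoint()
        data2.count()            # materialize the new RDD
        data.unpersist()         # drop the reference to the previous one
        data = data2
        if (i + 1) % 10 == 0:
            sc._jvm.System.gc()  # nudge the JVM so unreferenced shuffle files can be cleaned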
> 2015-09-02 20:55 GMT+08:00 Aurélien Bellet <
> aurelien.bel...@telecom-paristech.fr>:
>
>> Thanks a lot for the useful link and comments Alexis!
>>
>> First of all, the problem occurs without doing anything else in the code
>> (except of course loading my data from HDFS at the beginning) - so it
>> definitely comes from the shuffling. You're right, in the current version,
>> checkpoint files are not removed and take up some space in HDFS (this is
>> easy to fix). But this is negligible compared to the non-HDFS files, which
>> keep growing as iterations go. So I agree with you that this must come
>> from the shuffling operations: it seems that the shuffle files are not
>> removed during execution (they are only removed if I stop/kill the
>> application), despite the use of checkpoint.
>>
>> The class you mentioned is very interesting but I did not find a way to
>> use it from pyspark. I will try to implement my own version, looking at the
>> source code. But besides the queueing and removing of checkpoint files, I
>> do not really see anything special there that could solve my issue.
>>
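
A rough sketch of such a home-grown periodic checkpointer in PySpark (the
idea mirrors PeriodicRDDCheckpointer, but the class name, queueing policy and
local-file cleanup below are my own assumptions, not that class's actual
behaviour):

    import shutil

    class SimplePeriodicCheckpointer(object):
        """Cache each update, checkpoint every `interval` updates, and drop
        caches / old checkpoint files as newer ones arrive."""

        def __init__(self, sc, interval, checkpoint_dir):
            sc.setCheckpointDir(checkpoint_dir)
            self.interval = interval
            self.updates = 0
            self.cached = []        # RDDs still persisted
            self.checkpointed = []  # RDDs whose files may later be deleted

        def update(self, rdd):
            self.updates += 1
            rdd.cache()
            self.cached.append(rdd)
            if self.updates % self.interval == 0:
                rdd.checkpoint()
                rdd.count()                        # force materialization
                self.checkpointed.append(rdd)
                while len(self.checkpointed) > 2:  # keep the two most recent
                    old = self.checkpointed.pop(0)
                    path = old.getCheckpointFile()
                    if path and path.startswith("file:"):
                        # local checkpoint dir only; HDFS would need its own delete
                        shutil.rmtree(path.replace("file:", ""), ignore_errors=True)
            while len(self.cached) > 1:            # unpersist all but the latest
                self.cached.pop(0).unpersist()
            return rdd

    # hypothetical usage:
    #   cp = SimplePeriodicCheckpointer(sc, 10, "/tmp/spark-checkpoints")
    #   for i in range(1000):
    #       data = cp.update(data.repartition(50))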
>> I will continue to investigate this. Just found out I can use a command
>> line browser to look at the webui (I cannot access the server in graphical
>> display mode), this should help me understand what's going on. I will also
>> try the workarounds mentioned in the link. Keep you posted.
>>
>> Again, thanks a lot!
>>
>> Best,
>>
>> Aurelien
>>
>>
>> On 02/09/2015 14:15, alexis GILLAIN wrote:
>>
>>> Aurélien,
>>>
>>> From what you're saying, I can think of a couple of things, considering
>>> I don't know what you are doing in the rest of the code:
>>>
>>> - There are a lot of non-HDFS writes; they come from the rest of your code
>>> and/or repartition(). Repartition involves a shuffle and the creation of
>>> files on disk. I would have said the problem comes from that, but I
>>> just checked and checkpoint() is supposed to delete shuffle files:
>>>
>>> https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
>>> (this looks exactly like your problem, so you could maybe try the other
>>> workarounds)
>>> Still, you may do a lot of shuffling in the rest of the code (you can
>>> see the amount of shuffle files written in the webui) and consider
>>> increasing the available disk space... if you can do that.
>>>
>>> - On the HDFS side, the class I pointed to has an update function which
>>> "automatically handles persisting and (optionally) checkpointing, as
>>> well as unpersisting and removing checkpoint files". I am not sure your
>>> checkpointing method removes previous checkpoint files.
>>>
>>> In the end, does the disk space error come from HDFS growing or the
>>> local disk growing?
>>>
>>> You should check the webui to identify which tasks spill data on disk
>>> and verify if the shuffle files are properly deleted when you checkpoint
>>> your rdd.
>>>
>>>
>>> Regards,
>>>
>>>
>>> 2015-09-01 22:48 GMT+08:00 Aurélien Bellet
>>> <aurelien.bel...@telecom-paristech.fr>:
>>>
>>>
>>>     Dear Alexis,
>>>
>>>     Thanks again for your reply. After reading about checkpointing I
>>>     have modified my sample code as follows:
>>>
>>>     for i in range(1000):
>>>          print i
>>>          data2=data.repartition(50).cache()
>>>          if (i+1) % 10 == 0:
>>>              data2.checkpoint()
>>>          data2.first() # materialize rdd
>>>          data.unpersist() # unpersist previous version
>>>          data=data2
>>>
>>>     The data is checkpointed every 10 iterations to a directory that I
>>>     specified. While this seems to improve things a little bit, there is
>>>     still a lot of writing on disk (appcache directory, shown as "non
>>>     HDFS files" in Cloudera Manager) *besides* the checkpoint files
>>>     (which are regular HDFS files), and the application eventually runs
>>>     out of disk space. The same is true even if I checkpoint at every
>>>     iteration.
>>>
>>>     What am I doing wrong? Maybe some garbage collector setting?
>>>
>>>     Thanks a lot for the help,
>>>
>>>     Aurelien
>>>
>>>     On 24/08/2015 10:39, alexis GILLAIN wrote:
>>>
>>>         Hi Aurelien,
>>>
>>>         The first code should create a new RDD in memory at each
>>> iteration
>>>         (check the webui).
>>>         The second code will unpersist the RDD but that's not the main
>>>         problem.
>>>
>>>         I think you have trouble due to a long lineage, as .cache() keeps
>>>         track of the lineage for recovery.
>>>         You should have a look at checkpointing :
>>>
>>> https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md
>>>
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointer.scala
>>>
>>>         You can also have a look at the code of other iterative
>>>         algorithms in mllib for best practices.
>>>
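
A small check of the lineage point above (placeholder data, local mode): after
checkpoint() plus an action, toDebugString() no longer shows the chain of
repartitions, so recovery no longer replays them:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lineage-sketch")
    sc.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder

    data = sc.parallelize(range(1000), 4)
    for i in range(5):
        data = data.repartition(4).cache()
    print data.toDebugString()   # long lineage of shuffled RDDs

    data.checkpoint()
    data.count()                 # materialize so the checkpoint gets written
    print data.toDebugString()   # lineage now rooted at the checkpoint data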
>>>         2015-08-20 17:26 GMT+08:00 abellet
>>>         <aurelien.bel...@telecom-paristech.fr>:
>>>
>>>              Hello,
>>>
>>>              For the need of my application, I need to periodically
>>>         "shuffle" the
>>>              data
>>>              across nodes/partitions of a reasonably-large dataset. This
>>>         is an
>>>              expensive
>>>              operation but I only need to do it every now and then.
>>>         However it
>>>              seems that
>>>              I am doing something wrong because as the iterations go the
>>>         memory usage
>>>              increases, causing the job to spill onto HDFS, which
>>>         eventually gets
>>>              full. I
>>>              am also getting some "Lost executor" errors that I don't
>>>         get if I don't
>>>              repartition.
>>>
>>>              Here's a basic piece of code which reproduces the problem:
>>>
>>>              data =
>>>         sc.textFile("ImageNet_gist_train.txt",50).map(parseLine).cache()
>>>              data.count()
>>>              for i in range(1000):
>>>                       data=data.repartition(50).persist()
>>>                       # below several operations are done on data
>>>
>>>
>>>              What am I doing wrong? I tried the following but it doesn't
>>>         solve
>>>              the issue:
>>>
>>>              for i in range(1000):
>>>                       data2=data.repartition(50).persist()
>>>                       data2.count() # materialize rdd
>>>                       data.unpersist() # unpersist previous version
>>>                       data=data2
>>>
>>>
>>>              Help and suggestions on this would be greatly appreciated!
>>>         Thanks a lot!
>>>
>>>
>>>
>>>
>>>              --
>>>              View this message in context:
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Memory-efficient-successive-calls-to-repartition-tp24358.html
>>>              Sent from the Apache Spark User List mailing list archive
>>>         at Nabble.com.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
with Regards
Shahid Ashraf
