[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-04-01 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390771#comment-14390771
 ] 

Antony Mayi commented on SPARK-6334:


bq. btw. I see from the source code that checkpointing should be happening every 
3 iterations - how come I don't see any drops in disk usage at least once every 
three iterations? it just seems to grow constantly... which worries me that even 
more frequent checkpointing won't help...

ok, I am now sure more frequent checkpointing alone is likely not going to help, 
just as it is not helping now - the disk usage just keeps growing even across the 
3-iteration checkpoints. I just tried a dirty hack - running a parallel thread 
that forces GC every x minutes - and suddenly I can see the disk space getting 
cleared after every three iterations, when the GC runs.

see this pattern - the first run is without forcing GC, and in the second one 
there are noticeable disk usage drops every three steps (ALS iterations):
!gc.png!

so what's really needed to get the shuffle files cleaned up upon checkpointing is 
forcing GC.

this was my dirty hack:

{code}
from threading import Thread, Event

from pyspark import StorageLevel
from pyspark.mllib.recommendation import ALS


class GC(Thread):
    """Background thread forcing a JVM GC on the driver every `period` seconds."""

    def __init__(self, context, period=600):
        Thread.__init__(self)
        self.context = context
        self.period = period
        self.daemon = True
        self.stopped = Event()

    def stop(self):
        self.stopped.set()

    def run(self):
        self.stopped.clear()
        while not self.stopped.is_set():
            self.stopped.wait(self.period)
            # trigger a GC in the driver JVM via py4j
            self.context._jvm.System.gc()


# sc is the existing SparkContext (pyspark shell / application)
sc.setCheckpointDir('/tmp')

gc = GC(sc)
gc.start()

training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)

gc.stop()
{code}

 spark-local dir not getting cleared during ALS
 --

 Key: SPARK-6334
 URL: https://issues.apache.org/jira/browse/SPARK-6334
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Antony Mayi
 Attachments: als-diskusage.png, gc.png


 when running a bigger ALS training, Spark spills loads of temp data into the 
 local dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
 on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run 
 out of space. in my case I have 12TB of available disk capacity before 
 kicking off the ALS, but it all gets used, and YARN kills the containers when 
 they reach 90%.
 even with all the recommended options (configuring checkpointing and forcing GC 
 when possible) it still doesn't get cleared.
 here is my (pseudo)code (pyspark):
 {code}
 sc.setCheckpointDir('/tmp')
 training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
 model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
 sc._jvm.System.gc()
 {code}
 the training RDD has about 3.5 billion items (~60GB on disk). after about 
 6 hours the ALS consumes all 12TB of disk space in local-dir data and 
 gets killed. my cluster has 192 cores and 1.5TB RAM, and for this task I am 
 using 37 executors with 4 cores / 28+4GB RAM each.
 this is a graph of the disk consumption pattern, showing the space being eaten 
 from 7% to 90% during the ALS (90% is when YARN kills the container):
 !als-diskusage.png!




[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-20 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371122#comment-14371122
 ] 

Antony Mayi commented on SPARK-6334:


bq. 2. Use a smaller number of blocks, even though you have more CPU cores. There 
is a trade-off between communication and computation. With k = 50, I think the 
communication still dominates.

thx, this has reduced the volume of the dumped shuffle data to 50%, so I can 
complete the job - very helpful!

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-17 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364764#comment-14364764
 ] 

Antony Mayi commented on SPARK-6334:


users: 12.5 million
ratings: 3.3 billion
rank: 50
iters: 15

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366021#comment-14366021
 ] 

Xiangrui Meng commented on SPARK-6334:
--

A couple of suggestions before SPARK-5955 is implemented:

1. Upgrade to Spark 1.3. ALS received a new implementation in 1.3 that reduces 
the shuffle size.
2. Use a smaller number of blocks, even though you have more CPU cores (see the 
sketch below). There is a trade-off between communication and computation. With 
k = 50, I think the communication still dominates.
3. Minor: build Spark with -Pnetlib-lgpl to include native BLAS/LAPACK libraries.
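
For illustration only, a minimal sketch of what suggestion 2 could look like 
applied to the pseudocode from the description; the block count of 148 
(37 executors x 4 cores) is an assumed example, not a value recommended anywhere 
in this thread:
{code}
from pyspark import StorageLevel
from pyspark.mllib.recommendation import ALS

# sc is assumed to be the existing SparkContext (e.g. the pyspark shell).
# Same pipeline as in the issue description, but with an explicit, smaller block
# count instead of blocks=-1 (auto): fewer blocks means less shuffle data per
# iteration, at the cost of more computation per block.
blocks = 148  # assumed example: roughly one block per core (37 executors x 4 cores)
training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=blocks, alpha=40)
{code}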

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363629#comment-14363629
 ] 

Xiangrui Meng commented on SPARK-6334:
--

https://issues.apache.org/jira/browse/SPARK-5955 is going to solve this issue. 
[~antonymayi] Could you share some more numbers about your test, e.g., the number 
of users, number of ratings, rank, and number of iterations? Thanks!

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362085#comment-14362085
 ] 

Joseph K. Bradley commented on SPARK-6334:
--

I'm not sure about that.  [~mengxr] ?

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361706#comment-14361706
 ] 

Sean Owen commented on SPARK-6334:
--

Do you have 12TB of disk available to the YARN local dir, or just 12TB in general?
You are setting the checkpoint dir to /tmp but talking about filling up the YARN 
local dir - are you sure it's not /tmp that fills up?
What are the files that are filling up the disk - shuffle files?
Did you try the ttl settings?

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-14 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361715#comment-14361715
 ] 

Antony Mayi commented on SPARK-6334:


bq. What are the files that are filling up the disk - shuffle files?
yes, it is all shuffle data.

bq. Did you try the ttl settings?
do you mean spark.cleaner.ttl? yes, but that leads to loss of data that is still 
required later, and ALS then fails when it tries to use it.
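
For reference, a minimal sketch of the ttl setting discussed here (the value is an 
arbitrary example). It is a blanket, time-based cleaner, which is exactly why it 
removes data that ALS still needs:
{code}
from pyspark import SparkConf, SparkContext

# spark.cleaner.ttl (Spark 1.x) periodically drops metadata and shuffle data older
# than the given number of seconds, whether or not it is still referenced - hence
# the failures described above when ALS goes back to older shuffle output.
conf = SparkConf().set("spark.cleaner.ttl", "3600")  # example value, in seconds
sc = SparkContext(conf=conf)
{code}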

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-14 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361712#comment-14361712
 ] 

Antony Mayi commented on SPARK-6334:


it is 12TB combined across all nodes, available to the YARN local-dirs.
the checkpoint dir is on HDFS (the /tmp used for checkpointing is actually 
hdfs:///tmp) and it is negligible in size. all the heavy volume growing during 
the ALS is on the local disks (not HDFS) under 
/diskX/yarn/local/usercache/antony.mayi/appcache/...

I have 4x500GB disks mounted as /diskX on each node. YARN is configured to use 
these disks for the local dirs:
{code:xml}
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:///disk1/yarn/local, file:///disk2/yarn/local, file:///disk3/yarn/local, file:///disk4/yarn/local</value>
</property>
{code}

the usage on each of the disks before starting the ALS is ~7%. during the ALS 
they all grow at roughly the same rate (all nodes, all disks) until the 90% 
threshold.
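
As an aside, the HDFS-vs-local-dir split can be made explicit by fully qualifying 
the checkpoint location (the path below is an assumed example):
{code}
# Checkpoints go to HDFS (small writes), while the shuffle files land in the
# yarn.nodemanager.local-dirs configured above (large).
sc.setCheckpointDir('hdfs:///tmp/als-checkpoints')  # assumed example path
{code}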

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361912#comment-14361912
 ] 

Sean Owen commented on SPARK-6334:
--

Hm, is this a case where it's necessary to cut the lineage fairly frequently 
with a persist(), so that older stages can be cleaned up safely? That might be 
the way forward if you have a long and very large computation.

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361914#comment-14361914
 ] 

Joseph K. Bradley commented on SPARK-6334:
--

I don't think persist() will eliminate shuffle data; I think checkpoint is 
necessary to do that.  I agree that checkpointing more frequently seems the way 
to go.

One side comment: You are using a lot of partitions for this size computing 
cluster.  I'd recommend using fewer partitions (between # workers and # cores) 
for ALS.  That may not fix the main issue but may help some.
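
To make the persist()-vs-checkpoint() distinction concrete, a minimal sketch on a 
plain RDD (not the ALS internals, which drive their own checkpointing); it assumes 
sc.setCheckpointDir() has already been called, as in the description:
{code}
rdd = sc.pickleFile('/tmp/dataset')  # path taken from the issue description

rdd.persist()     # caches the RDD's own blocks; upstream shuffle files are kept,
                  # since the full lineage is still needed for fault recovery
rdd.checkpoint()  # materializes the RDD under the checkpoint dir and truncates
                  # the lineage, so older shuffle files become eligible for
                  # cleanup once the driver drops the references
rdd.count()       # the checkpoint is actually written on the first action
{code}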

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-14 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361947#comment-14361947
 ] 

Antony Mayi commented on SPARK-6334:


I had to increase the partitioning to this level due to permanent OOM issues 
(GC overhead limit exceeded) - although there was enough RAM globally, the 
partitions were too big for the individual executors (I have 28GB RAM + 4GB of 
spark.yarn.executor.memoryOverhead per executor). with 768 partitions I got 
around the OOM problem.
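
For context, the executor layout described here corresponds roughly to the 
following configuration sketch (standard Spark-on-YARN settings; in practice they 
would normally be passed via spark-submit rather than hard-coded):
{code}
from pyspark import SparkConf, SparkContext

# Sketch of the 37 x (4 cores, 28 GB heap + 4 GB overhead) layout described above.
conf = (SparkConf()
        .set("spark.executor.instances", "37")
        .set("spark.executor.cores", "4")
        .set("spark.executor.memory", "28g")
        .set("spark.yarn.executor.memoryOverhead", "4096"))  # MB
sc = SparkContext(conf=conf)
{code}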

[jira] [Commented] (SPARK-6334) spark-local dir not getting cleared during ALS

2015-03-14 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361951#comment-14361951
 ] 

Antony Mayi commented on SPARK-6334:


btw. I see from the source code that checkpointing should be happening every 3 
iterations - how come I don't see any drops in disk usage at least once every 
three iterations? it just seems to grow constantly... which worries me that even 
more frequent checkpointing won't help...
