[jira] [Comment Edited] (SPARK-24474) Cores are left idle when there are a lot of tasks to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532910#comment-16532910 ] Al M edited comment on SPARK-24474 at 7/4/18 4:17 PM: -- My initial tests suggest that this stops the issue from happening. Thanks! I will perform more tests to make 100% sure that it does not still occur. I am surprised that this config makes a difference. My tasks are usually quite big; normally taking about a minute each. I would not have expected a change from waiting 3s per task to 0s per task to make such a huge difference. Do you know if there is any unexpected logic around this config setting? was (Author: alrocks46): My initial tests suggest that this stops the issue from happening. Thanks! I will perform more tests to make 100% sure that it does not still occur. I am surprised that this config makes a difference. My tasks are usually quite big; normally taking about a minute each. I would not have expected a change from waiting 3s per task to 0s per task to make such a huge difference. Do you know if there is any unusual behaviour around this config setting? > Cores are left idle when there are a lot of tasks to run > > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled > * The stage that hangs (referred to as "Second stage" above) has a lower > 'Stage Id' than the first one that completes > * This happens with spark.shuffle.service.enabled set to true and false -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
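The setting under discussion is not named in the thread, but the "waiting 3s per task" remark matches the default of spark.locality.wait, so the sketch below assumes that is the config that was changed to 0s. This is an assumption, not something confirmed in the ticket.
{code:java}
// Assumption: the config discussed above is spark.locality.wait (default 3s).
// Setting it to 0 disables delay scheduling, so free cores take the next
// pending task immediately instead of waiting for a data-local slot.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LocalityWaitSketch")
  .config("spark.locality.wait", "0s")
  .getOrCreate()
{code}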
[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of tasks to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532910#comment-16532910 ] Al M commented on SPARK-24474: -- My initial tests suggest that this stops the issue from happening. Thanks! I will perform more tests to make 100% sure that it does not still occur. I am surprised that this config makes a difference. My tasks are usually quite big; normally taking about a minute each. I would not have expected a change from waiting 3s per task to 0s per task to make such a huge difference. Do you know if there is any unusual behaviour around this config setting? > Cores are left idle when there are a lot of tasks to run > > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled > * The stage that hangs (referred to as "Second stage" above) has a lower > 'Stage Id' than the first one that completes > * This happens with spark.shuffle.service.enabled set to true and false -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)
[ https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532534#comment-16532534 ] Al M commented on SPARK-13127: -- Would be great to get this resolved in Spark 2.3.2. Especially since Parquet 1.9 supports delta encoding: https://issues.apache.org/jira/browse/PARQUET-225 > Upgrade Parquet to 1.9 (Fixes parquet sorting) > -- > > Key: SPARK-13127 > URL: https://issues.apache.org/jira/browse/SPARK-13127 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Justin Pihony >Priority: Major > > Currently, when you write a sorted DataFrame to Parquet, then reading the > data back out is not sorted by default. [This is due to a bug in > Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in > 1.9. > There is a workaround to read the file back in using a file glob (filepath/*). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
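A minimal sketch of the glob workaround mentioned above; the output path is a placeholder, and whether the data comes back sorted still depends on the PARQUET-241 behaviour described in the ticket.
{code:java}
// Workaround sketch: read the previously written, sorted output back via a
// file glob rather than the bare directory path. "/path/to/sorted-output" is illustrative.
val reread = spark.read.parquet("/path/to/sorted-output/*")
{code}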
[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of tasks to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24474: - Summary: Cores are left idle when there are a lot of tasks to run (was: Cores are left idle when there are a lot of stages to run) > Cores are left idle when there are a lot of tasks to run > > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled > * The stage that hangs (referred to as "Second stage" above) has a lower > 'Stage Id' than the first one that completes > * This happens with spark.shuffle.service.enabled set to true and false -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507965#comment-16507965 ] Al M commented on SPARK-24474: -- Also tried changing spark.scheduler.mode to "FIFO"; that didn't fix the problem either. > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled > * The stage that hangs (referred to as "Second stage" above) has a lower > 'Stage Id' than the first one that completes > * This happens with spark.shuffle.service.enabled set to true and false -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
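For reference, spark.scheduler.mode controls how concurrent jobs within one application are scheduled (FIFO is the default; FAIR enables fair-scheduling pools). A minimal sketch of how it would have been set for this test:
{code:java}
// Sketch: setting the scheduler mode when building the session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SchedulerModeTest")
  .config("spark.scheduler.mode", "FIFO")   // or "FAIR"
  .getOrCreate()
{code}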
[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504730#comment-16504730 ] Al M commented on SPARK-24474: -- Sample code that reproduces this: {code:java} val builder = SparkSession.builder() val spark = builder.appName("IdleCores").getOrCreate() val optimalMinPartitions = 1 spark.conf.set("spark.default.parallelism", optimalMinPartitions) spark.conf.set("spark.sql.shuffle.partitions", optimalMinPartitions) val input1 = spark.read.parquet("/input1") //Must be big val input2 = spark.read.parquet("/input2") // Must be 10x as big as input1 val joined = input1.join(input2, Seq("key1", "key2")) joined.agg(Map("somevalue" -> "SUM")).show() {code} > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled > * The stage that hangs (referred to as "Second stage" above) has a lower > 'Stage Id' than the first one that completes > * This happens with spark.shuffle.service.enabled set to true and false -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24474: - Description: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled * It happens when dynamic allocation is enabled, and when it is disabled * The stage that hangs (referred to as "Second stage" above) has a lower 'Stage Id' than the first one that completes * This happens with spark.shuffle.service.enabled set to true and false was: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. 
Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled * It happens when dynamic allocation is enabled, and when it is disabled * The stage that hangs (referred to as "Second stage" above) has a lower 'Stage Id' than the first one that completes > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores
[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24474: - Description: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled * It happens when dynamic allocation is enabled, and when it is disabled * The stage that hangs (referred to as "Second stage" above) has a lower 'Stage Id' than the first one that completes was: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. 
Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled * It happens when dynamic allocation is enabled, and when it is disabled > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled > * The
[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503225#comment-16503225 ] Al M commented on SPARK-24474: -- I appreciate that 2.2.0 is slightly old but I couldn't see any scheduler fixes in later versions that sounded like this. > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24474: - Description: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled * It happens when dynamic allocation is enabled, and when it is disabled was: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. 
one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled > * It happens when dynamic allocation is enabled, and when it is disabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Created] (SPARK-24474) Cores are left idle when there are a lot of stages to run
Al M created SPARK-24474: Summary: Cores are left idle when there are a lot of stages to run Key: SPARK-24474 URL: https://issues.apache.org/jira/browse/SPARK-24474 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.2.0 Reporter: Al M I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to other stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run
[ https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24474: - Description: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to existing stages * Once all active stages are done, we release all cores to new stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled was: I've observed an issue happening consistently when: * A job contains a join of two datasets * One dataset is much larger than the other * Both datasets require some processing before they are joined What I have observed is: * 2 stages are initially active to run processing on the two datasets ** These stages are run in parallel ** One stage has significantly more tasks than the other (e.g. one has 30k tasks and the other has 2k tasks) ** Spark allocates a similar (though not exactly equal) number of cores to each stage * First stage completes (for the smaller dataset) ** Now there is only one stage running ** It still has many tasks left (usually > 20k tasks) ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103) ** This continues until the second stage completes * Second stage completes, and third begins (the stage that actually joins the data) ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active tasks = 200) Other interesting things about this: * It seems that when we have multiple stages active, and one of them finishes, it does not actually release any cores to other stages * I can't reproduce this locally on my machine, only on a cluster with YARN enabled > Cores are left idle when there are a lot of stages to run > - > > Key: SPARK-24474 > URL: https://issues.apache.org/jira/browse/SPARK-24474 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Al M >Priority: Major > > I've observed an issue happening consistently when: > * A job contains a join of two datasets > * One dataset is much larger than the other > * Both datasets require some processing before they are joined > What I have observed is: > * 2 stages are initially active to run processing on the two datasets > ** These stages are run in parallel > ** One stage has significantly more tasks than the other (e.g. 
one has 30k > tasks and the other has 2k tasks) > ** Spark allocates a similar (though not exactly equal) number of cores to > each stage > * First stage completes (for the smaller dataset) > ** Now there is only one stage running > ** It still has many tasks left (usually > 20k tasks) > ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = > 103) > ** This continues until the second stage completes > * Second stage completes, and third begins (the stage that actually joins > the data) > ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active > tasks = 200) > Other interesting things about this: > * It seems that when we have multiple stages active, and one of them > finishes, it does not actually release any cores to existing stages > * Once all active stages are done, we release all cores to new stages > * I can't reproduce this locally on my machine, only on a cluster with YARN > enabled -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy)
[ https://issues.apache.org/jira/browse/SPARK-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24306: - Summary: Sort a Dataset with a lambda (like RDD.sortBy) (was: Sort a Dataset with a lambda (like RDD.sortBy() )) > Sort a Dataset with a lambda (like RDD.sortBy) > -- > > Key: SPARK-24306 > URL: https://issues.apache.org/jira/browse/SPARK-24306 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Al M >Priority: Minor > > Dataset has many useful functions that do not require hard coded column names > (e.g. flatMapGroups, map). Currently the sort() function requires us to pass > the column name as a string. > It would be nice to have a function similar to the sortBy in RDD where i can > define the keys in a lambda e.g. > {code:java} > ds.sortBy(record => record.id){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
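Until a lambda-based sort exists on Dataset, a couple of hedged alternatives, assuming a Dataset[Record] where Record has an id field (the names are illustrative):
{code:java}
// Alternatives to the requested ds.sortBy(record => record.id):
import org.apache.spark.sql.functions.col

val byColumn = ds.sort(col("id"))    // column-name based, as the ticket describes
val viaRdd   = ds.rdd.sortBy(_.id)   // RDD-style lambda sort, at the cost of leaving the Dataset API
{code}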
[jira] [Updated] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy() )
[ https://issues.apache.org/jira/browse/SPARK-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-24306: - Summary: Sort a Dataset with a lambda (like RDD.sortBy() ) (was: Sort a Dataset with a lambda (like RDD.sortBy()) > Sort a Dataset with a lambda (like RDD.sortBy() ) > - > > Key: SPARK-24306 > URL: https://issues.apache.org/jira/browse/SPARK-24306 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Al M >Priority: Minor > > Dataset has many useful functions that do not require hard coded column names > (e.g. flatMapGroups, map). Currently the sort() function requires us to pass > the column name as a string. > It would be nice to have a function similar to the sortBy in RDD where i can > define the keys in a lambda e.g. > {code:java} > ds.sortBy(record => record.id){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy()
Al M created SPARK-24306: Summary: Sort a Dataset with a lambda (like RDD.sortBy() Key: SPARK-24306 URL: https://issues.apache.org/jira/browse/SPARK-24306 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Al M Dataset has many useful functions that do not require hard coded column names (e.g. flatMapGroups, map). Currently the sort() function requires us to pass the column name as a string. It would be nice to have a function similar to the sortBy in RDD where i can define the keys in a lambda e.g. {code:java} ds.sortBy(record => record.id){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-14532) Spark SQL IF/ELSE does not handle Double correctly
[ https://issues.apache.org/jira/browse/SPARK-14532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M closed SPARK-14532. Resolution: Fixed Fix Version/s: 2.0.0 > Spark SQL IF/ELSE does not handle Double correctly > -- > > Key: SPARK-14532 > URL: https://issues.apache.org/jira/browse/SPARK-14532 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Al M > Fix For: 2.0.0 > > > I am using Spark SQL to add new columns to my data. Below is an example > snipped in Scala: > {code}myDF.withColumn("newcol", new > Column(SqlParser.parseExpression(sparkSqlExpr))).show{code} > *What Works* > If sparkSqlExpr = "IF(1=1, 1, 0)" then i see 1 in the result as expected. > If sparkSqlExpr = "IF(1=1, 1.0, 1.5)" then i see 1.0 in the result as > expected. > If sparkSqlExpr = "IF(1=1, 'A', 'B')" then i see 'A' in the result as > expected. > *What does not Work* > If sparkSqlExpr = "IF(1=1, 1.0, 0.0)" then I see error > org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else > 0.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else > 0.0' (decimal(2,1) and decimal(1,1)).; > If sparkSqlExpr = "IF(1=1, 1.0, 10.0)" then I see error If sparkSqlExpr = > "IF(1=1, 1.0, 0.0)" then I see error > org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else > 10.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else > 10.0' (decimal(2,1) and decimal(3,1)).; > If sparkSqlExpr = "IF(1=1, 1.1, 1.11)" then I see error > org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.1 else > 1.11' due to data type mismatch: differing types in 'if ((1 = 1)) 1.1 else > 1.11' (decimal(2,1) and decimal(3,2)).; > It looks like the Spark SQL typing system is seeing doubles as different > types depending on the number of digits before and after the decimal point -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14532) Spark SQL IF/ELSE does not handle Double correctly
[ https://issues.apache.org/jira/browse/SPARK-14532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237058#comment-15237058 ] Al M commented on SPARK-14532: -- Thanks Bo. I'll look forward to the 2.0 release :) > Spark SQL IF/ELSE does not handle Double correctly > -- > > Key: SPARK-14532 > URL: https://issues.apache.org/jira/browse/SPARK-14532 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Al M > Fix For: 2.0.0 > > > I am using Spark SQL to add new columns to my data. Below is an example > snipped in Scala: > {code}myDF.withColumn("newcol", new > Column(SqlParser.parseExpression(sparkSqlExpr))).show{code} > *What Works* > If sparkSqlExpr = "IF(1=1, 1, 0)" then i see 1 in the result as expected. > If sparkSqlExpr = "IF(1=1, 1.0, 1.5)" then i see 1.0 in the result as > expected. > If sparkSqlExpr = "IF(1=1, 'A', 'B')" then i see 'A' in the result as > expected. > *What does not Work* > If sparkSqlExpr = "IF(1=1, 1.0, 0.0)" then I see error > org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else > 0.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else > 0.0' (decimal(2,1) and decimal(1,1)).; > If sparkSqlExpr = "IF(1=1, 1.0, 10.0)" then I see error If sparkSqlExpr = > "IF(1=1, 1.0, 0.0)" then I see error > org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else > 10.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else > 10.0' (decimal(2,1) and decimal(3,1)).; > If sparkSqlExpr = "IF(1=1, 1.1, 1.11)" then I see error > org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.1 else > 1.11' due to data type mismatch: differing types in 'if ((1 = 1)) 1.1 else > 1.11' (decimal(2,1) and decimal(3,2)).; > It looks like the Spark SQL typing system is seeing doubles as different > types depending on the number of digits before and after the decimal point -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14532) Spark SQL IF/ELSE does not handle Double correctly
Al M created SPARK-14532: Summary: Spark SQL IF/ELSE does not handle Double correctly Key: SPARK-14532 URL: https://issues.apache.org/jira/browse/SPARK-14532 Project: Spark Issue Type: Bug Affects Versions: 1.6.1 Reporter: Al M I am using Spark SQL to add new columns to my data. Below is an example snippet in Scala: {code}myDF.withColumn("newcol", new Column(SqlParser.parseExpression(sparkSqlExpr))).show{code} *What Works* If sparkSqlExpr = "IF(1=1, 1, 0)" then I see 1 in the result as expected. If sparkSqlExpr = "IF(1=1, 1.0, 1.5)" then I see 1.0 in the result as expected. If sparkSqlExpr = "IF(1=1, 'A', 'B')" then I see 'A' in the result as expected. *What does not Work* If sparkSqlExpr = "IF(1=1, 1.0, 0.0)" then I see error org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else 0.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else 0.0' (decimal(2,1) and decimal(1,1)).; If sparkSqlExpr = "IF(1=1, 1.0, 10.0)" then I see error org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else 10.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else 10.0' (decimal(2,1) and decimal(3,1)).; If sparkSqlExpr = "IF(1=1, 1.1, 1.11)" then I see error org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.1 else 1.11' due to data type mismatch: differing types in 'if ((1 = 1)) 1.1 else 1.11' (decimal(2,1) and decimal(3,2)).; It looks like the Spark SQL typing system is seeing doubles as different types depending on the number of digits before and after the decimal point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
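A hedged workaround sketch, not taken from the ticket: casting both branches to DOUBLE keeps the analyzer from seeing two different decimal(p,s) types. The expr function used here is the public counterpart of the internal SqlParser call in the report.
{code:java}
// Workaround sketch (assumption): force both IF branches to DOUBLE.
import org.apache.spark.sql.functions.expr

val sparkSqlExpr = "IF(1=1, CAST(1.0 AS DOUBLE), CAST(0.0 AS DOUBLE))"
myDF.withColumn("newcol", expr(sparkSqlExpr)).show()
{code}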
[jira] [Commented] (SPARK-9872) Allow passing of 'numPartitions' to DataFrame joins
[ https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701158#comment-14701158 ] Al M commented on SPARK-9872: - I would also be happy if we just get the partition count from the parents. That would be even better than setting it manually since that's all my code would be doing anyway. Right now it is always using a fixed default number, which isn't much use if my spark application uses lots of different files of different sizes. Allow passing of 'numPartitions' to DataFrame joins --- Key: SPARK-9872 URL: https://issues.apache.org/jira/browse/SPARK-9872 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1 Reporter: Al M Priority: Minor When I join two normal RDDs, I can set the number of shuffle partitions in the 'numPartitions' argument. When I join two DataFrames I do not have this option. My spark job loads in 2 large files and 2 small files. When I perform a join, this will use the spark.sql.shuffle.partitions to determine the number of partitions. This means that the join with my small files will use exactly the same number of partitions as the join with my large files. I can either use a low number of partitions and run out of memory on my large join, or use a high number of partitions and my small join will take far too long. If we were able to specify the number of shuffle partitions in a DataFrame join like in an RDD join, this would not be an issue. My long term ideal solution would be dynamic partition determination as described in SPARK-4630. However I appreciate that it is not particularly easy to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
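A hedged sketch of the usual interim workaround: adjust spark.sql.shuffle.partitions around each join. Because DataFrames are lazy, the value in force when the action runs is the one that counts, so each join is materialized before the setting changes; the DataFrame names and partition counts below are illustrative.
{code:java}
// Workaround sketch (assumption, not from the ticket):
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")   // for the join of the two large files
val bigJoined = bigDF1.join(bigDF2, "key").cache()
bigJoined.count()                                            // force execution while 2000 is in effect

sqlContext.setConf("spark.sql.shuffle.partitions", "200")    // for the join of the two small files
val smallJoined = smallDF1.join(smallDF2, "key").cache()
smallJoined.count()
{code}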
[jira] [Updated] (SPARK-9872) Allow passing of 'numPartitions' to DataFrame joins
[ https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-9872: Description: When I join two normal RDDs, I can set the number of shuffle partitions in the 'numPartitions' argument. When I join two DataFrames I do not have this option. My spark job loads in 2 large files and 2 small files. When I perform a join, this will use the spark.sql.shuffle.partitions to determine the number of partitions. This means that the join with my small files will use exactly the same number of partitions as the join with my large files. I can either use a low number of partitions and run out of memory on my large join, or use a high number of partitions and my small join will take far too long. If we were able to specify the number of shuffle partitions in a DataFrame join like in an RDD join, this would not be an issue. My long term ideal solution would be dynamic partition determination as described in SPARK-4630. However I appreciate that it is not particularly easy to do. was: When I join two normal RDDs, I can set the number of shuffle partitions in the 'numPartitions' argument. When I join two DataFrames I do not have this option. My spark job loads in 2 large files and 2 small files. When I perform a join, this will use the spark.sql.shuffle.partitions to determine the number of partitions. This means that the join with my small files will use exactliy the same number of partitions as the join with my large files. I can either use a low number of partitions and run out of memory on my large join, or use a high number of partitions and my small join will take far too long. If we were able to specify the number of shuffle partitions in a DataFrame join like in an RDD join, this would not be an issue. My long term ideal solution would be dynamic partition determination as described in SPARK-4630. However I appreciate that it is not particularly easy to do. Allow passing of 'numPartitions' to DataFrame joins --- Key: SPARK-9872 URL: https://issues.apache.org/jira/browse/SPARK-9872 Project: Spark Issue Type: Improvement Affects Versions: 1.4.1 Reporter: Al M Priority: Minor When I join two normal RDDs, I can set the number of shuffle partitions in the 'numPartitions' argument. When I join two DataFrames I do not have this option. My spark job loads in 2 large files and 2 small files. When I perform a join, this will use the spark.sql.shuffle.partitions to determine the number of partitions. This means that the join with my small files will use exactly the same number of partitions as the join with my large files. I can either use a low number of partitions and run out of memory on my large join, or use a high number of partitions and my small join will take far too long. If we were able to specify the number of shuffle partitions in a DataFrame join like in an RDD join, this would not be an issue. My long term ideal solution would be dynamic partition determination as described in SPARK-4630. However I appreciate that it is not particularly easy to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9872) Allow passing of 'numPartitions' to DataFrame joins
Al M created SPARK-9872: --- Summary: Allow passing of 'numPartitions' to DataFrame joins Key: SPARK-9872 URL: https://issues.apache.org/jira/browse/SPARK-9872 Project: Spark Issue Type: Improvement Affects Versions: 1.4.1 Reporter: Al M Priority: Minor When I join two normal RDDs, I can set the number of shuffle partitions in the 'numPartitions' argument. When I join two DataFrames I do not have this option. My spark job loads in 2 large files and 2 small files. When I perform a join, this will use the spark.sql.shuffle.partitions to determine the number of partitions. This means that the join with my small files will use exactliy the same number of partitions as the join with my large files. I can either use a low number of partitions and run out of memory on my large join, or use a high number of partitions and my small join will take far too long. If we were able to specify the number of shuffle partitions in a DataFrame join like in an RDD join, this would not be an issue. My long term ideal solution would be dynamic partition determination as described in SPARK-4630. However I appreciate that it is not particularly easy to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318088#comment-14318088 ] Al M commented on SPARK-5768: - So when it says *Memory Used* 3.2GB / 20GB, it actually means we are using 3.2GB of memory for caching out of a total 20GB available for caching? Calling the column 'Storage Memory' would be clearer to me. If changing the heading of the column is not an option, then a tooltip explaining that it refers to memory used for storage would also help. I'd find it pretty useful to have another column that shows my total memory usage. Right now I can only see this by running 'free' or 'top' on every machine or looking at the Yarn UI. Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
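A hedged back-of-envelope for why roughly 20GB shows up for a 40g executor: under the legacy memory model in Spark 1.2, the UI's figure is the storage (caching) memory, not the whole heap. The fraction values below are the assumed defaults, not something stated in the thread.
{code:java}
// Assumed Spark 1.2 defaults: spark.storage.memoryFraction = 0.6, spark.storage.safetyFraction = 0.9
val executorHeapGb  = 40.0
val storageFraction = 0.6
val safetyFraction  = 0.9
println(executorHeapGb * storageFraction * safetyFraction)   // ≈ 21.6 GB reserved for storage
// The UI reports a little less (e.g. 20.3 GB) because Runtime.getRuntime.maxMemory()
// comes out somewhat below the configured -Xmx.
{code}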
[jira] [Created] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
Al M created SPARK-5768: --- Summary: Spark UI Shows incorrect memory under Yarn Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.1, 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279962#comment-14279962 ] Al M commented on SPARK-5270: - Good point; it's not a catch-all solution. The rdd.partitions.size solution does work well in the case of empty RDDs created by Spark streaming. Elegantly check if RDD is empty --- Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
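A minimal sketch of the streaming workaround being discussed, with a socket source and output path chosen purely for illustration:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("skip-empty"), Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  // 0 partitions => the batch received no blocks, so skip it without paying for a count()
  if (rdd.partitions.size > 0) {
    rdd.saveAsTextFile(s"/tmp/out-${System.currentTimeMillis}")
  }
}

ssc.start()
ssc.awaitTermination()
{code}
As noted above, this guard is not a catch-all: an RDD can have partitions yet contain no records, so it only helps for sources (such as empty streaming batches) that produce zero partitions.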
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280260#comment-14280260 ] Al M commented on SPARK-5270: - I don't mind at all. I'd be really happy to have such a utility method in Spark. Elegantly check if RDD is empty --- Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278983#comment-14278983 ] Al M commented on SPARK-5270: - I just noticed that rdd.partitions.size is set to 0 for empty RDDs and greater than 0 for RDDs with data; this is a far more elegant check than the others. Elegantly check if RDD is empty --- Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5270) Elegantly check if RDD is empty
Al M created SPARK-5270: --- Summary: Elegantly check if RDD is empty Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 This is especially a problem when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can also run first() and catch the exception; this is neither a clean nor fast solution. I'd like a method rdd.isEmpty that returns a boolean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-5270: Description: Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. was: Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 This is especially a problem when using streams. Sometimes my batches are huge in one stream, sometimes i get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can also run first() and catch the exception; this is neither a clean nor fast solution. I'd like a method rdd.isEmpty that returns a boolean. Elegantly check if RDD is empty --- Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269263#comment-14269263 ] Al M commented on SPARK-5137: - That's right. {code}a{code} has 11 partitions and {code}b{code} has a lot more. I can see why you wouldn't want to force a shuffle on {code}a{code} since that's unnecessary processing. Thanks for your detailed explanation and quick response. I'll close this since I agree that it behaves correctly. subtract does not take the spark.default.parallelism into account - Key: SPARK-5137 URL: https://issues.apache.org/jira/browse/SPARK-5137 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: CENTOS 6; scala Reporter: Al M Priority: Trivial The 'subtract' function (PairRDDFunctions.scala) in scala does not use the default parallelism value set in the config (spark.default.parallelism). This is easy enough to work around. I can just load the property and pass it in as an argument. It would be great if subtract used the default value, just like all the other PairRDDFunctions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269263#comment-14269263 ] Al M edited comment on SPARK-5137 at 1/8/15 12:30 PM: -- That's right. a has 11 partitions and b has a lot more. I can see why you wouldn't want to force a shuffle on a since that's unnecessary processing. Thanks for your detailed explanation and quick response. I'll close this since I agree that it behaves correctly. was (Author: alrocks47): That's right. _a_ has 11 partitions and _b_ has a lot more. I can see why you wouldn't want to force a shuffle on _a_ since that's unnecessary processing. Thanks for your detailed explanation and quick response. I'll close this since I agree that it behaves correctly. subtract does not take the spark.default.parallelism into account - Key: SPARK-5137 URL: https://issues.apache.org/jira/browse/SPARK-5137 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: CENTOS 6; scala Reporter: Al M Priority: Trivial The 'subtract' function (PairRDDFunctions.scala) in scala does not use the default parallelism value set in the config (spark.default.parallelism). This is easy enough to work around. I can just load the property and pass it in as an argument. It would be great if subtract used the default value, just like all the other PairRDDFunctions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269263#comment-14269263 ] Al M edited comment on SPARK-5137 at 1/8/15 12:30 PM: -- That's right. _a_ has 11 partitions and _b_ has a lot more. I can see why you wouldn't want to force a shuffle on _a_ since that's unnecessary processing. Thanks for your detailed explanation and quick response. I'll close this since I agree that it behaves correctly. was (Author: alrocks47): That's right. {code}a{code} has 11 partitions and {code}b{code} has a lot more. I can see why you wouldn't want to force a shuffle on {code}a{code} since that's unnecessary processing. Thanks for your detailed explanation and quick response. I'll close this since I agree that it behaves correctly. subtract does not take the spark.default.parallelism into account - Key: SPARK-5137 URL: https://issues.apache.org/jira/browse/SPARK-5137 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: CENTOS 6; scala Reporter: Al M Priority: Trivial The 'subtract' function (PairRDDFunctions.scala) in scala does not use the default parallelism value set in the config (spark.default.parallelism). This is easy enough to work around. I can just load the property and pass it in as an argument. It would be great if subtract used the default value, just like all the other PairRDDFunctions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M closed SPARK-5137. --- Resolution: Not a Problem subtract does not take the spark.default.parallelism into account - Key: SPARK-5137 URL: https://issues.apache.org/jira/browse/SPARK-5137 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: CENTOS 6; scala Reporter: Al M Priority: Trivial The 'subtract' function (PairRDDFunctions.scala) in scala does not use the default parallelism value set in the config (spark.default.parallelism). This is easy enough to work around. I can just load the property and pass it in as an argument. It would be great if subtract used the default value, just like all the other PairRDDFunctions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account
[ https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268969#comment-14268969 ] Al M commented on SPARK-5137: - Yes I do mean subtractByKey. Sorry for not being clear. I'm new to Spark and it could be that I just don't understand something correctly. Below is a more detailed description of the results I saw. I have default parallelism set to 160 since memory is limited and I am working with a lot of data. * Map is run [11 tasks] * Filter is run [2 tasks] * Join with another RDD and run map [160 tasks] * Join with another RDD and map again [160 tasks] * SubtractByKey is run [11 tasks] In the last step I run out of memory because subtractByKey was only split into 11 tasks. If I override the partitions to 160 then it works fine. I thought that subtractByKey would use the default parallelism just like the other tasks after the join. If the expected solution is that I override the partitions in my call, I'm fine with that. So far I have managed to avoid setting it in any calls by just setting the default parallelism instead. I was concerned that the behaviour I observed was part of an actual issue. subtract does not take the spark.default.parallelism into account - Key: SPARK-5137 URL: https://issues.apache.org/jira/browse/SPARK-5137 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: CENTOS 6; scala Reporter: Al M Priority: Trivial The 'subtract' function (PairRDDFunctions.scala) in scala does not use the default parallelism value set in the config (spark.default.parallelism). This is easy enough to work around. I can just load the property and pass it in as an argument. It would be great if subtract used the default value, just like all the other PairRDDFunctions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
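A minimal sketch of the workaround described above, with placeholder data: read the configured default parallelism back from the SparkContext and pass it to subtractByKey explicitly, since the no-argument overload follows the left-hand RDD's partition count rather than spark.default.parallelism:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.2

val conf = new SparkConf().setAppName("subtract-parallelism").set("spark.default.parallelism", "160")
val sc = new SparkContext(conf)

val a = sc.parallelize(1 to 1000, 11).map(i => (i, i))   // 11 partitions, like the mapped input above
val b = sc.parallelize(1 to 100000).map(i => (i, i))     // sized by spark.default.parallelism

val narrow = a.subtractByKey(b)                          // 11 tasks: follows a's partition count
val wide   = a.subtractByKey(b, sc.defaultParallelism)   // 160 tasks: explicit override
{code}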
[jira] [Created] (SPARK-5137) subtract does not take the spark.default.parallelism into account
Al M created SPARK-5137: --- Summary: subtract does not take the spark.default.parallelism into account Key: SPARK-5137 URL: https://issues.apache.org/jira/browse/SPARK-5137 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: CENTOS 6; scala Reporter: Al M Priority: Trivial The 'subtract' function (PairRDDFunctions.scala) in scala does not use the default parallelism value set in the config (spark.default.parallelism). This is easy enough to work around. I can just load the property and pass it in as an argument. It would be great if subtract used the default value, just like all the other PairRDDFunctions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org