[jira] [Comment Edited] (SPARK-24474) Cores are left idle when there are a lot of tasks to run

2018-07-04 Thread Al M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532910#comment-16532910
 ] 

Al M edited comment on SPARK-24474 at 7/4/18 4:17 PM:
--

My initial tests suggest that this stops the issue from happening.  Thanks!  I 
will perform more tests to make 100% sure that it does not still occur.

 

I am surprised that this config makes a difference.  My tasks are usually quite 
big, normally taking about a minute each.  I would not have expected a change 
from waiting 3s per task to 0s per task to make such a huge difference.

Do you know if there is any unexpected logic around this config setting?


was (Author: alrocks46):
My initial tests suggest that this stops the issue from happening.  Thanks!  I 
will perform more tests to make 100% sure that it does not still occur.

 

I am surprised that this config makes a difference.  My tasks are usually quite 
big, normally taking about a minute each.  I would not have expected a change 
from waiting 3s per task to 0s per task to make such a huge difference.

Do you know if there is any unusual behaviour around this config setting?

> Cores are left idle when there are a lot of tasks to run
> 
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false






[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of tasks to run

2018-07-04 Thread Al M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532910#comment-16532910
 ] 

Al M commented on SPARK-24474:
--

My initial tests suggest that this stops the issue from happening.  Thanks!  I 
will perform more tests to make 100% sure that it does not still occur.

 

I am surprised that this config makes a difference.  My tasks are usually quite 
big, normally taking about a minute each.  I would not have expected a change 
from waiting 3s per task to 0s per task to make such a huge difference.

Do you know if there is any unusual behaviour around this config setting?
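
A minimal sketch of the change being tested, assuming the setting in question is 
spark.locality.wait (the thread never names it; its 3s default matches the 
"3s per task" mentioned above):

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: assumes the config under discussion is spark.locality.wait.
// Setting it to 0s stops the scheduler from holding a free core while it
// waits for a data-local task to become available.
val spark = SparkSession.builder()
  .appName("IdleCores")
  .config("spark.locality.wait", "0s")
  .getOrCreate()
{code}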

> Cores are left idle when there are a lot of tasks to run
> 
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false






[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)

2018-07-04 Thread Al M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532534#comment-16532534
 ] 

Al M commented on SPARK-13127:
--

It would be great to get this resolved in Spark 2.3.2, especially since Parquet 
1.9 supports delta encoding: https://issues.apache.org/jira/browse/PARQUET-225

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
> --
>
> Key: SPARK-13127
> URL: https://issues.apache.org/jira/browse/SPARK-13127
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Justin Pihony
>Priority: Major
>
> Currently, when you write a sorted DataFrame to Parquet, then reading the 
> data back out is not sorted by default. [This is due to a bug in 
> Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in 
> 1.9.
> There is a workaround to read the file back in using a file glob (filepath/*).
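
A short sketch of the glob workaround described above (the /sorted-data path is 
illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("GlobRead").getOrCreate()

// Read the directory back with a file glob, as suggested in the issue
// description, instead of pointing at the directory itself.
val df = spark.read.parquet("/sorted-data/*")
{code}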






[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of tasks to run

2018-06-12 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Summary: Cores are left idle when there are a lot of tasks to run  (was: 
Cores are left idle when there are a lot of stages to run)

> Cores are left idle when there are a lot of tasks to run
> 
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false






[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-11 Thread Al M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507965#comment-16507965
 ] 

Al M commented on SPARK-24474:
--

I also tried changing spark.scheduler.mode to "FIFO"; that didn't fix the problem 
either.
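
For reference, a sketch of how that setting is applied (the comment does not say 
how it was set); spark.scheduler.mode must be in place before the SparkContext 
starts:

{code:java}
import org.apache.spark.SparkConf

// Sketch only: FIFO is Spark's default scheduling mode and FAIR is the
// alternative; it can also be passed as --conf spark.scheduler.mode=FIFO.
val conf = new SparkConf()
  .setAppName("IdleCores")
  .set("spark.scheduler.mode", "FIFO")
{code}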

> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false






[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-07 Thread Al M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504730#comment-16504730
 ] 

Al M commented on SPARK-24474:
--

Sample code that reproduces this:
{code:java}
import org.apache.spark.sql.SparkSession

val builder = SparkSession.builder()
val spark = builder.appName("IdleCores").getOrCreate()

val optimalMinPartitions = 1
spark.conf.set("spark.default.parallelism", optimalMinPartitions)
spark.conf.set("spark.sql.shuffle.partitions", optimalMinPartitions)

val input1 = spark.read.parquet("/input1") //Must be big
val input2 = spark.read.parquet("/input2") // Must be 10x as big as input1

val joined = input1.join(input2, Seq("key1", "key2"))
joined.agg(Map("somevalue" -> "SUM")).show()
{code}

> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false






[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-07 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Description: 
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
 * It happens when dynamic allocation is enabled, and when it is disabled
 * The stage that hangs (referred to as "Second stage" above) has a lower 
'Stage Id' than the first one that completes
 * This happens with spark.shuffle.service.enabled set to true and false

  was:
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
 * It happens when dynamic allocation is enabled, and when it is disabled
 * The stage that hangs (referred to as "Second stage" above) has a lower 
'Stage Id' than the first one that completes


> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores 

[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Description: 
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
 * It happens when dynamic allocation is enabled, and when it is disabled
 * The stage that hangs (referred to as "Second stage" above) has a lower 
'Stage Id' than the first one that completes

  was:
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
 * It happens when dynamic allocation is enabled, and when it is disabled


> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The 

[jira] [Commented] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503225#comment-16503225
 ] 

Al M commented on SPARK-24474:
--

I appreciate that 2.2.0 is slightly old, but I couldn't see any scheduler fixes 
in later versions that sounded like this issue.

> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled






[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Description: 
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled
 * It happens when dynamic allocation is enabled, and when it is disabled

  was:
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled


> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled




[jira] [Created] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)
Al M created SPARK-24474:


 Summary: Cores are left idle when there are a lot of stages to run
 Key: SPARK-24474
 URL: https://issues.apache.org/jira/browse/SPARK-24474
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Al M


I've observed an issue happening consistently when:
* A job contains a join of two datasets
* One dataset is much larger than the other
* Both datasets require some processing before they are joined

What I have observed is:
* 2 stages are initially active to run processing on the two datasets
** These stages are run in parallel
** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
** Spark allocates a similar (though not exactly equal) number of cores to each 
stage
* First stage completes (for the smaller dataset)
** Now there is only one stage running
** It still has many tasks left (usually > 20k tasks)
** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
** This continues until the second stage completes
* Second stage completes, and third begins (the stage that actually joins the 
data)
** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
* It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to other stages
* I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled






[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of stages to run

2018-06-06 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Description: 
I've observed an issue happening consistently when:
 * A job contains a join of two datasets
 * One dataset is much larger than the other
 * Both datasets require some processing before they are joined

What I have observed is:
 * 2 stages are initially active to run processing on the two datasets
 ** These stages are run in parallel
 ** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
 ** Spark allocates a similar (though not exactly equal) number of cores to 
each stage
 * First stage completes (for the smaller dataset)
 ** Now there is only one stage running
 ** It still has many tasks left (usually > 20k tasks)
 ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
 ** This continues until the second stage completes
 * Second stage completes, and third begins (the stage that actually joins the 
data)
 ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
 * It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to existing stages
 * Once all active stages are done, we release all cores to new stages
 * I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled

  was:
I've observed an issue happening consistently when:
* A job contains a join of two datasets
* One dataset is much larger than the other
* Both datasets require some processing before they are joined

What I have observed is:
* 2 stages are initially active to run processing on the two datasets
** These stages are run in parallel
** One stage has significantly more tasks than the other (e.g. one has 30k 
tasks and the other has 2k tasks)
** Spark allocates a similar (though not exactly equal) number of cores to each 
stage
* First stage completes (for the smaller dataset)
** Now there is only one stage running
** It still has many tasks left (usually > 20k tasks)
** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 103)
** This continues until the second stage completes
* Second stage completes, and third begins (the stage that actually joins the 
data)
** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
tasks = 200)

Other interesting things about this:
* It seems that when we have multiple stages active, and one of them finishes, 
it does not actually release any cores to other stages
* I can't reproduce this locally on my machine, only on a cluster with YARN 
enabled


> Cores are left idle when there are a lot of stages to run
> -
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled






[jira] [Updated] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy)

2018-05-17 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24306:
-
Summary: Sort a Dataset with a lambda (like RDD.sortBy)  (was: Sort a 
Dataset with a lambda (like RDD.sortBy() ))

> Sort a Dataset with a lambda (like RDD.sortBy)
> --
>
> Key: SPARK-24306
> URL: https://issues.apache.org/jira/browse/SPARK-24306
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Al M
>Priority: Minor
>
> Dataset has many useful functions that do not require hard coded column names 
> (e.g. flatMapGroups, map).  Currently the sort() function requires us to pass 
> the column name as a string. 
> It would be nice to have a function similar to the sortBy in RDD where I can 
> define the keys in a lambda, e.g.
> {code:java}
> ds.sortBy(record => record.id){code}






[jira] [Updated] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy() )

2018-05-17 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24306:
-
Summary: Sort a Dataset with a lambda (like RDD.sortBy() )  (was: Sort a 
Dataset with a lambda (like RDD.sortBy())

> Sort a Dataset with a lambda (like RDD.sortBy() )
> -
>
> Key: SPARK-24306
> URL: https://issues.apache.org/jira/browse/SPARK-24306
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Al M
>Priority: Minor
>
> Dataset has many useful functions that do not require hard coded column names 
> (e.g. flatMapGroups, map).  Currently the sort() function requires us to pass 
> the column name as a string. 
> It would be nice to have a function similar to the sortBy in RDD where I can 
> define the keys in a lambda, e.g.
> {code:java}
> ds.sortBy(record => record.id){code}






[jira] [Created] (SPARK-24306) Sort a Dataset with a lambda (like RDD.sortBy()

2018-05-17 Thread Al M (JIRA)
Al M created SPARK-24306:


 Summary: Sort a Dataset with a lambda (like RDD.sortBy()
 Key: SPARK-24306
 URL: https://issues.apache.org/jira/browse/SPARK-24306
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Al M


Dataset has many useful functions that do not require hard coded column names 
(e.g. flatMapGroups, map).  Currently the sort() function requires us to pass 
the column name as a string. 

It would be nice to have a function similar to the sortBy in RDD where I can 
define the keys in a lambda, e.g.
{code:java}
ds.sortBy(record => record.id){code}
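
Until something like this exists, a rough sketch of one way to approximate it on 
top of the current API; the sortByKey name and the helper itself are hypothetical, 
not part of Spark:

{code:java}
import org.apache.spark.sql.{Dataset, Encoder}

// Hypothetical helper (not part of the Spark API): pair each record with the
// key produced by the lambda, order by that key column, then drop the key.
implicit class LambdaSortOps[T](ds: Dataset[T]) {
  def sortByKey[K](f: T => K)(implicit pair: Encoder[(K, T)], t: Encoder[T]): Dataset[T] =
    ds.map(r => (f(r), r)).orderBy("_1").map(_._2)
}

// Usage, mirroring the example above (assuming the needed encoders are in scope):
// ds.sortByKey(record => record.id)
{code}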






[jira] [Closed] (SPARK-14532) Spark SQL IF/ELSE does not handle Double correctly

2016-04-12 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M closed SPARK-14532.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Spark SQL IF/ELSE does not handle Double correctly
> --
>
> Key: SPARK-14532
> URL: https://issues.apache.org/jira/browse/SPARK-14532
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Al M
> Fix For: 2.0.0
>
>
> I am using Spark SQL to add new columns to my data.  Below is an example 
> snippet in Scala:
> {code}myDF.withColumn("newcol", new 
> Column(SqlParser.parseExpression(sparkSqlExpr))).show{code}
> *What Works*
> If sparkSqlExpr = "IF(1=1, 1, 0)" then i see 1 in the result as expected.
> If sparkSqlExpr = "IF(1=1, 1.0, 1.5)" then i see 1.0 in the result as 
> expected.
> If sparkSqlExpr = "IF(1=1, 'A', 'B')" then i see 'A' in the result as 
> expected.
> *What does not Work*
> If sparkSqlExpr = "IF(1=1, 1.0, 0.0)" then I see error 
> org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else 
> 0.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else 
> 0.0' (decimal(2,1) and decimal(1,1)).;
> If sparkSqlExpr = "IF(1=1, 1.0, 10.0)" then I see error If sparkSqlExpr = 
> "IF(1=1, 1.0, 0.0)" then I see error   
> org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else 
> 10.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else 
> 10.0' (decimal(2,1) and decimal(3,1)).;
> If sparkSqlExpr = "IF(1=1, 1.1, 1.11)" then I see error 
> org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.1 else 
> 1.11' due to data type mismatch: differing types in 'if ((1 = 1)) 1.1 else 
> 1.11' (decimal(2,1) and decimal(3,2)).;
> It looks like the Spark SQL typing system sees doubles as different types 
> depending on the number of digits before and after the decimal point.






[jira] [Commented] (SPARK-14532) Spark SQL IF/ELSE does not handle Double correctly

2016-04-12 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237058#comment-15237058
 ] 

Al M commented on SPARK-14532:
--

Thanks Bo.  I'll look forward to the 2.0 release :)

> Spark SQL IF/ELSE does not handle Double correctly
> --
>
> Key: SPARK-14532
> URL: https://issues.apache.org/jira/browse/SPARK-14532
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Al M
> Fix For: 2.0.0
>
>
> I am using Spark SQL to add new columns to my data.  Below is an example 
> snippet in Scala:
> {code}myDF.withColumn("newcol", new 
> Column(SqlParser.parseExpression(sparkSqlExpr))).show{code}
> *What Works*
> If sparkSqlExpr = "IF(1=1, 1, 0)" then I see 1 in the result as expected.
> If sparkSqlExpr = "IF(1=1, 1.0, 1.5)" then I see 1.0 in the result as 
> expected.
> If sparkSqlExpr = "IF(1=1, 'A', 'B')" then I see 'A' in the result as 
> expected.
> *What does not Work*
> If sparkSqlExpr = "IF(1=1, 1.0, 0.0)" then I see error 
> org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else 
> 0.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else 
> 0.0' (decimal(2,1) and decimal(1,1)).;
> If sparkSqlExpr = "IF(1=1, 1.0, 10.0)" then I see error 
> org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else 
> 10.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else 
> 10.0' (decimal(2,1) and decimal(3,1)).;
> If sparkSqlExpr = "IF(1=1, 1.1, 1.11)" then I see error 
> org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.1 else 
> 1.11' due to data type mismatch: differing types in 'if ((1 = 1)) 1.1 else 
> 1.11' (decimal(2,1) and decimal(3,2)).;
> It looks like the Spark SQL typing system sees doubles as different types 
> depending on the number of digits before and after the decimal point.






[jira] [Created] (SPARK-14532) Spark SQL IF/ELSE does not handle Double correctly

2016-04-11 Thread Al M (JIRA)
Al M created SPARK-14532:


 Summary: Spark SQL IF/ELSE does not handle Double correctly
 Key: SPARK-14532
 URL: https://issues.apache.org/jira/browse/SPARK-14532
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Al M


I am using Spark SQL to add new columns to my data.  Below is an example 
snippet in Scala:
{code}myDF.withColumn("newcol", new 
Column(SqlParser.parseExpression(sparkSqlExpr))).show{code}

*What Works*
If sparkSqlExpr = "IF(1=1, 1, 0)" then I see 1 in the result as expected.
If sparkSqlExpr = "IF(1=1, 1.0, 1.5)" then I see 1.0 in the result as expected.
If sparkSqlExpr = "IF(1=1, 'A', 'B')" then I see 'A' in the result as expected.

*What does not Work*
If sparkSqlExpr = "IF(1=1, 1.0, 0.0)" then I see error 
org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.0 else 
0.0' due to data type mismatch: differing types in 'if ((1 = 1)) 1.0 else 0.0' 
(decimal(2,1) and decimal(1,1)).;

If sparkSqlExpr = "IF(1=1, 1.0, 10.0)" then I see error org.apache.spark.sql.AnalysisException: 
cannot resolve 'if ((1 = 1)) 1.0 else 10.0' due to data type mismatch: 
differing types in 'if ((1 = 1)) 1.0 else 10.0' (decimal(2,1) and 
decimal(3,1)).;

If sparkSqlExpr = "IF(1=1, 1.1, 1.11)" then I see error 
org.apache.spark.sql.AnalysisException: cannot resolve 'if ((1 = 1)) 1.1 else 
1.11' due to data type mismatch: differing types in 'if ((1 = 1)) 1.1 else 
1.11' (decimal(2,1) and decimal(3,2)).;

It looks like the Spark SQL typing system sees doubles as different types 
depending on the number of digits before and after the decimal point.
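
A workaround sketch rather than a fix (not from this ticket): give both branches 
of the IF the same type by casting explicitly, so the parser does not see two 
different decimal precisions. It reuses the same myDF / SqlParser objects as the 
snippet above:

{code:java}
// Cast both branches to DOUBLE so their types match.
val sparkSqlExpr = "IF(1=1, CAST(1.0 AS DOUBLE), CAST(0.0 AS DOUBLE))"
myDF.withColumn("newcol", new Column(SqlParser.parseExpression(sparkSqlExpr))).show
{code}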






[jira] [Commented] (SPARK-9872) Allow passing of 'numPartitions' to DataFrame joins

2015-08-18 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701158#comment-14701158
 ] 

Al M commented on SPARK-9872:
-

I would also be happy if we just get the partition count from the parents.  
That would be even better than setting it manually since that's all my code 
would be doing anyway.

Right now it is always using a fixed default number, which isn't much use if my 
spark application uses lots of different files of different sizes.

 Allow passing of 'numPartitions' to DataFrame joins
 ---

 Key: SPARK-9872
 URL: https://issues.apache.org/jira/browse/SPARK-9872
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.1
Reporter: Al M
Priority: Minor

 When I join two normal RDDs, I can set the number of shuffle partitions in 
 the 'numPartitions' argument.  When I join two DataFrames I do not have this 
 option.
 My spark job loads in 2 large files and 2 small files.  When I perform a 
 join, this will use the spark.sql.shuffle.partitions to determine the 
 number of partitions.  This means that the join with my small files will use 
 exactly the same number of partitions as the join with my large files.
 I can either use a low number of partitions and run out of memory on my large 
 join, or use a high number of partitions and my small join will take far too 
 long.
 If we were able to specify the number of shuffle partitions in a DataFrame 
 join like in an RDD join, this would not be an issue.
 My long term ideal solution would be dynamic partition determination as 
 described in SPARK-4630.  However I appreciate that it is not particularly 
 easy to do.






[jira] [Updated] (SPARK-9872) Allow passing of 'numPartitions' to DataFrame joins

2015-08-12 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-9872:

Description: 
When I join two normal RDDs, I can set the number of shuffle partitions in the 
'numPartitions' argument.  When I join two DataFrames I do not have this option.

My spark job loads in 2 large files and 2 small files.  When I perform a join, 
this will use the spark.sql.shuffle.partitions to determine the number of 
partitions.  This means that the join with my small files will use exactly the 
same number of partitions as the join with my large files.

I can either use a low number of partitions and run out of memory on my large 
join, or use a high number of partitions and my small join will take far too 
long.

If we were able to specify the number of shuffle partitions in a DataFrame join 
like in an RDD join, this would not be an issue.

My long term ideal solution would be dynamic partition determination as 
described in SPARK-4630.  However I appreciate that it is not particularly easy 
to do.

  was:
When I join two normal RDDs, I can set the number of shuffle partitions in the 
'numPartitions' argument.  When I join two DataFrames I do not have this option.

My spark job loads in 2 large files and 2 small files.  When I perform a join, 
this will use the spark.sql.shuffle.partitions to determine the number of 
partitions.  This means that the join with my small files will use exactliy the 
same number of partitions as the join with my large files.

I can either use a low number of partitions and run out of memory on my large 
join, or use a high number of partitions and my small join will take far too 
long.

If we were able to specify the number of shuffle partitions in a DataFrame join 
like in an RDD join, this would not be an issue.

My long term ideal solution would be dynamic partition determination as 
described in SPARK-4630.  However I appreciate that it is not particularly easy 
to do.


 Allow passing of 'numPartitions' to DataFrame joins
 ---

 Key: SPARK-9872
 URL: https://issues.apache.org/jira/browse/SPARK-9872
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.1
Reporter: Al M
Priority: Minor

 When I join two normal RDDs, I can set the number of shuffle partitions in 
 the 'numPartitions' argument.  When I join two DataFrames I do not have this 
 option.
 My spark job loads in 2 large files and 2 small files.  When I perform a 
 join, this will use the spark.sql.shuffle.partitions to determine the 
 number of partitions.  This means that the join with my small files will use 
 exactly the same number of partitions as the join with my large files.
 I can either use a low number of partitions and run out of memory on my large 
 join, or use a high number of partitions and my small join will take far too 
 long.
 If we were able to specify the number of shuffle partitions in a DataFrame 
 join like in an RDD join, this would not be an issue.
 My long term ideal solution would be dynamic partition determination as 
 described in SPARK-4630.  However I appreciate that it is not particularly 
 easy to do.






[jira] [Created] (SPARK-9872) Allow passing of 'numPartitions' to DataFrame joins

2015-08-12 Thread Al M (JIRA)
Al M created SPARK-9872:
---

 Summary: Allow passing of 'numPartitions' to DataFrame joins
 Key: SPARK-9872
 URL: https://issues.apache.org/jira/browse/SPARK-9872
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.1
Reporter: Al M
Priority: Minor


When I join two normal RDDs, I can set the number of shuffle partitions in the 
'numPartitions' argument.  When I join two DataFrames I do not have this option.

My spark job loads in 2 large files and 2 small files.  When I perform a join, 
this will use the spark.sql.shuffle.partitions to determine the number of 
partitions.  This means that the join with my small files will use exactly the 
same number of partitions as the join with my large files.

I can either use a low number of partitions and run out of memory on my large 
join, or use a high number of partitions and my small join will take far too 
long.

If we were able to specify the number of shuffle partitions in a DataFrame join 
like in an RDD join, this would not be an issue.

My long term ideal solution would be dynamic partition determination as 
described in SPARK-4630.  However I appreciate that it is not particularly easy 
to do.
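
In the meantime, a workaround sketch against the 1.4.x API (the DataFrames, join 
key and partition counts below are illustrative):

{code:java}
// Sketch only: spark.sql.shuffle.partitions is a runtime SQL conf, so it can be
// changed between queries; set it just before executing each join.
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")   // large inputs
bigDF1.join(bigDF2, "key").write.parquet("/tmp/big_joined")

sqlContext.setConf("spark.sql.shuffle.partitions", "50")     // small inputs
smallDF1.join(smallDF2, "key").write.parquet("/tmp/small_joined")
{code}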






[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-02-12 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318088#comment-14318088
 ] 

Al M commented on SPARK-5768:
-

So when it says *Memory Used* 3.2GB / 20GB it actually means we are using 3.2GB 
of memory for caching out of a total 20GB available for caching?

Calling the column 'Storage Memory' would be clearer to me.  If changing the 
heading of the column is not an option, a tooltip explaining that it refers to 
memory used for storage would help.

I'd find it pretty useful to have another column that shows my total memory 
usage.  Right now I can only see this by running 'free' or 'top' on every 
machine or by looking at the Yarn UI.
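
A rough arithmetic sketch of where the reported number could come from, assuming 
Spark 1.2's legacy memory defaults (the ticket does not state them):

{code:java}
// The Executors tab reports *storage* memory, roughly
// heap * spark.storage.memoryFraction (0.6) * spark.storage.safetyFraction (0.9).
val executorHeapGb = 40.0
val storageGb = executorHeapGb * 0.6 * 0.9   // ~21.6 GB, in the ballpark of the "x / 20.3GB" shown
{code}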

 Spark UI Shows incorrect memory under Yarn
 --

 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0, 1.2.1
 Environment: Centos 6
Reporter: Al M
Priority: Trivial

 I am running Spark on Yarn with 2 executors.  The executors are running on 
 separate physical machines.
 I have spark.executor.memory set to '40g'.  This is because I want to have 
 40g of memory used on each machine.  I have one executor per machine.
 When I run my application I see from 'top' that both my executors are using 
 the full 40g of memory I allocated to them.
 The 'Executors' tab in the Spark UI shows something different.  It shows the 
 memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
 look like I only have 20GB available per executor when really I have 40GB 
 available.






[jira] [Created] (SPARK-5768) Spark UI Shows incorrect memory under Yarn

2015-02-12 Thread Al M (JIRA)
Al M created SPARK-5768:
---

 Summary: Spark UI Shows incorrect memory under Yarn
 Key: SPARK-5768
 URL: https://issues.apache.org/jira/browse/SPARK-5768
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1, 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial


I am running Spark on Yarn with 2 executors.  The executors are running on 
separate physical machines.

I have spark.executor.memory set to '40g'.  This is because I want to have 40g 
of memory used on each machine.  I have one executor per machine.

When I run my application I see from 'top' that both my executors are using the 
full 40g of memory I allocated to them.

The 'Executors' tab in the Spark UI shows something different.  It shows the 
memory used as a total of 20GB per executor e.g. x / 20.3GB.  This makes it 
look like I only have 20GB available per executor when really I have 40GB 
available.






[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty

2015-01-16 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279962#comment-14279962
 ] 

Al M commented on SPARK-5270:
-

Good point, it's not a catch-all solution.  The rdd.partitions.size check does 
work well in the case of empty RDDs created by Spark Streaming.
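
A hedged sketch of the two cases being contrasted here; the RDDs are made up
for illustration.

{code}
// rdd.partitions.size catches RDDs built with no partitions (as Spark Streaming
// produces for empty batches), but not RDDs that have partitions yet contain no
// records, e.g. after an aggressive filter.
import org.apache.spark.{SparkConf, SparkContext}

object EmptyRddChecks {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("empty-rdd-checks").setMaster("local[*]"))

    val noPartitions = sc.emptyRDD[Int]
    println(noPartitions.partitions.size)          // 0 -> detected as empty

    val emptyButPartitioned = sc.parallelize(1 to 100).filter(_ > 1000)
    println(emptyButPartitioned.partitions.size)   // > 0 despite holding no records
    println(emptyButPartitioned.take(1).isEmpty)   // true -- record-level check

    sc.stop()
  }
}
{code}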

 Elegantly check if RDD is empty
 ---

 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial

 Right now there is no clean way to check if an RDD is empty.  As discussed 
 here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
 I'd like a method rdd.isEmpty that returns a boolean.
 This would be especially useful when using streams.  Sometimes my batches are 
 huge in one stream, sometimes I get nothing for hours.  Still I have to run 
 count() to check if there is anything in the RDD.  I can process my empty RDD 
 like the others but it would be more efficient to just skip the empty ones.
 I can also run first() and catch the exception; this is neither a clean nor 
 fast solution.






[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty

2015-01-16 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280260#comment-14280260
 ] 

Al M commented on SPARK-5270:
-

I don't mind at all.  I'd be really happy to have such a utility method in 
Spark.

 Elegantly check if RDD is empty
 ---

 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial

 Right now there is no clean way to check if an RDD is empty.  As discussed 
 here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
 I'd like a method rdd.isEmpty that returns a boolean.
 This would be especially useful when using streams.  Sometimes my batches are 
 huge in one stream, sometimes I get nothing for hours.  Still I have to run 
 count() to check if there is anything in the RDD.  I can process my empty RDD 
 like the others but it would be more efficient to just skip the empty ones.
 I can also run first() and catch the exception; this is neither a clean nor 
 fast solution.






[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278983#comment-14278983
 ] 

Al M commented on SPARK-5270:
-

I just noticed that rdd.partitions.size is 0 for empty RDDs and > 0 for RDDs 
with data; this is a far more elegant check than the others.

 Elegantly check if RDD is empty
 ---

 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial

 Right now there is no clean way to check if an RDD is empty.  As discussed 
 here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
 I'd like a method rdd.isEmpty that returns a boolean.
 This would be especially useful when using streams.  Sometimes my batches are 
 huge in one stream, sometimes I get nothing for hours.  Still I have to run 
 count() to check if there is anything in the RDD.  I can process my empty RDD 
 like the others but it would be more efficient to just skip the empty ones.
 I can also run first() and catch the exception; this is neither a clean nor 
 fast solution.






[jira] [Created] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)
Al M created SPARK-5270:
---

 Summary: Elegantly check if RDD is empty
 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial


Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

This is especially a problem when using streams.  Sometimes my batches are huge 
in one stream, sometimes I get nothing for hours.  Still I have to run count() 
to check if there is anything in the RDD.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.

I'd like a method rdd.isEmpty that returns a boolean.
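
Until such a method exists, a hedged sketch of a helper along these lines; it is
just the obvious pattern of looking at a single element rather than counting
everything, not a claim about how Spark itself would implement it.

{code}
import org.apache.spark.rdd.RDD

object RddUtils {
  // Pull back at most one element; an empty array means the RDD has no records.
  // Cheaper than count() and avoids the exception thrown by first() on empty RDDs.
  def isRddEmpty[T](rdd: RDD[T]): Boolean = rdd.take(1).isEmpty
}
{code}

In a streaming job this makes it cheap to skip empty batches before doing any
real work on them.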






[jira] [Updated] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-5270:

Description: 
Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

I'd like a method rdd.isEmpty that returns a boolean.

This would be especially useful when using streams.  Sometimes my batches are 
huge in one stream, sometimes I get nothing for hours.  Still I have to run 
count() to check if there is anything in the RDD.  I can process my empty RDD 
like the others but it would be more efficient to just skip the empty ones.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.



  was:
Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

This is especially a problem when using streams.  Sometimes my batches are huge 
in one stream, sometimes I get nothing for hours.  Still I have to run count() 
to check if there is anything in the RDD.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.

I'd like a method rdd.isEmpty that returns a boolean.


 Elegantly check if RDD is empty
 ---

 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial

 Right now there is no clean way to check if an RDD is empty.  As discussed 
 here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
 I'd like a method rdd.isEmpty that returns a boolean.
 This would be especially useful when using streams.  Sometimes my batches are 
 huge in one stream, sometimes I get nothing for hours.  Still I have to run 
 count() to check if there is anything in the RDD.  I can process my empty RDD 
 like the others but it would be more efficient to just skip the empty ones.
 I can also run first() and catch the exception; this is neither a clean nor 
 fast solution.






[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-08 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269263#comment-14269263
 ] 

Al M commented on SPARK-5137:
-

That's right.  {code}a{code} has 11 partitions and {code}b{code} has a lot 
more.  I can see why you wouldn't want to force a shuffle on {code}a{code} 
since that's unnecessary processing.

Thanks for your detailed explanation and quick response.  I'll close this since 
I agree that it behaves correctly.

 subtract does not take the spark.default.parallelism into account
 -

 Key: SPARK-5137
 URL: https://issues.apache.org/jira/browse/SPARK-5137
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: CENTOS 6; scala
Reporter: Al M
Priority: Trivial

 The 'subtract' function (PairRDDFunctions.scala) in scala does not use the 
 default parallelism value set in the config (spark.default.parallelism).  
 This is easy enough to work around.  I can just load the property and pass it 
 in as an argument.
 It would be great if subtract used the default value, just like all the other 
 PairRDDFunctions.






[jira] [Comment Edited] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-08 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269263#comment-14269263
 ] 

Al M edited comment on SPARK-5137 at 1/8/15 12:30 PM:
--

That's right.  a has 11 partitions and b has a lot more.  I can see why you 
wouldn't want to force a shuffle on a since that's unnecessary processing.

Thanks for your detailed explanation and quick response.  I'll close this since 
I agree that it behaves correctly.


was (Author: alrocks47):
That's right.  _a_ has 11 partitions and _b_ has a lot more.  I can see why you 
wouldn't want to force a shuffle on _a_ since that's unnecessary processing.

Thanks for your detailed explanation and quick response.  I'll close this since 
I agree that it behaves correctly.

 subtract does not take the spark.default.parallelism into account
 -

 Key: SPARK-5137
 URL: https://issues.apache.org/jira/browse/SPARK-5137
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: CENTOS 6; scala
Reporter: Al M
Priority: Trivial

 The 'subtract' function (PairRDDFunctions.scala) in scala does not use the 
 default parallelism value set in the config (spark.default.parallelism).  
 This is easy enough to work around.  I can just load the property and pass it 
 in as an argument.
 It would be great if subtract used the default value, just like all the other 
 PairRDDFunctions.






[jira] [Comment Edited] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-08 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269263#comment-14269263
 ] 

Al M edited comment on SPARK-5137 at 1/8/15 12:30 PM:
--

That's right.  _a_ has 11 partitions and _b_ has a lot more.  I can see why you 
wouldn't want to force a shuffle on _a_ since that's unnecessary processing.

Thanks for your detailed explanation and quick response.  I'll close this since 
I agree that it behaves correctly.


was (Author: alrocks47):
That's right.  {code}a{code} has 11 partitions and {code}b{code} has a lot 
more.  I can see why you wouldn't want to force a shuffle on {code}a{code} 
since that's unnecessary processing.

Thanks for your detailed explanation and quick response.  I'll close this since 
I agree that it behaves correctly.

 subtract does not take the spark.default.parallelism into account
 -

 Key: SPARK-5137
 URL: https://issues.apache.org/jira/browse/SPARK-5137
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: CENTOS 6; scala
Reporter: Al M
Priority: Trivial

 The 'subtract' function (PairRDDFunctions.scala) in scala does not use the 
 default parallelism value set in the config (spark.default.parallelism).  
 This is easy enough to work around.  I can just load the property and pass it 
 in as an argument.
 It would be great if subtract used the default value, just like all the other 
 PairRDDFunctions.






[jira] [Closed] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-08 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M closed SPARK-5137.
---
Resolution: Not a Problem

 subtract does not take the spark.default.parallelism into account
 -

 Key: SPARK-5137
 URL: https://issues.apache.org/jira/browse/SPARK-5137
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: CENTOS 6; scala
Reporter: Al M
Priority: Trivial

 The 'subtract' function (PairRDDFunctions.scala) in scala does not use the 
 default parallelism value set in the config (spark.default.parallelism).  
 This is easy enough to work around.  I can just load the property and pass it 
 in as an argument.
 It would be great if subtract used the default value, just like all the other 
 PairRDDFunctions.






[jira] [Commented] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-08 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268969#comment-14268969
 ] 

Al M commented on SPARK-5137:
-

Yes I do mean subtractByKey.  Sorry for not being clear.

I'm new to Spark and it could be that I just don't understand something 
correctly.  Below is a more detailed description of the results I saw.  I have 
default parallelism set to 160 since I am limited on memory and am working 
with a lot of data.

* Map is run [11 tasks]
* Filter is run [2 tasks]
* Join with another RDD and run map [160 tasks]
* Join with another RDD and map again [160 tasks]
* SubtractByKey is run [11 tasks]

In the last step I run out of memory because subtractByKey was only split into 
11 tasks.  If I override the partitions to 160 then it works fine.  I thought 
that subtractByKey would use the default parallelism just like the other tasks 
after the join.

If the expected solution is that I override the partitions in my call, I'm fine 
with that.  So far I have managed to avoid setting it in any calls by just 
setting the default parallelism instead.  I was concerned that the observed 
behavior was part of an actual issue.
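
For reference, a hedged sketch of the workaround being discussed; the data and
partition counts are placeholders chosen to mirror the numbers above.

{code}
// subtractByKey without a numPartitions argument inherits the partitioning of
// the left-hand RDD (11 partitions here) rather than spark.default.parallelism;
// passing the parallelism explicitly restores the 160-way split.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.2

object SubtractParallelism {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("subtract-parallelism")
      .setMaster("local[*]")              // for a standalone run of this sketch
      .set("spark.default.parallelism", "160")
    val sc = new SparkContext(conf)

    val keep   = sc.parallelize(Seq(1 -> "a", 2 -> "b"), 11)  // stands in for the 11-partition RDD
    val remove = sc.parallelize(Seq(2 -> "b"), 160)

    println(keep.subtractByKey(remove).partitions.size)                        // 11
    println(keep.subtractByKey(remove, sc.defaultParallelism).partitions.size) // 160

    sc.stop()
  }
}
{code}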

 subtract does not take the spark.default.parallelism into account
 -

 Key: SPARK-5137
 URL: https://issues.apache.org/jira/browse/SPARK-5137
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: CENTOS 6; scala
Reporter: Al M
Priority: Trivial

 The 'subtract' function (PairRDDFunctions.scala) in scala does not use the 
 default parallelism value set in the config (spark.default.parallelism).  
 This is easy enough to work around.  I can just load the property and pass it 
 in as an argument.
 It would be great if subtract used the default value, just like all the other 
 PairRDDFunctions.






[jira] [Created] (SPARK-5137) subtract does not take the spark.default.parallelism into account

2015-01-07 Thread Al M (JIRA)
Al M created SPARK-5137:
---

 Summary: subtract does not take the spark.default.parallelism into 
account
 Key: SPARK-5137
 URL: https://issues.apache.org/jira/browse/SPARK-5137
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: CENTOS 6; scala
Reporter: Al M
Priority: Trivial


The 'subtract' function (PairRDDFunctions.scala) in scala does not use the 
default parallelism value set in the config (spark.default.parallelism).  This 
is easy enough to work around.  I can just load the property and pass it in as 
an argument.

It would be great if subtract used the default value, just like all the other 
PairRDDFunctions.


