[jira] [Commented] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan
[ https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15988953#comment-15988953 ]

Bill Chambers commented on SPARK-20496:
---------------------------------------

This should probably be backported too.

> KafkaWriter Uses Unanalyzed Logical Plan
>
> Key: SPARK-20496
> URL: https://issues.apache.org/jira/browse/SPARK-20496
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.1.0, 2.2.0
> Reporter: Bill Chambers
>
> Right now we use the unanalyzed logical plan for writing to Kafka; we should use the analyzed plan.
> https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan
[ https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-20496:
----------------------------------
Affects Version/s: 2.1.0
[jira] [Updated] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan
[ https://issues.apache.org/jira/browse/SPARK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-20496:
----------------------------------
Description:
Right now we use the unanalyzed logical plan for writing to Kafka; we should use the analyzed plan.
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L50

was:
Right now we use the unanalyzed logical plan for writing to Kafka; we should use the analyzed plan.
[jira] [Created] (SPARK-20496) KafkaWriter Uses Unanalyzed Logical Plan
Bill Chambers created SPARK-20496:
----------------------------------

Summary: KafkaWriter Uses Unanalyzed Logical Plan
Key: SPARK-20496
URL: https://issues.apache.org/jira/browse/SPARK-20496
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Bill Chambers

Right now we use the unanalyzed logical plan for writing to Kafka; we should use the analyzed plan.
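The one-line report above can be illustrated with a toy model (all types and names here are hypothetical, not Spark's internals): a schema check that runs against the plan before analysis sees pre-resolution attribute names, so it can reject a query whose analyzed plan is perfectly valid.

```scala
// Toy model of the bug. The Kafka writer requires a "value" column in the
// output schema; validating the *unanalyzed* plan sees pre-resolution names.
// Plan, analyze, and validate are hypothetical illustrations.
case class Plan(output: Seq[String])

// "Analysis" here just resolves aliases to their final column names.
def analyze(parsed: Plan, aliases: Map[String, String]): Plan =
  Plan(parsed.output.map(c => aliases.getOrElse(c, c)))

// The writer's schema check: a "value" column must be present.
def validate(plan: Plan): Boolean = plan.output.contains("value")

// e.g. SELECT payload AS value FROM ...
val parsed = Plan(Seq("payload"))
val analyzed = analyze(parsed, Map("payload" -> "value"))
```

In this sketch `validate(parsed)` is false while `validate(analyzed)` is true, which mirrors why the check at KafkaWriter.scala#L50 should run on the analyzed plan.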
[jira] [Commented] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation
[ https://issues.apache.org/jira/browse/SPARK-20400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976102#comment-15976102 ]

Bill Chambers commented on SPARK-20400:
---------------------------------------

I'd like to see what others have to say; maybe this isn't a big deal, but it does seem like a fairly explicit vendor reference. I cede the discussion to the community. I'm happy either way but wanted to mention it.

> Remove References to Third Party Vendors from Spark ASF Documentation
>
> Key: SPARK-20400
> URL: https://issues.apache.org/jira/browse/SPARK-20400
> Project: Spark
> Issue Type: Bug
> Components: Documentation
> Affects Versions: 2.1.0
> Reporter: Bill Chambers
> Fix For: 2.3.0
>
> Similar to SPARK-17445, vendors should probably not be referenced in the ASF documentation.
> Related: https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636
[jira] [Updated] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation
[ https://issues.apache.org/jira/browse/SPARK-20400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-20400:
----------------------------------
Description:
Similar to SPARK-17445, vendors should probably not be referenced in the ASF documentation.
Related: https://github.com/apache/spark/commit/dc0a4c916151c795dc41b5714e9d23b4937f4636
[jira] [Created] (SPARK-20400) Remove References to Third Party Vendors from Spark ASF Documentation
Bill Chambers created SPARK-20400:
----------------------------------

Summary: Remove References to Third Party Vendors from Spark ASF Documentation
Key: SPARK-20400
URL: https://issues.apache.org/jira/browse/SPARK-20400
Project: Spark
Issue Type: Bug
Components: Documentation
Affects Versions: 2.1.0
Reporter: Bill Chambers
Fix For: 2.3.0
[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886109#comment-15886109 ]

Bill Chambers commented on SPARK-19714:
---------------------------------------

Agree with your first and second paragraphs. Regarding the third: it's certainly worth a discussion, but it's a pretty big departure from the current definition, which is worrisome.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.1.0
> Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect this to handle the invalid buckets. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the lower/upper bound constraints.
> {code}
> It seems strange that handleInvalid doesn't actually handle invalid inputs. Thoughts, anyone?
[jira] [Comment Edited] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883125#comment-15883125 ]

Bill Chambers edited comment on SPARK-19714 at 2/24/17 5:15 PM:
----------------------------------------------------------------

The thing is, QuantileDiscretizer and Bucketizer do fundamentally different things, so there are different use cases there (quantiles vs. actual values). It's more of a nuisance than anything, and an unclear parameter that seems to imply things that are not actually the case.

Here's where it *really* falls apart: if I have a bucket and I provide one split, how many buckets do I have? In Bucketizer I have none! That makes little sense. "Splits" is not the correct word here either, because they aren't splits; they're bucket boundaries. I think this is more than a documentation issue, even though the docs aren't very clear themselves:

> Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be of length greater than or equal to 3 and strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors.

I also realize I'm being a pain here :) and that this stuff is always difficult. I empathize with that; it's just that this method doesn't seem to use correct terminology or a conceptually relevant implementation for what it aims to do.

was (Author: bill_chambers):
The thing is, QuantileDiscretizer and Bucketizer do fundamentally different things, so there are different use cases there (quantiles vs. actual values). It's more of a nuisance than anything, and an unclear parameter that seems to imply things that are not actually the case.

Here's where it *really* falls apart: if I have a bucket and I provide one split, how many buckets do I have? In Bucketizer I have none! That makes no sense. "Splits" is not the correct word here either, because they aren't splits; they're bounds or containers or buckets themselves. I think this is more than a documentation issue, even though the docs aren't very clear themselves.
[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883125#comment-15883125 ]

Bill Chambers commented on SPARK-19714:
---------------------------------------

The thing is, QuantileDiscretizer and Bucketizer do fundamentally different things, so there are different use cases there (quantiles vs. actual values). It's more of a nuisance than anything, and an unclear parameter that seems to imply things that are not actually the case.

Here's where it *really* falls apart: if I have a bucket and I provide one split, how many buckets do I have? In Bucketizer I have none! That makes no sense. "Splits" is not the correct word here either, because they aren't splits; they're bounds or containers or buckets themselves. I think this is more than a documentation issue, even though the docs aren't very clear themselves:

> Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be of length greater than or equal to 3 and strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors.
[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
[ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881676#comment-15881676 ]

Bill Chambers commented on SPARK-19714:
---------------------------------------

"Invalid" is a poor descriptor, IMO; invalid should be defined as "not defined in this range". If it only applies to null/missing values, why isn't it just "handleNull" or something? A doc update would definitely help. I've got my own opinions about how this should work, but I'll leave it up to you. I'd be curious if anyone else has thoughts; maybe I'm the only one, in which case... whatever :)
[jira] [Created] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs
Bill Chambers created SPARK-19714:
----------------------------------

Summary: Bucketizer Bug Regarding Handling Unbucketed Inputs
Key: SPARK-19714
URL: https://issues.apache.org/jira/browse/SPARK-19714
Project: Spark
Issue Type: Bug
Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Bill Chambers

{code}
val contDF = spark.range(500).selectExpr("cast(id as double) as id")
import org.apache.spark.ml.feature.Bucketizer
val splits = Array(5.0, 10.0, 250.0, 500.0)
val bucketer = new Bucketizer()
  .setSplits(splits)
  .setInputCol("id")
  .setHandleInvalid("skip")
bucketer.transform(contDF).show()
{code}

You would expect this to handle the invalid buckets. However, it fails:

{code}
Caused by: org.apache.spark.SparkException: Feature value 0.0 out of Bucketizer bounds [5.0, 500.0]. Check your features, or loosen the lower/upper bound constraints.
{code}

It seems strange that handleInvalid doesn't actually handle invalid inputs. Thoughts, anyone?
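The semantics under discussion can be sketched in plain Scala (a hedged re-implementation for illustration, not Spark's actual Bucketizer): with n+1 splits there are n buckets, bucket i covers [splits(i), splits(i+1)) with the last bucket closed on the right, and a skip-style handleInvalid would drop out-of-range values rather than raise.

```scala
// Hypothetical re-implementation of the documented Bucketizer semantics.
// With splits (x0, ..., xn) there are n buckets; bucket i holds values in
// [x_i, x_{i+1}), except the last bucket, which also includes x_n.
def bucketIndex(splits: Array[Double], value: Double): Option[Int] =
  if (value < splits.head || value > splits.last) None   // outside bounds: "invalid"
  else if (value == splits.last) Some(splits.length - 2) // last bucket is closed on the right
  else Some(splits.lastIndexWhere(_ <= value))

// What the reporter expected "skip" to mean: drop out-of-range rows,
// instead of throwing as the error mode does.
def bucketize(splits: Array[Double], values: Seq[Double], handleInvalid: String): Seq[Int] =
  handleInvalid match {
    case "skip"  => values.flatMap(bucketIndex(splits, _))
    case "error" => values.map(v =>
      bucketIndex(splits, v).getOrElse(
        sys.error(s"Feature value $v out of Bucketizer bounds [${splits.head}, ${splits.last}]")))
  }
```

Under this reading of "skip", the repro's values 0.0 through 4.0 would simply be dropped rather than failing the job.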
[jira] [Commented] (SPARK-19127) Inconsistencies in dense_rank and rank documentation
[ https://issues.apache.org/jira/browse/SPARK-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809726#comment-15809726 ]

Bill Chambers commented on SPARK-19127:
---------------------------------------

https://github.com/apache/spark/pull/16505

> Inconsistencies in dense_rank and rank documentation
>
> Key: SPARK-19127
> URL: https://issues.apache.org/jira/browse/SPARK-19127
> Project: Spark
> Issue Type: Improvement
> Reporter: Bill Chambers
> Priority: Minor
>
> The docs were not updated during the change from things like denseRank to dense_rank.
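For readers of the doc fix, the semantic difference the documentation should describe can be sketched in plain Scala (hypothetical helpers operating on a sorted partition, not Spark's window-function implementation): rank leaves gaps after ties, dense_rank does not.

```scala
// rank: tied values share the lowest position; the next distinct value
// skips ahead by the number of ties.
def rank(values: Seq[Int]): Seq[Int] = {
  val sorted = values.sorted
  sorted.map(v => sorted.indexOf(v) + 1)
}

// dense_rank: tied values share a rank and the next distinct value is
// exactly one higher, with no gaps.
def denseRank(values: Seq[Int]): Seq[Int] = {
  val distinct = values.distinct.sorted
  values.sorted.map(v => distinct.indexOf(v) + 1)
}
```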
[jira] [Updated] (SPARK-19126) Join Documentation Improvements
[ https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-19126:
----------------------------------
Priority: Minor (was: Major)

> Join Documentation Improvements
>
> Key: SPARK-19126
> URL: https://issues.apache.org/jira/browse/SPARK-19126
> Project: Spark
> Issue Type: Improvement
> Reporter: Bill Chambers
> Priority: Minor
>
> - Some join types are missing (no mention of anti join)
> - Joins are labelled inconsistently, both within each language and between languages.
> - Update according to the new join spec for `crossJoin`
> Pull request coming...
[jira] [Updated] (SPARK-19127) Inconsistencies in dense_rank and rank documentation
[ https://issues.apache.org/jira/browse/SPARK-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-19127:
----------------------------------
Priority: Minor (was: Major)
[jira] [Updated] (SPARK-19127) Inconsistencies in dense_rank and rank documentation
[ https://issues.apache.org/jira/browse/SPARK-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-19127:
----------------------------------
Summary: Inconsistencies in dense_rank and rank documentation (was: Errors in Window Functions Documentation)
[jira] [Created] (SPARK-19127) Errors in Window Functions Documentation
Bill Chambers created SPARK-19127:
----------------------------------

Summary: Errors in Window Functions Documentation
Key: SPARK-19127
URL: https://issues.apache.org/jira/browse/SPARK-19127
Project: Spark
Issue Type: Improvement
Reporter: Bill Chambers

The docs were not updated during the change from things like denseRank to dense_rank.
[jira] [Commented] (SPARK-19127) Errors in Window Functions Documentation
[ https://issues.apache.org/jira/browse/SPARK-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809697#comment-15809697 ]

Bill Chambers commented on SPARK-19127:
---------------------------------------

PR coming
[jira] [Commented] (SPARK-19126) Join Documentation Improvements
[ https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15809690#comment-15809690 ]

Bill Chambers commented on SPARK-19126:
---------------------------------------

PR ready: https://github.com/apache/spark/pull/16504
[jira] [Updated] (SPARK-19126) Join Documentation Improvements
[ https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-19126:
----------------------------------
Description:
- Some join types are missing (no mention of anti join)
- Joins are labelled inconsistently, both within each language and between languages.
- Update according to the new join spec for `crossJoin`

Pull request coming...

was:
- Some join types are missing (no mention of anti join)
- Joins are labelled inconsistently, both within each language and between languages.

Pull request coming...
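One of the join types called out as missing from the docs is the anti join; its semantics can be sketched on plain collections (a hypothetical helper for illustration, not the DataFrame API): a left anti join keeps only the left-side rows that have no match on the right.

```scala
// left_anti semantics: return left-side rows whose key has no match in the
// right-side key set. (Hypothetical helper, not Spark's join machinery.)
def leftAntiJoin[K, V](left: Seq[(K, V)], rightKeys: Set[K]): Seq[(K, V)] =
  left.filterNot { case (k, _) => rightKeys.contains(k) }
```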
[jira] [Updated] (SPARK-19126) Join Documentation Improvements
[ https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-19126:
----------------------------------
Description:
- Some join types are missing (no mention of anti join)
- Joins are labelled inconsistently, both within each language and between languages.

Pull request coming...

was:
- Some join types are missing or inconsistent (no mention of anti join)

Pull request coming...
[jira] [Updated] (SPARK-19126) Join Documentation Improvements
[ https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-19126:
----------------------------------
Description:
- Some join types are missing or inconsistent (no mention of anti join)

Pull request coming...

was:
Pull request coming... Some join types are missing or inconsistent.

Summary: Join Documentation Improvements (was: Join Documentation Incomplete)
[jira] [Created] (SPARK-19126) Join Documentation Incomplete
Bill Chambers created SPARK-19126:
----------------------------------

Summary: Join Documentation Incomplete
Key: SPARK-19126
URL: https://issues.apache.org/jira/browse/SPARK-19126
Project: Spark
Issue Type: Improvement
Reporter: Bill Chambers

Pull request coming... Some join types are missing or inconsistent.
[jira] [Resolved] (SPARK-18424) Single Function for Parsing Dates and Times with Formats
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers resolved SPARK-18424.
-----------------------------------
Resolution: Duplicate

This is a duplicate of SPARK-16609. Work will continue there.

> Single Function for Parsing Dates and Times with Formats
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
> Issue Type: Improvement
> Reporter: Bill Chambers
> Assignee: Bill Chambers
> Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark; it can be hard to reason about the time format and what type you're working with. For instance, say that I have a date in the format:
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions:
> {code}
> to_date(
>   unix_timestamp(col("date"), dateFormat)
>     .cast("timestamp"))
>   .alias("date")
> {code}
> I propose simplifying this by keeping the existing to_date function but adding one that accepts a format for that date. I also propose a to_timestamp function that likewise supports a format, so that you can avoid the above conversion entirely.
> It's also worth mentioning that many other databases support this. For instance, MySQL has the STR_TO_DATE function, and Netezza supports the to_timestamp semantic.
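Per value, the proposed to_date(column, format) boils down to format-aware parsing. A minimal java.time sketch of the idea (the pattern letters here are java.time's, and this is an illustration, not Spark's implementation): "2017-20-12" under pattern "yyyy-dd-MM" parses directly to December 20, 2017, with no unix_timestamp/cast round trip.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Format-aware date parsing: the caller supplies the pattern, so the
// Y-D-M ordering from the example above is handled in one step.
def parseDate(s: String, pattern: String): LocalDate =
  LocalDate.parse(s, DateTimeFormatter.ofPattern(pattern))
```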
[jira] [Commented] (SPARK-16609) Single function for parsing timestamps/dates
[ https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671632#comment-15671632 ]

Bill Chambers commented on SPARK-16609:
---------------------------------------

I am working on this.

> Single function for parsing timestamps/dates
>
> Key: SPARK-16609
> URL: https://issues.apache.org/jira/browse/SPARK-16609
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Reynold Xin
>
> Today, if you want to parse a date or timestamp, you have to use the unix time function and then cast to a timestamp. It's a little odd there isn't a single function that does both. I propose we add:
> {code}
> to_date(<col>, <format>) / to_timestamp(<col>, <format>)
> {code}
> For reference, in other systems there are:
> MS SQL: {{convert(<type>, <col>)}}. See: https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx
> Netezza: {{to_timestamp(<col>, <format>)}}. See: https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html
> Teradata has special casting functionality: {{cast(<col> as timestamp format '<format>')}}
> MySQL: {{STR_TO_DATE(<col>, <format>)}}. This returns a datetime when you define both date and time parts. See: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html
[jira] [Updated] (SPARK-18424) Single Function for Parsing Dates and Times with Formats
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-18424:
----------------------------------
Summary: Single Function for Parsing Dates and Times with Formats (was: Single Funct)
[jira] [Updated] (SPARK-18424) Single Funct
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Chambers updated SPARK-18424:
----------------------------------
Summary: Single Funct (was: Improve Date Parsing Semantics & Functionality)
[jira] [Updated] (SPARK-18424) Improve Date Parsing Semantics & Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18424: -- Summary: Improve Date Parsing Semantics & Functionality (was: Improve Date Parsing Functionality)
[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660291#comment-15660291 ] Bill Chambers edited comment on SPARK-18424 at 11/12/16 10:09 PM: -- For the record I would like to work on this one. Define Function here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Register Function here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala Add tests: Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala was (Author: bill_chambers): For the record I would like to work on this one. Define Function here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Register Function here: ? 
Add tests: Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala
[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660291#comment-15660291 ] Bill Chambers edited comment on SPARK-18424 at 11/12/16 9:31 PM: - For the record I would like to work on this one. Define Function here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala Register Function here: ? Add tests: Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala was (Author: bill_chambers): For the record I would like to work on this one. It seems that I will have to add some tests: Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala
[jira] [Comment Edited] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660291#comment-15660291 ] Bill Chambers edited comment on SPARK-18424 at 11/12/16 9:30 PM: - For the record I would like to work on this one. It seems that I will have to add some tests: Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala Here: https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala was (Author: bill_chambers): For the record I would like to work on this one.
[jira] [Commented] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660291#comment-15660291 ] Bill Chambers commented on SPARK-18424: --- For the record I would like to work on this one.
[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18424: -- Summary: Improve Date Parsing Functionality (was: Cumbersome Date Manipulation)
[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality
[ https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18424: -- Description: I've found it quite cumbersome to work with dates thus far in Spark, it can be hard to reason about the timeformat and what type you're working with, for instance: say that I have a date in the format {code} 2017-20-12 // Y-D-M {code} In order to parse that into a Date, I have to perform several conversions. {code} to_date( unix_timestamp(col("date"), dateFormat) .cast("timestamp")) .alias("date") {code} I propose simplifying this by adding a to_date function (exists) but adding one that accepts a format for that date. I also propose a to_timestamp function that also supports a format. so that you can avoid entirely the above conversion. It's also worth mentioning that many other databases support this. For instance, mysql has the STR_TO_DATE function, netezza supports the to_timestamp semantic. was: I've found it quite cumbersome to work with dates thus far in Spark, it can be hard to reason about the timeformat and what type you're working with, for instance: say that I have a date in the format {code} 2017-20-12 // Y-D-M {code} In order to parse that into a Date, I have to perform several conversions. {code} to_date( unix_timestamp(col("date"), dateFormat) .cast("timestamp")) .alias("date") {code} I propose simplifying this by adding a to_date function (exists) but adding one that accepts a format for that date. so that you can avoid entirely the above conversion. 
[jira] [Created] (SPARK-18424) Cumbersome Date Manipulation
Bill Chambers created SPARK-18424: - Summary: Cumbersome Date Manipulation Key: SPARK-18424 URL: https://issues.apache.org/jira/browse/SPARK-18424 Project: Spark Issue Type: Improvement Reporter: Bill Chambers Priority: Minor I've found it quite cumbersome to work with dates thus far in Spark, it can be hard to reason about the timeformat and what type you're working with, for instance: say that I have a date in the format {code} 2017-20-12 // Y-D-M {code} In order to parse that into a Date, I have to perform several conversions. {code} to_date( unix_timestamp(col("date"), dateFormat) .cast("timestamp")) .alias("date") {code} I propose simplifying this by adding a to_date function (exists) but adding one that accepts a format for that date. so that you can avoid entirely the above conversion.
[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Methods
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18365: -- Description: The documentation for sample is a little unintuitive. It was difficult to understand why I wasn't getting exactly the fraction specified of my total DataFrame rows. The PR clarifies the documentation for Scala, Python, and R to explain that that is expected behavior. (was: The parameter documentation is switched. PR coming shortly.)
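The behavior the documentation fix explains — that `sample(fraction)` flips a per-row coin rather than returning an exact slice — can be sketched without Spark at all. This is a plain-Python illustration of Bernoulli sampling semantics, not Spark's implementation:

```python
import random

# Spark's DataFrame.sample(fraction) performs an independent Bernoulli
# trial per row, so the result size is only *approximately*
# fraction * count. The same semantics in pure Python:
random.seed(42)
rows = list(range(10_000))
fraction = 0.1

sampled = [r for r in rows if random.random() < fraction]

# The sampled size lands close to 1000, but is rarely exactly 1000 --
# which is exactly what surprised users of the original docs.
assert 800 < len(sampled) < 1200
```

Callers who need an exact row count would take an approximate sample and trim it (or use `limit` after shuffling), rather than expecting `sample` to be exact.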
[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Methods
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18365: -- Summary: Improve Documentation for Sample Methods (was: Improve Documentation for Sample Method)
[jira] [Updated] (SPARK-18365) Improve Documentation for Sample Method
[ https://issues.apache.org/jira/browse/SPARK-18365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-18365: -- Summary: Improve Documentation for Sample Method (was: Documentation for Sampling is Incorrect)
[jira] [Created] (SPARK-18365) Documentation for Sampling is Incorrect
Bill Chambers created SPARK-18365: - Summary: Documentation for Sampling is Incorrect Key: SPARK-18365 URL: https://issues.apache.org/jira/browse/SPARK-18365 Project: Spark Issue Type: Bug Reporter: Bill Chambers The parameter documentation is switched. PR coming shortly.
[jira] [Updated] (SPARK-16234) Speculative Task may not be able to overwrite file
[ https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-16234: -- Description: resolved... (was: given spark.speculative set to true, I'm running a large spark job with parquet and savemode overwrite. Spark will speculatively try to create a task to deal with a straggler. However, doing this comes with risk because EVEN THOUGH savemode overwrite is selected, if the straggler completes before the original task or the original task completes before the straggler then the job will fail due to the file already existing. java.io.IOException: /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet already exists)
[jira] [Closed] (SPARK-16234) Speculative Task may not be able to overwrite file
[ https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers closed SPARK-16234. - Resolution: Resolved
[jira] [Updated] (SPARK-16234) Speculative Task may not be able to overwrite file
[ https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-16234: -- Description: given spark.speculative set to true, I'm running a large spark job with parquet and savemode overwrite. Spark will speculatively try to create a task to deal with a straggler. However, doing this comes with risk because EVEN THOUGH savemode overwrite is selected, if the straggler completes before the original task or the original task completes before the straggler then the job will fail due to the file already existing. java.io.IOException: /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet already exists was: given spark.speculative set to true, I'm running a large spark job with parquet and savemode overwrite. Spark will speculatively try to create a task to deal with this straggler. However, doing this comes with risk because EVEN THOUGH savemode overwrite is selected, if the straggler completes before the original task or the original task completes before the straggler then the job will fail due to the file already existing. java.io.IOException: /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet already exists
[jira] [Created] (SPARK-16234) Speculative Task may not be able to overwrite file
Bill Chambers created SPARK-16234: - Summary: Speculative Task may not be able to overwrite file Key: SPARK-16234 URL: https://issues.apache.org/jira/browse/SPARK-16234 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Bill Chambers given spark.speculation set to true, I'm running a large spark job with parquet and savemode overwrite. Spark will speculatively try to create a task to deal with a straggler. However, doing this comes with risk because EVEN THOUGH savemode overwrite is selected, if the straggler completes before the original task or the original task completes before the straggler then the job will fail due to the file already existing. java.io.IOException: /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet already exists
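The race described above — an original task and its speculative copy both writing the same part-file — comes down to two writers attempting exclusive creation of one path. A minimal stdlib sketch of that failure mode (the file name is illustrative, not the real Spark output path):

```python
import os
import tempfile

# Two attempts (original task and its speculative duplicate) both try to
# create the same output part-file. Exclusive-create mode ("x") makes the
# second attempt fail, analogous to the reported
# java.io.IOException: ... part-r-00049-....gz.parquet already exists
part = os.path.join(tempfile.mkdtemp(), "part-r-00049.gz.parquet")

with open(part, "x") as f:          # first writer wins the race
    f.write("data")

try:
    with open(part, "x") as f:      # second (speculative) writer loses
        f.write("data")
    raised = False
except FileExistsError:
    raised = True

assert raised
```

This is why overwrite semantics at the *job* level do not protect two concurrent tasks within the same job from colliding on one file.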
[jira] [Comment Edited] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
[ https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351686#comment-15351686 ] Bill Chambers edited comment on SPARK-16220 at 6/27/16 7:53 PM: happy to take a look when it's all done :) was (Author: bill_chambers): [~hvanhovell] I imagine I should just resolve this as well?
[jira] [Commented] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
[ https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351686#comment-15351686 ] Bill Chambers commented on SPARK-16220: --- [~hvanhovell] I imagine I should just resolve this as well?
[jira] [Created] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
Bill Chambers created SPARK-16220: - Summary: Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality Key: SPARK-16220 URL: https://issues.apache.org/jira/browse/SPARK-16220 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.0.0, 2.0.1, 2.1.0 Reporter: Bill Chambers After discussing this with [~marmbrus] and [~rxin], we've decided to revert SPARK-15663. After doing some research it seems like this is an unnecessary departure from 1.X functionality and does not have a reasonable substitute that gives the same functionality. The first step is to revert the change. After doing that, there are a couple of different approaches to getting at user defined functions. 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does this) 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar) 4. SHOW FUNCTIONS + some column to designate whether it's system defined or user defined. 1. This aligns with previous functionality and then supplements it with something a bit more specific. 2. Is unclear because "all" is ambiguous: it's not obvious why the default would refer only to user defined functions. This doesn't seem like the right approach. 3. Same kind of issue; I'm not sure why the user functions should be the default over the system functions. That doesn't seem like the correct approach. 4. This one seems nice because it largely achieves #1 and keeps existing functionality, but then supplements it with some more. This also allows you, for example, to create your own set of date functions and then search them all in one go, as opposed to searching system and then user functions. This would have to return two columns, though, which could potentially be an issue?
[jira] [Commented] (SPARK-16077) Python UDF may fail because of six
[ https://issues.apache.org/jira/browse/SPARK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340459#comment-15340459 ] Bill Chambers commented on SPARK-16077: --- [~davies] was this one the one that I had reported? > Python UDF may fail because of six > -- > > Key: SPARK-16077 > URL: https://issues.apache.org/jira/browse/SPARK-16077 > Project: Spark > Issue Type: Bug > Components: PySpark > Reporter: Davies Liu > > six or other package may break pickle.whichmodule() in pickle: > https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling
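`pickle.whichmodule()` is the helper at issue: given an object and a name, it prefers the object's `__module__` and otherwise scans every entry in `sys.modules` for a module exposing that attribute. That scan is why merely importing a package with unusual lazy module objects (as six provided, per the linked bug) could interfere with pickling unrelated objects. A small demonstration of the lookup itself, with an illustrative module name:

```python
import pickle
import sys
import types

# Fast path: whichmodule prefers obj.__module__ when it is set.
assert pickle.whichmodule(len, "len") == "builtins"

# Fallback path: with no usable __module__, whichmodule scans sys.modules
# for a module whose attribute `name` is the object. Every registered
# module is probed -- the surface the six bug lived on.
demo = types.ModuleType("whichmodule_demo")  # illustrative module name

class Anon:
    pass

Anon.__module__ = None       # force the sys.modules scan fallback
obj = Anon()
demo.thing = obj
sys.modules["whichmodule_demo"] = demo
try:
    assert pickle.whichmodule(obj, "thing") == "whichmodule_demo"
finally:
    del sys.modules["whichmodule_demo"]
```

Because the scan touches arbitrary modules, a module whose attribute access raises an unexpected exception type can abort pickling even though it has nothing to do with the object being serialized.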
[jira] [Created] (SPARK-15264) Spark 2.0 CSV Reader: Error on Blank Column Names
Bill Chambers created SPARK-15264: - Summary: Spark 2.0 CSV Reader: Error on Blank Column Names Key: SPARK-15264 URL: https://issues.apache.org/jira/browse/SPARK-15264 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Bill Chambers When you read in a CSV file whose header row starts with blank column names, the read fails if you specify that you want a header. Pull request coming shortly.
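The shape of a possible fix can be sketched in plain Python. The `_cN` placeholder convention mirrors the names Spark gives unnamed columns; the helper itself is hypothetical, not Spark's CSV reader:

```python
import csv
import io

def read_header(text):
    """Read a CSV header row, replacing blank names with positional
    placeholders instead of failing (Spark names unnamed columns _cN)."""
    header = next(csv.reader(io.StringIO(text)))
    return [name if name else f"_c{i}" for i, name in enumerate(header)]

# Header row with a blank first and last column name.
columns = read_header(",name,\n1,foo,3\n")
```

Substituting placeholders keeps every column addressable while preserving the names that were actually present.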
[jira] [Commented] (SPARK-14708) Repl Serialization Issue
[ https://issues.apache.org/jira/browse/SPARK-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246230#comment-15246230 ] Bill Chambers commented on SPARK-14708: --- cc:[~joshrosen] > Repl Serialization Issue > > > Key: SPARK-14708 > URL: https://issues.apache.org/jira/browse/SPARK-14708 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Bill Chambers >Priority: Critical > > Run this code 6 times with the :paste command in Spark. You'll see > exponential slowdowns. > class IntWrapper(val i: Int) extends Serializable { } > var pairs = sc.parallelize(Array((0, new IntWrapper(0)))) > for (_ <- 0 until 3) { > val wrapper = pairs.values.reduce((x,_) => x) > pairs = pairs.mapValues(_ => wrapper) > } > val result = pairs.collect() > https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html
[jira] [Updated] (SPARK-14708) Repl Serialization Issue
[ https://issues.apache.org/jira/browse/SPARK-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-14708: -- Description: Run this code 6 times with the :paste command in Spark. You'll see exponential slowdowns. class IntWrapper(val i: Int) extends Serializable { } var pairs = sc.parallelize(Array((0, new IntWrapper(0)))) for (_ <- 0 until 3) { val wrapper = pairs.values.reduce((x,_) => x) pairs = pairs.mapValues(_ => wrapper) } val result = pairs.collect() https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html was: Run this code 6 times with the :paste command in Spark. You'll see exponential slowdowns. class IntWrapper(val i: Int) extends Serializable { } var pairs = sc.parallelize(Array((0, new IntWrapper(0)))) for (_ <- 0 until 3) { val wrapper = pairs.values.reduce((x,_) => x) pairs = pairs.mapValues(_ => wrapper) } val result = pairs.collect() https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html > Repl Serialization Issue > > > Key: SPARK-14708 > URL: https://issues.apache.org/jira/browse/SPARK-14708 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Bill Chambers >Priority: Critical > > Run this code 6 times with the :paste command in Spark. You'll see > exponential slowdowns. > class IntWrapper(val i: Int) extends Serializable { } > var pairs = sc.parallelize(Array((0, new IntWrapper(0)))) > for (_ <- 0 until 3) { > val wrapper = pairs.values.reduce((x,_) => x) > pairs = pairs.mapValues(_ => wrapper) > } > val result = pairs.collect() > https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html
[jira] [Created] (SPARK-14708) Repl Serialization Issue
Bill Chambers created SPARK-14708: - Summary: Repl Serialization Issue Key: SPARK-14708 URL: https://issues.apache.org/jira/browse/SPARK-14708 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Bill Chambers Priority: Critical Run this code 6 times with the :paste command in Spark. You'll see exponential slowdowns. class IntWrapper(val i: Int) extends Serializable { } var pairs = sc.parallelize(Array((0, new IntWrapper(0)))) for (_ <- 0 until 3) { val wrapper = pairs.values.reduce((x,_) => x) pairs = pairs.mapValues(_ => wrapper) } val result = pairs.collect() https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html
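The pathology can be rephrased in plain Python: when each iteration's result keeps a reference back to the previous iteration's object (as REPL-generated wrapper objects can through captured line state), serializing the newest value drags the whole chain along, so the payload grows with every run. A toy sketch of that reference-chain growth, not Spark's serialization path (growth here is linear; the REPL's nested line objects compound it further):

```python
import pickle

class Wrapper:
    def __init__(self, value, prev=None):
        self.value = value
        self.prev = prev  # back-reference, like captured REPL state

w = Wrapper(0)
sizes = []
for _ in range(5):
    # Each new wrapper holds the last one, so pickling it
    # pickles the entire chain behind it.
    w = Wrapper(w.value, prev=w)
    sizes.append(len(pickle.dumps(w)))
```

Breaking the back-reference (e.g. storing only the extracted value) keeps the serialized size flat across iterations.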
[jira] [Created] (SPARK-13214) Update docs to reflect dynamicAllocation to be true
Bill Chambers created SPARK-13214: - Summary: Update docs to reflect dynamicAllocation to be true Key: SPARK-13214 URL: https://issues.apache.org/jira/browse/SPARK-13214 Project: Spark Issue Type: Documentation Reporter: Bill Chambers Priority: Trivial
[jira] [Updated] (SPARK-13214) Fix dynamic allocation docs
[ https://issues.apache.org/jira/browse/SPARK-13214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-13214: -- Description: Update docs to reflect that dynamic allocation is available for all cluster managers Summary: Fix dynamic allocation docs (was: Update docs to reflect dynamicAllocation to be true) > Fix dynamic allocation docs > --- > > Key: SPARK-13214 > URL: https://issues.apache.org/jira/browse/SPARK-13214 > Project: Spark > Issue Type: Documentation >Reporter: Bill Chambers >Priority: Trivial > > Update docs to reflect that dynamic allocation is available for all cluster > managers
[jira] [Comment Edited] (SPARK-11964) Create user guide section explaining export/import
[ https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035480#comment-15035480 ] Bill Chambers edited comment on SPARK-11964 at 12/2/15 8:52 AM: Quick question: am I to assume that all pieces mentioned in this JIRA: https://issues.apache.org/jira/browse/SPARK-6725 are to be included, even those that are unresolved, in the new release [and the user guide]? was (Author: bill_chambers): Quick question: am I to assume that all pieces mentioned in this JIRA: https://issues.apache.org/jira/browse/SPARK-6725 are to be included in the new release [and the user guide]? > Create user guide section explaining export/import > -- > > Key: SPARK-11964 > URL: https://issues.apache.org/jira/browse/SPARK-11964 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley > > I'm envisioning a single section in the main guide explaining how it works > with an example and noting major missing coverage.
[jira] [Commented] (SPARK-11964) Create user guide section explaining export/import
[ https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035480#comment-15035480 ] Bill Chambers commented on SPARK-11964: --- Quick question: am I to assume that all pieces mentioned in this JIRA: https://issues.apache.org/jira/browse/SPARK-6725 are to be included in the new release [and the user guide]? > Create user guide section explaining export/import > -- > > Key: SPARK-11964 > URL: https://issues.apache.org/jira/browse/SPARK-11964 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley > > I'm envisioning a single section in the main guide explaining how it works > with an example and noting major missing coverage.
[jira] [Comment Edited] (SPARK-11964) Create user guide section explaining export/import
[ https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034178#comment-15034178 ] Bill Chambers edited comment on SPARK-11964 at 12/1/15 8:27 PM: Happy to help out with this. Should this belong in a new file or should it just be a part of one that already exists? https://github.com/apache/spark/tree/master/docs -Since [pmml-export|https://github.com/apache/spark/blob/master/docs/mllib-pmml-model-export.md] is its own file, it seems to me that in the guide it might be best to just have a new file, and they would follow one another in [the guide|https://github.com/apache/spark/blob/master/docs/mllib-guide.md]. However, I defer to your judgement! Let me know and I'll try to get it written up today.- It seems like the best place might actually be at the bottom of the ML guide since all of this just refers to the ML API. https://github.com/apache/spark/blob/master/docs/ml-guide.md was (Author: bill_chambers): Happy to help out with this. Should this belong in a new file or should it just be a part of one that already exists? https://github.com/apache/spark/tree/master/docs Since [pmml-export|https://github.com/apache/spark/blob/master/docs/mllib-pmml-model-export.md] is its own file, it seems to me that in the guide it might be best to just have a new file, and they would follow one another in [the guide|https://github.com/apache/spark/blob/master/docs/mllib-guide.md]. However, I defer to your judgement! Let me know and I'll try to get it written up today. > Create user guide section explaining export/import > -- > > Key: SPARK-11964 > URL: https://issues.apache.org/jira/browse/SPARK-11964 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley > > I'm envisioning a single section in the main guide explaining how it works > with an example and noting major missing coverage.
[jira] [Commented] (SPARK-11964) Create user guide section explaining export/import
[ https://issues.apache.org/jira/browse/SPARK-11964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034178#comment-15034178 ] Bill Chambers commented on SPARK-11964: --- Happy to help out with this. Should this belong in a new file or should it just be a part of one that already exists? https://github.com/apache/spark/tree/master/docs Since [pmml-export|https://github.com/apache/spark/blob/master/docs/mllib-pmml-model-export.md] is its own file, it seems to me that in the guide it might be best to just have a new file, and they would follow one another in [the guide|https://github.com/apache/spark/blob/master/docs/mllib-guide.md]. However, I defer to your judgement! Let me know and I'll try to get it written up today. > Create user guide section explaining export/import > -- > > Key: SPARK-11964 > URL: https://issues.apache.org/jira/browse/SPARK-11964 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley > > I'm envisioning a single section in the main guide explaining how it works > with an example and noting major missing coverage.
[jira] [Commented] (SPARK-7130) spark.ml RandomForest* should always do bootstrapping
[ https://issues.apache.org/jira/browse/SPARK-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994307#comment-14994307 ] Bill Chambers commented on SPARK-7130: -- Looking at this issue, the change needs to occur within the [RandomForest file|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala], specifically around [lines 88|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88] and 91. I'd like to submit a pull request but want to make sure that there's nothing else I need to be aware of! > spark.ml RandomForest* should always do bootstrapping > - > > Key: SPARK-7130 > URL: https://issues.apache.org/jira/browse/SPARK-7130 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, spark.ml RandomForest does not do bootstrapping if numTrees = 1. > For consistency and a simpler API, it should always do bootstrapping. The > current behavior is an artifact of the old API, in which RandomForest and > DecisionTree share the same implementation. This change should happen after > the implementation is moved to spark.ml (which we need to do so that the > implementation can be generalized).
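For background, the bootstrapping in question is sampling the training rows with replacement before growing each tree; the proposal is to apply it even when numTrees = 1. A language-agnostic sketch of that step in plain Python (not Spark's implementation, which samples via Poisson weights for distributed data):

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the per-tree bootstrap
    step of a random forest. Applying it unconditionally (even for a
    single tree) is what SPARK-7130 proposes, for API consistency."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)
sample = bootstrap_sample(list(range(10)), rng)
```

A bootstrap sample has the same size as the input but typically repeats some rows and omits others, which is what gives each tree a slightly different view of the data.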
[jira] [Comment Edited] (SPARK-7130) spark.ml RandomForest* should always do bootstrapping
[ https://issues.apache.org/jira/browse/SPARK-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994307#comment-14994307 ] Bill Chambers edited comment on SPARK-7130 at 11/6/15 7:39 PM: --- Looking at this issue, the change needs to occur within the [RandomForest file|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala], specifically around [lines 88|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88] and 91. I'd like to submit a pull request but want to make sure that there's nothing else I need to be aware of! Is there anything else that needs to change? was (Author: bill_chambers): Looking at this issue, the change needs to occur within the [RandomForest file|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala], specifically around [lines 88|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88] and 91. I'd like to submit a pull request but want to make sure that there's nothing else I need to be aware of! > spark.ml RandomForest* should always do bootstrapping > - > > Key: SPARK-7130 > URL: https://issues.apache.org/jira/browse/SPARK-7130 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, spark.ml RandomForest does not do bootstrapping if numTrees = 1. > For consistency and a simpler API, it should always do bootstrapping. The > current behavior is an artifact of the old API, in which RandomForest and > DecisionTree share the same implementation. This change should happen after > the implementation is moved to spark.ml (which we need to do so that the > implementation can be generalized).
[jira] [Comment Edited] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741715#comment-14741715 ] Bill Chambers edited comment on SPARK-10528 at 9/11/15 11:17 PM: - This came up for me when I used the spark_ec2 launcher. When I tried to enter the spark shell I received the same error on AWS. Running: ephemeral-hdfs/bin/hadoop fs -chmod 777 /tmp/hive allowed the SQLContext to get pulled in and created correctly. It's a workaround for now, but something that might want to be fixed in the future. was (Author: bill_chambers): This came up for me when I used the spark_ec2 launcher. When I tried to enter the spark shell I received the same error. Running: ephemeral-hdfs/bin/hadoop fs -chmod 777 /tmp/hive allowed the SQLContext to get pulled in and created correctly. It's a workaround for now, but something that might want to be fixed in the future. > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
[jira] [Commented] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741715#comment-14741715 ] Bill Chambers commented on SPARK-10528: --- This came up for me when I used the spark_ec2 launcher. When I tried to enter the spark shell I received the same error. Running: ephemeral-hdfs/bin/hadoop fs -chmod 777 /tmp/hive allowed the SQLContext to get pulled in and created correctly. It's a workaround for now, but something that might want to be fixed in the future. > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- 
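The workaround generalizes to any deployment: widen the Hive scratch directory's permissions before starting the shell. The sketch below simulates it on a local directory, since the hadoop binary location varies by install (ephemeral-hdfs/bin is specific to spark-ec2 clusters; the mktemp path here is just a stand-in for /tmp/hive):

```shell
# On a real spark-ec2 cluster the equivalent would be:
#   ephemeral-hdfs/bin/hadoop fs -chmod 777 /tmp/hive
# Here we widen a local stand-in directory to mode 777 so the
# "root scratch dir should be writable" check would pass.
scratch="$(mktemp -d)/hive"
mkdir -p "$scratch"
chmod 777 "$scratch"
stat -c '%a' "$scratch"
```

Note that mode 777 is the blunt fix the error message asks for; tightening it later (or pointing hive.exec.scratchdir elsewhere) is preferable on shared clusters.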