[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927539#comment-15927539 ] Kay Ousterhout commented on SPARK-19803: Awesome thanks! > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Shubham Chopra > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19968) Use a cached instance of KafkaProducer for writing to kafka via KafkaSink.
[ https://issues.apache.org/jira/browse/SPARK-19968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-19968: Description: KafkaProducer is thread safe and an instance can be reused for writing every batch out. According to Kafka docs, this sort of usage is encouraged. On an average an addBatch operation takes 25ms with this patch and 250+ ms without this patch. TODO: Post my results from benchmark comparison. was: KafkaProducer is thread safe and an instance can be reused for writing every batch out. According to Kafka docs, this sort of usage is encouraged. TODO: Post my results from benchmark comparison. > Use a cached instance of KafkaProducer for writing to kafka via KafkaSink. > -- > > Key: SPARK-19968 > URL: https://issues.apache.org/jira/browse/SPARK-19968 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma > Labels: kafka > > KafkaProducer is thread safe and an instance can be reused for writing every > batch out. According to Kafka docs, this sort of usage is encouraged. > On an average an addBatch operation takes 25ms with this patch and 250+ ms > without this patch. > TODO: Post my results from benchmark comparison. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
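For readers unfamiliar with the pattern proposed in SPARK-19968: KafkaProducer is documented as thread safe, so a sink can keep one producer per configuration and reuse it for every batch instead of constructing a new one each time. The sketch below is illustrative only and is not the KafkaSink implementation; the object name, the method name, and keying the cache on the stringified producer properties are assumptions made for this example.

{code}
import java.util.Properties
import scala.collection.mutable
import org.apache.kafka.clients.producer.KafkaProducer

// Illustrative sketch, not Spark's actual code: cache one producer per distinct
// configuration and hand the same instance back for every batch.
object CachedProducer {
  private val cache = mutable.HashMap.empty[String, KafkaProducer[Array[Byte], Array[Byte]]]

  def getOrCreate(params: Properties): KafkaProducer[Array[Byte], Array[Byte]] = synchronized {
    // Keying on params.toString is a simplification for this example.
    cache.getOrElseUpdate(params.toString,
      new KafkaProducer[Array[Byte], Array[Byte]](params))
  }

  // Each row of a batch is then written through the shared producer, e.g.
  //   producer.send(new ProducerRecord[Array[Byte], Array[Byte]](topic, key, value))
}
{code}

Creating a fresh producer per batch pays for connection setup and metadata fetches on every write, which is consistent with the ~250 ms vs ~25 ms per-addBatch numbers reported above.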
[jira] [Assigned] (SPARK-19968) Use a cached instance of KafkaProducer for writing to kafka via KafkaSink.
[ https://issues.apache.org/jira/browse/SPARK-19968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reassigned SPARK-19968: --- Assignee: Prashant Sharma > Use a cached instance of KafkaProducer for writing to kafka via KafkaSink. > -- > > Key: SPARK-19968 > URL: https://issues.apache.org/jira/browse/SPARK-19968 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma > Labels: kafka > > KafkaProducer is thread safe and an instance can be reused for writing every > batch out. According to Kafka docs, this sort of usage is encouraged. > TODO: Post my results from benchmark comparison. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19968) Use a cached instance of KafkaProducer for writing to kafka via KafkaSink.
Prashant Sharma created SPARK-19968: --- Summary: Use a cached instance of KafkaProducer for writing to kafka via KafkaSink. Key: SPARK-19968 URL: https://issues.apache.org/jira/browse/SPARK-19968 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Prashant Sharma KafkaProducer is thread safe and an instance can be reused for writing every batch out. According to Kafka docs, this sort of usage is encouraged. TODO: Post my results from benchmark comparison. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19964) SparkSubmitSuite fails due to Timeout
[ https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927510#comment-15927510 ] Sean Owen commented on SPARK-19964: --- It doesn't fail in master. This sounds like the kind of thing that fails when you have old builds lying around - could be local? > SparkSubmitSuite fails due to Timeout > - > > Key: SPARK-19964 > URL: https://issues.apache.org/jira/browse/SPARK-19964 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 2.2.0 >Reporter: Eren Avsarogullari > Labels: flaky-test > Attachments: SparkSubmitSuite_Stacktrace > > > The following test case has been failed due to TestFailedDueToTimeoutException > *Test Suite:* SparkSubmitSuite > *Test Case:* includes jars passed in through --packages > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ > *Stacktrace is also attached.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19967) Add from_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19967: Description: The method "from_json" is a useful method in turning a string column into a nested StructType with a user specified schema. The schema should be specified in the DDL format was:The method "from_json" is a useful method in turning a string column into a nested StructType with a user specified schema. The schema should be specified in the DDL format for declaring the table schema. > Add from_json APIs to SQL > - > > Key: SPARK-19967 > URL: https://issues.apache.org/jira/browse/SPARK-19967 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > The method "from_json" is a useful method in turning a string column into a > nested StructType with a user specified schema. The schema should be > specified in the DDL format -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
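As a concrete illustration of the description above: with this change, from_json would be callable from SQL and would take its schema as a DDL-format string rather than a JSON one. The snippet below is a hypothetical usage sketch written against a spark-shell style SparkSession named spark; the JSON literal and column names are invented, and the exact SQL signature is an assumption until the feature lands.

{code}
// Hypothetical SQL usage of from_json with a DDL-format schema string.
spark.sql(
  """SELECT from_json('{"id": 1, "tags": ["a", "b"]}', 'id INT, tags ARRAY<STRING>') AS parsed"""
).show(false)

// The existing DataFrame API takes a StructType; the SQL form above would be the
// string-schema counterpart of something like:
//   import org.apache.spark.sql.functions.from_json
//   import org.apache.spark.sql.types._
//   df.select(from_json($"json", new StructType().add("id", IntegerType)))
{code}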
[jira] [Closed] (SPARK-19898) Add from_json in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-19898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li closed SPARK-19898. --- Resolution: Duplicate > Add from_json in FunctionRegistry > - > > Key: SPARK-19898 > URL: https://issues.apache.org/jira/browse/SPARK-19898 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In SPARK-19637, `to_json` has been added in FunctionRegistry. As for > `from_json`, we do not add yet because how users specify schema as SQL > strings is not clear (See https://github.com/apache/spark/pull/16981 for > related discussion). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927485#comment-15927485 ] Takeshi Yamamuro commented on SPARK-19967: -- oh, looks great! Thanks! > Add from_json APIs to SQL > - > > Key: SPARK-19967 > URL: https://issues.apache.org/jira/browse/SPARK-19967 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > The method "from_json" is a useful method in turning a string column into a > nested StructType with a user specified schema. The schema should be > specified in the DDL format for declaring the table schema. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927478#comment-15927478 ] Xiao Li commented on SPARK-19967: - See the PR https://github.com/apache/spark/pull/17171 It is just merged very recently. > Add from_json APIs to SQL > - > > Key: SPARK-19967 > URL: https://issues.apache.org/jira/browse/SPARK-19967 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > The method "from_json" is a useful method in turning a string column into a > nested StructType with a user specified schema. The schema should be > specified in the DDL format for declaring the table schema. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927476#comment-15927476 ] Takeshi Yamamuro commented on SPARK-19967: -- Finally, what's the DDL format for schema? In my previous pr, I just used the format parsed by DataType.fromJson() https://github.com/apache/spark/compare/master...maropu:SPARK-19637-2 > Add from_json APIs to SQL > - > > Key: SPARK-19967 > URL: https://issues.apache.org/jira/browse/SPARK-19967 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > The method "from_json" is a useful method in turning a string column into a > nested StructType with a user specified schema. The schema should be > specified in the DDL format for declaring the table schema. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927466#comment-15927466 ] Takeshi Yamamuro commented on SPARK-19967: -- Yea, sure! I'll make a pr in a day! Thanks. BTW, I already filed a JIRA for this a few days ago: https://issues.apache.org/jira/browse/SPARK-19898 > Add from_json APIs to SQL > - > > Key: SPARK-19967 > URL: https://issues.apache.org/jira/browse/SPARK-19967 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > The method "from_json" is a useful method in turning a string column into a > nested StructType with a user specified schema. The schema should be > specified in the DDL format for declaring the table schema. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19967) Add from_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927462#comment-15927462 ] Xiao Li commented on SPARK-19967: - cc [~maropu] > Add from_json APIs to SQL > - > > Key: SPARK-19967 > URL: https://issues.apache.org/jira/browse/SPARK-19967 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > The method "from_json" is a useful method in turning a string column into a > nested StructType with a user specified schema. The schema should be > specified in the DDL format for declaring the table schema. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19967) Add from_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19967: Description: The method "from_json" is a useful method in turning a string column into a nested StructType with a user specified schema. The schema should be specified in the DDL format for declaring the table schema. > Add from_json APIs to SQL > - > > Key: SPARK-19967 > URL: https://issues.apache.org/jira/browse/SPARK-19967 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > The method "from_json" is a useful method in turning a string column into a > nested StructType with a user specified schema. The schema should be > specified in the DDL format for declaring the table schema. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19967) Add from_json APIs to SQL
Xiao Li created SPARK-19967: --- Summary: Add from_json APIs to SQL Key: SPARK-19967 URL: https://issues.apache.org/jira/browse/SPARK-19967 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19830) Add parseTableSchema API to ParserInterface
[ https://issues.apache.org/jira/browse/SPARK-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19830. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17171 [https://github.com/apache/spark/pull/17171] > Add parseTableSchema API to ParserInterface > --- > > Key: SPARK-19830 > URL: https://issues.apache.org/jira/browse/SPARK-19830 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > Specifying the table schema in DDL formats is needed for different scenarios. > For example, specifying the schema in SQL function {{from_json}}, and > specifying the customized JDBC data types. In the submitted PRs, the idea is > to ask users to specify the table schema in the JSON format. This is not user > friendly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
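To make the DDL format concrete: the idea is that a table schema can be written as plain SQL column definitions and parsed into a StructType through the new parseTableSchema entry point, instead of being supplied as JSON. The parser object and method signature shown below are assumptions based on the issue title and PR, so treat this as a sketch rather than the final API.

{code}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.StructType

// Assumed entry point: parseTableSchema(ddl: String): StructType.
// The schema text is just an example of the DDL format the issue describes.
val schema: StructType = CatalystSqlParser.parseTableSchema(
  "id BIGINT, name STRING, location STRUCT<lat: DOUBLE, lon: DOUBLE>")

println(schema.treeString)
{code}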
[jira] [Commented] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output
[ https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927431#comment-15927431 ] Liwei Lin commented on SPARK-19965: --- Hi [~zsxwing], are you working on a patch? Mind if I work on this in case you haven't started your work? > DataFrame batch reader may fail to infer partitions when reading > FileStreamSink's output > > > Key: SPARK-19965 > URL: https://issues.apache.org/jira/browse/SPARK-19965 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu > > Reproducer > {code} > test("partitioned writing and batch reading with 'basePath'") { > val inputData = MemoryStream[Int] > val ds = inputData.toDS() > val outputDir = Utils.createTempDir(namePrefix = > "stream.output").getCanonicalPath > val checkpointDir = Utils.createTempDir(namePrefix = > "stream.checkpoint").getCanonicalPath > var query: StreamingQuery = null > try { > query = > ds.map(i => (i, i * 1000)) > .toDF("id", "value") > .writeStream > .partitionBy("id") > .option("checkpointLocation", checkpointDir) > .format("parquet") > .start(outputDir) > inputData.addData(1, 2, 3) > failAfter(streamingTimeout) { > query.processAllAvailable() > } > spark.read.option("basePath", outputDir).parquet(outputDir + > "/*").show() > } finally { > if (query != null) { > query.stop() > } > } > } > {code} > Stack trace > {code} > [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** > (3 seconds, 928 milliseconds) > [info] java.lang.AssertionError: assertion failed: Conflicting directory > structures detected. Suspicious paths: > [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637 > [info] > ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata > [info] > [info] If provided paths are partition directories, please set "basePath" in > the options of the data source to specify the root directory of the table. If > there are multiple root directories, please load them separately and then > union them. 
> [info] at scala.Predef$.assert(Predef.scala:170) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156) > [info] at > org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361) > [info] at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160) > [info] at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536) > [info] at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted
[ https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927409#comment-15927409 ] Takeshi Yamamuro commented on SPARK-18591: -- yea, sure. Have we already filed a JIRA for the planner improvement? > Replace hash-based aggregates with sort-based ones if inputs already sorted > --- > > Key: SPARK-18591 > URL: https://issues.apache.org/jira/browse/SPARK-18591 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently uses sort-based aggregates only in limited condition; the > cases where spark cannot use partial aggregates and hash-based ones. > However, if input ordering has already satisfied the requirements of > sort-based aggregates, it seems sort-based ones are faster than the other. > {code} > ./bin/spark-shell --conf spark.sql.shuffle.partitions=1 > val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS > value").sort($"key").cache > def timer[R](block: => R): R = { > val t0 = System.nanoTime() > val result = block > val t1 = System.nanoTime() > println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s") > result > } > timer { > df.groupBy("key").count().count > } > // codegen'd hash aggregate > Elapsed time: 7.116962977s > // non-codegen'd sort aggregarte > Elapsed time: 3.088816662s > {code} > If codegen'd sort-based aggregates are supported in SPARK-16844, this seems > to make the performance gap bigger; > {code} > - codegen'd sort aggregate > Elapsed time: 1.645234684s > {code} > Therefore, it'd be better to use sort-based ones in this case. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
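For readers who want to see which aggregate the planner actually selects for the benchmark quoted above, the physical plan makes it visible. The snippet below assumes a spark-shell style SparkSession named spark and reuses the column names from the issue description.

{code}
// Build the same pre-sorted, cached input as in the benchmark, then inspect the plan.
val df = spark.range(1000)
  .selectExpr("id AS key", "id % 10 AS value")
  .sort("key")
  .cache()

// The physical plan printed here contains either HashAggregate or SortAggregate,
// which is the choice this issue proposes to make ordering-aware.
df.groupBy("key").count().explain()
{code}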
[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927396#comment-15927396 ] Shubham Chopra commented on SPARK-19803: I am looking into this and will try to submit a fix in a day or so. Mostly trying to isolate the race condition and simplify the test cases. > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Shubham Chopra > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted
[ https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927388#comment-15927388 ] Wenchen Fan commented on SPARK-18591: - Can we hold it for a while? I think we need to improve the planner first. We should either move back to bottom-up planning, or add a physical optimization phase. We need to figure out a better way to do planning > Replace hash-based aggregates with sort-based ones if inputs already sorted > --- > > Key: SPARK-18591 > URL: https://issues.apache.org/jira/browse/SPARK-18591 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently uses sort-based aggregates only in limited condition; the > cases where spark cannot use partial aggregates and hash-based ones. > However, if input ordering has already satisfied the requirements of > sort-based aggregates, it seems sort-based ones are faster than the other. > {code} > ./bin/spark-shell --conf spark.sql.shuffle.partitions=1 > val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS > value").sort($"key").cache > def timer[R](block: => R): R = { > val t0 = System.nanoTime() > val result = block > val t1 = System.nanoTime() > println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s") > result > } > timer { > df.groupBy("key").count().count > } > // codegen'd hash aggregate > Elapsed time: 7.116962977s > // non-codegen'd sort aggregarte > Elapsed time: 3.088816662s > {code} > If codegen'd sort-based aggregates are supported in SPARK-16844, this seems > to make the performance gap bigger; > {code} > - codegen'd sort aggregate > Elapsed time: 1.645234684s > {code} > Therefore, it'd be better to use sort-based ones in this case. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon"
[ https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927385#comment-15927385 ] Imran Rashid commented on SPARK-7420: - This just [failed again|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74629/testReport/org.apache.spark.streaming.scheduler/JobGeneratorSuite/SPARK_6222__Do_not_clear_received_block_data_too_soon/] {noformat} sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 107 times over 10.24840867699 seconds. Last failure message: getBlocksOfBatch(batchTime).nonEmpty was false. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:336) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.scheduler.JobGeneratorSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.org$apache$spark$streaming$scheduler$JobGeneratorSuite$$anonfun$$anonfun$$waitForBlocksToBeAllocatedToBatch$1(JobGeneratorSuite.scala:105) at org.apache.spark.streaming.scheduler.JobGeneratorSuite$$anonfun$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$1.apply$mcVI$sp(JobGeneratorSuite.scala:114) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at org.apache.spark.streaming.scheduler.JobGeneratorSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(JobGeneratorSuite.scala:111) at org.apache.spark.streaming.scheduler.JobGeneratorSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(JobGeneratorSuite.scala:66) at org.apache.spark.streaming.TestSuiteBase$class.withStreamingContext(TestSuiteBase.scala:279) at ... {noformat} I'll also attach the relevant snippet from {{streaming/unit-test.log}} > Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block > data too soon" > - > > Key: SPARK-7420 > URL: https://issues.apache.org/jira/browse/SPARK-7420 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.3.1, 1.4.0 >Reporter: Andrew Or >Assignee: Tathagata Das >Priority: Critical > Labels: flaky-test > Attachments: trimmed-unit-tests-74629.log > > > {code} > The code passed to eventually never returned normally. Attempted 18 times > over 10.13803606001 seconds. Last failure message: > receiverTracker.hasUnallocatedBlocks was false. > {code} > It seems to be failing only in maven. > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
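The stack trace above comes out of ScalaTest's eventually helper, which keeps re-running a block of assertions until it passes or a timeout expires, then fails with TestFailedDueToTimeoutException. Below is a minimal, self-contained illustration of that retry pattern; the timeout, interval, and the flag being polled are arbitrary stand-ins, not the values JobGeneratorSuite uses.

{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.{Millis, Seconds, Span}

// Stand-in for the asynchronous state the real test waits on
// (blocks being allocated to a batch).
@volatile var ready = false
new Thread(new Runnable {
  override def run(): Unit = { Thread.sleep(500); ready = true }
}).start()

// eventually retries the block every 100 ms; if it still throws after 10 seconds,
// it fails with TestFailedDueToTimeoutException -- the failure mode reported above.
eventually(timeout(Span(10, Seconds)), interval(Span(100, Millis))) {
  assert(ready, "condition not yet true")
}
{code}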
[jira] [Updated] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon"
[ https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-7420: Attachment: trimmed-unit-tests-74629.log > Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block > data too soon" > - > > Key: SPARK-7420 > URL: https://issues.apache.org/jira/browse/SPARK-7420 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.3.1, 1.4.0 >Reporter: Andrew Or >Assignee: Tathagata Das >Priority: Critical > Labels: flaky-test > Attachments: trimmed-unit-tests-74629.log > > > {code} > The code passed to eventually never returned normally. Attempted 18 times > over 10.13803606001 seconds. Last failure message: > receiverTracker.hasUnallocatedBlocks was false. > {code} > It seems to be failing only in maven. > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-7420) Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block data too soon"
[ https://issues.apache.org/jira/browse/SPARK-7420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reopened SPARK-7420: - Assignee: (was: Tathagata Das) > Flaky test: o.a.s.streaming.JobGeneratorSuite "Do not clear received block > data too soon" > - > > Key: SPARK-7420 > URL: https://issues.apache.org/jira/browse/SPARK-7420 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.3.1, 1.4.0 >Reporter: Andrew Or >Priority: Critical > Labels: flaky-test > Attachments: trimmed-unit-tests-74629.log > > > {code} > The code passed to eventually never returned normally. Attempted 18 times > over 10.13803606001 seconds. Last failure message: > receiverTracker.hasUnallocatedBlocks was false. > {code} > It seems to be failing only in maven. > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/458/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/459/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2173/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19966) SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate
[ https://issues.apache.org/jira/browse/SPARK-19966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 吴志龙 updated SPARK-19966: Environment: redhot spark on yarn client was: redhot spark on yarn client > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate > --- > > Key: SPARK-19966 > URL: https://issues.apache.org/jira/browse/SPARK-19966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: redhot > spark on yarn client >Reporter: 吴志龙 > > 2017-03-16 10:17:47|GET_HIVE_DATA|delete table |drop table > dp_tmp.tmp_20783_20170316; > ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray()) > >FAILED|GET_HIVE_DATA|[[[0]]] ,has execption -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19966) SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate
[ https://issues.apache.org/jira/browse/SPARK-19966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 吴志龙 updated SPARK-19966: Environment: redhot spark on yarn client was: hadoop 2.6.0 jdk 1.7 spark 2.1 spark on yarn client > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate > --- > > Key: SPARK-19966 > URL: https://issues.apache.org/jira/browse/SPARK-19966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: redhot spark on yarn client >Reporter: 吴志龙 > > 2017-03-16 10:17:47|GET_HIVE_DATA|delete table |drop table > dp_tmp.tmp_20783_20170316; > ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray()) > >FAILED|GET_HIVE_DATA|[[[0]]] ,has execption -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19966) SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate
[ https://issues.apache.org/jira/browse/SPARK-19966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 吴志龙 updated SPARK-19966: Description: 2017-03-16 10:17:47|GET_HIVE_DATA|delete table |drop table dp_tmp.tmp_20783_20170316; ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray()) >FAILED|GET_HIVE_DATA|[[[0]]] ,has execption was: 2017-03-16 10:17:47|GET_HIVE_DATA|删除临时表|drop table dp_tmp.tmp_20783_20170316; ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray()) >FAILED|GET_HIVE_DATA|[[[0]]]条,有异常 > SparkListenerBus has already stopped! Dropping event > SparkListenerExecutorMetricsUpdate > --- > > Key: SPARK-19966 > URL: https://issues.apache.org/jira/browse/SPARK-19966 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: hadoop 2.6.0 > jdk 1.7 > spark 2.1 > spark on yarn client >Reporter: 吴志龙 > > 2017-03-16 10:17:47|GET_HIVE_DATA|delete table |drop table > dp_tmp.tmp_20783_20170316; > ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray()) > >FAILED|GET_HIVE_DATA|[[[0]]] ,has execption -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19966) SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate
吴志龙 created SPARK-19966: --- Summary: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate Key: SPARK-19966 URL: https://issues.apache.org/jira/browse/SPARK-19966 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Environment: hadoop 2.6.0 jdk 1.7 spark 2.1 spark on yarn client Reporter: 吴志龙 2017-03-16 10:17:47|GET_HIVE_DATA|drop temporary table|drop table dp_tmp.tmp_20783_20170316; ERROR|2017-03-16 10:17:44|LiveListenerBus|SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(3,WrappedArray()) >FAILED|GET_HIVE_DATA|[[[0]]] rows, exception occurred -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output
[ https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927333#comment-15927333 ] Shixiong Zhu commented on SPARK-19965: -- This is because inferring partitions doesn't ignore the "_spark_metadata" folder. > DataFrame batch reader may fail to infer partitions when reading > FileStreamSink's output > > > Key: SPARK-19965 > URL: https://issues.apache.org/jira/browse/SPARK-19965 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu > > Reproducer > {code} > test("partitioned writing and batch reading with 'basePath'") { > val inputData = MemoryStream[Int] > val ds = inputData.toDS() > val outputDir = Utils.createTempDir(namePrefix = > "stream.output").getCanonicalPath > val checkpointDir = Utils.createTempDir(namePrefix = > "stream.checkpoint").getCanonicalPath > var query: StreamingQuery = null > try { > query = > ds.map(i => (i, i * 1000)) > .toDF("id", "value") > .writeStream > .partitionBy("id") > .option("checkpointLocation", checkpointDir) > .format("parquet") > .start(outputDir) > inputData.addData(1, 2, 3) > failAfter(streamingTimeout) { > query.processAllAvailable() > } > spark.read.option("basePath", outputDir).parquet(outputDir + > "/*").show() > } finally { > if (query != null) { > query.stop() > } > } > } > {code} > Stack trace > {code} > [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** > (3 seconds, 928 milliseconds) > [info] java.lang.AssertionError: assertion failed: Conflicting directory > structures detected. Suspicious paths: > [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637 > [info] > ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata > [info] > [info] If provided paths are partition directories, please set "basePath" in > the options of the data source to specify the root directory of the table. If > there are multiple root directories, please load them separately and then > union them. 
> [info] at scala.Predef$.assert(Predef.scala:170) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156) > [info] at > org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361) > [info] at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160) > [info] at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536) > [info] at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
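The comment above pins the failure on partition inference picking up the sink's _spark_metadata directory as if it were a partition directory. The sketch below is not Spark's actual fix; it only illustrates the kind of filtering such a fix needs to apply before directories are fed into partition inference.

{code}
import org.apache.hadoop.fs.Path

// Illustration only: treat Spark metadata and hidden entries as non-data paths.
def isDataPath(p: Path): Boolean = {
  val name = p.getName
  !name.startsWith("_") && !name.startsWith(".")
}

val leafDirs = Seq(
  new Path("/stream.output/id=1"),
  new Path("/stream.output/_spark_metadata"))  // the directory tripping up inference

// Only /stream.output/id=1 survives, so inference sees a consistent layout.
val partitionDirs = leafDirs.filter(isDataPath)
{code}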
[jira] [Created] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output
Shixiong Zhu created SPARK-19965: Summary: DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output Key: SPARK-19965 URL: https://issues.apache.org/jira/browse/SPARK-19965 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Shixiong Zhu Reproducer {code} test("partitioned writing and batch reading with 'basePath'") { val inputData = MemoryStream[Int] val ds = inputData.toDS() val outputDir = Utils.createTempDir(namePrefix = "stream.output").getCanonicalPath val checkpointDir = Utils.createTempDir(namePrefix = "stream.checkpoint").getCanonicalPath var query: StreamingQuery = null try { query = ds.map(i => (i, i * 1000)) .toDF("id", "value") .writeStream .partitionBy("id") .option("checkpointLocation", checkpointDir) .format("parquet") .start(outputDir) inputData.addData(1, 2, 3) failAfter(streamingTimeout) { query.processAllAvailable() } spark.read.option("basePath", outputDir).parquet(outputDir + "/*").show() } finally { if (query != null) { query.stop() } } } {code} Stack trace {code} [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** (3 seconds, 928 milliseconds) [info] java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths: [info] ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637 [info] ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata [info] [info] If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them. [info] at scala.Predef$.assert(Predef.scala:170) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133) [info] at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156) [info] at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54) [info] at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55) [info] at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133) [info] at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361) [info] at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160) [info] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536) [info] at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) [info] at org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927263#comment-15927263 ] Kay Ousterhout commented on SPARK-19803: This failed again today: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74621/testReport/org.apache.spark.storage/BlockManagerProactiveReplicationSuite/proactive_block_replication___3_replicas___2_block_manager_deletions/ > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Shubham Chopra > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19751) Create Data frame API fails with a self referencing bean
[ https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19751. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17188 [https://github.com/apache/spark/pull/17188] > Create Data frame API fails with a self referencing bean > > > Key: SPARK-19751 > URL: https://issues.apache.org/jira/browse/SPARK-19751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Avinash Venkateshaiah >Priority: Minor > Fix For: 2.2.0 > > > createDataset API throws a stack overflow exception when we try creating a > Dataset using a bean encoder. The bean is self referencing > BEAN: > public class HierObj implements Serializable { > String name; > List<HierObj> children; > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > public List<HierObj> getChildren() { > return children; > } > public void setChildren(List<HierObj> children) { > this.children = children; > } > } > // create an object > HierObj hierObj = new HierObj(); > hierObj.setName("parent"); > List<HierObj> children = new ArrayList<HierObj>(); > HierObj child1 = new HierObj(); > child1.setName("child1"); > HierObj child2 = new HierObj(); > child2.setName("child2"); > children.add(child1); > children.add(child2); > hierObj.setChildren(children); > // create a dataset > Dataset<HierObj> ds = sparkSession().createDataset(Arrays.asList(hierObj), > Encoders.bean(HierObj.class)); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19751) Create Data frame API fails with a self referencing bean
[ https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19751: --- Assignee: Takeshi Yamamuro > Create Data frame API fails with a self referencing bean > > > Key: SPARK-19751 > URL: https://issues.apache.org/jira/browse/SPARK-19751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Avinash Venkateshaiah >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.2.0 > > > createDataset API throws a stack overflow exception when we try creating a > Dataset using a bean encoder. The bean is self referencing > BEAN: > public class HierObj implements Serializable { > String name; > List<HierObj> children; > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > public List<HierObj> getChildren() { > return children; > } > public void setChildren(List<HierObj> children) { > this.children = children; > } > } > // create an object > HierObj hierObj = new HierObj(); > hierObj.setName("parent"); > List<HierObj> children = new ArrayList<HierObj>(); > HierObj child1 = new HierObj(); > child1.setName("child1"); > HierObj child2 = new HierObj(); > child2.setName("child2"); > children.add(child1); > children.add(child2); > hierObj.setChildren(children); > // create a dataset > Dataset<HierObj> ds = sparkSession().createDataset(Arrays.asList(hierObj), > Encoders.bean(HierObj.class)); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19961) Unify the exception error message for dropDatabase
[ https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19961. - Resolution: Fixed Assignee: Song Jun Fix Version/s: 2.2.0 > Unify the exception error message for dropDatabase > --- > > Key: SPARK-19961 > URL: https://issues.apache.org/jira/browse/SPARK-19961 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Song Jun >Assignee: Song Jun >Priority: Trivial > Fix For: 2.2.0 > > > Unify the exception error message for dropDatabase when the database still > has some tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19948) Document that saveAsTable uses catalog as source of truth for table existence.
[ https://issues.apache.org/jira/browse/SPARK-19948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19948: --- Assignee: Juliusz Sompolski > Document that saveAsTable uses catalog as source of truth for table existence. > -- > > Key: SPARK-19948 > URL: https://issues.apache.org/jira/browse/SPARK-19948 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski > Fix For: 2.2.0 > > > It is quirky behaviour that saveAsTable to e.g. a JDBC source with SaveMode > other than Overwrite will nevertheless overwrite the table in the external > source,if that table was not a catalog table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19948) Document that saveAsTable uses catalog as source of truth for table existence.
[ https://issues.apache.org/jira/browse/SPARK-19948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19948. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17289 [https://github.com/apache/spark/pull/17289] > Document that saveAsTable uses catalog as source of truth for table existence. > -- > > Key: SPARK-19948 > URL: https://issues.apache.org/jira/browse/SPARK-19948 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Juliusz Sompolski > Fix For: 2.2.0 > > > It is quirky behaviour that saveAsTable to e.g. a JDBC source with SaveMode > other than Overwrite will nevertheless overwrite the table in the external > source,if that table was not a catalog table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19931) InMemoryTableScanExec should rewrite output partitioning and ordering when aliasing output attributes
[ https://issues.apache.org/jira/browse/SPARK-19931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19931: --- Assignee: Liang-Chi Hsieh > InMemoryTableScanExec should rewrite output partitioning and ordering when > aliasing output attributes > - > > Key: SPARK-19931 > URL: https://issues.apache.org/jira/browse/SPARK-19931 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.2.0 > > > Now InMemoryTableScanExec simply takes the outputPartitioning and > outputOrdering from the associated InMemoryRelation's > child.outputPartitioning and outputOrdering. > However, InMemoryTableScanExec can alias the output attributes. In this case, > its outputPartitioning and outputOrdering are not correct and its parent > operators can't correctly determine its data distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19931) InMemoryTableScanExec should rewrite output partitioning and ordering when aliasing output attributes
[ https://issues.apache.org/jira/browse/SPARK-19931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19931. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17175 [https://github.com/apache/spark/pull/17175] > InMemoryTableScanExec should rewrite output partitioning and ordering when > aliasing output attributes > - > > Key: SPARK-19931 > URL: https://issues.apache.org/jira/browse/SPARK-19931 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.2.0 > > > Now InMemoryTableScanExec simply takes the outputPartitioning and > outputOrdering from the associated InMemoryRelation's > child.outputPartitioning and outputOrdering. > However, InMemoryTableScanExec can alias the output attributes. In this case, > its outputPartitioning and outputOrdering are not correct and its parent > operators can't correctly determine its data distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927240#comment-15927240 ] Yash Sharma commented on SPARK-7736: This does not seem Fixed. The application still completes with SUCCESS status even when an exception is thrown from the application. Spark version 2.0.2. > Exception not failing Python applications (in yarn cluster mode) > > > Key: SPARK-7736 > URL: https://issues.apache.org/jira/browse/SPARK-7736 > Project: Spark > Issue Type: Bug > Components: YARN > Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 >Reporter: Shay Rojansky >Assignee: Marcelo Vanzin > Fix For: 1.5.1, 1.6.0 > > > It seems that exceptions thrown in Python spark apps after the SparkContext > is instantiated don't cause the application to fail, at least in Yarn: the > application is marked as SUCCEEDED. > Note that any exception right before the SparkContext correctly places the > application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers
[ https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-18066. Resolution: Fixed Assignee: Eren Avsarogullari Fix Version/s: 2.2.0 > Add Pool usage policies test coverage for FIFO & FAIR Schedulers > > > Key: SPARK-18066 > URL: https://issues.apache.org/jira/browse/SPARK-18066 > Project: Spark > Issue Type: Test > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari >Priority: Minor > Fix For: 2.2.0 > > > The following Pool usage cases need to have Unit test coverage : > - FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* > property is set, related pool is not created and *TaskSetManagers* are added > to root pool. > - FAIR Scheduler uses default pool when *spark.scheduler.pool* property is > not set. This can be happened when Properties object is null or empty(*new > Properties()*) or points default pool(*spark.scheduler.pool*=_default_). > - FAIR Scheduler creates a new pool with default values when > *spark.scheduler.pool* property points _non-existent_ pool. This can be > happened when scheduler allocation file is not set or it does not contain > related pool. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927122#comment-15927122 ] Hyukjin Kwon commented on SPARK-19872: -- Yup, test was added. > UnicodeDecodeError in Pyspark on sc.textFile read with repartition > -- > > Key: SPARK-19872 > URL: https://issues.apache.org/jira/browse/SPARK-19872 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: Mac and EC2 >Reporter: Brian Bruggeman >Assignee: Hyukjin Kwon > Fix For: 2.1.1, 2.2.0 > > > I'm receiving the following traceback: > {code} > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > I created a textfile (text.txt) with standard linux newlines: > {code} > a > b > d > e > f > g > h > i > j > k > l > {code} > I think ran pyspark: > {code} > $ pyspark > Python 2.7.13 (default, Dec 18 2016, 07:03:39) > [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) > SparkSession available as 'spark'. 
> >>> sc.textFile('test.txt').collect() > [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] > >>> sc.textFile('test.txt', use_unicode=False).collect() > ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] > >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() > ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', > '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > This really looks like a bug in the `serializers.py` code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
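For anyone wanting to guard against a regression locally, the behaviour above can be checked with a small standalone script; this is a hypothetical sketch, not the test that was actually merged:

{code}
# -*- coding: utf-8 -*-
# Standalone check for SPARK-19872: lines read via sc.textFile() should
# survive repartition() without raising UnicodeDecodeError.
import os
import tempfile
from pyspark import SparkContext

sc = SparkContext("local[2]", "unicode-repartition-check")

path = os.path.join(tempfile.mkdtemp(), "test.txt")
lines = [u"a", u"b", u"caf\u00e9", u"d"]  # include a multi-byte character
with open(path, "wb") as f:
    f.write(u"\n".join(lines).encode("utf-8"))

assert sorted(sc.textFile(path).collect()) == sorted(lines)
# Before the fix, this second read raised UnicodeDecodeError because the
# Python-side serializer tried to UTF-8 decode pickled batch data.
assert sorted(sc.textFile(path).repartition(3).collect()) == sorted(lines)

print("ok")
sc.stop()
{code}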
[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.
[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927111#comment-15927111 ] Hyukjin Kwon commented on SPARK-19954: -- Honestly, we don't usually set {{Blocker}} {quote} Priority. Set to Major or below; higher priorities are generally reserved for committers to set {quote} I don't want to repeat what is written in the guide lines multiple times in the future. > Joining to a unioned DataFrame does not produce expected result. > > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Arun Allamsetty >Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > +---++---+++++ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---++---+++++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.
[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927103#comment-15927103 ] Hyukjin Kwon commented on SPARK-19954: -- I was following "Contributing to JIRA Maintenance" http://spark.apache.org/contributing.html {quote} For issues that can’t be reproduced against master as reported, resolve as Cannot Reproduce {quote} Now I can identify it. I am resolving this as {{Duplicate}} as written {quote} If the issue is the same as or a subset of another issue, resolved as Duplicate {quote} > Joining to a unioned DataFrame does not produce expected result. > > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Arun Allamsetty >Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > +---++---+++++ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---++---+++++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.
[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-19954: -- > Joining to a unioned DataFrame does not produce expected result. > > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Arun Allamsetty >Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > +---++---+++++ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---++---+++++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.
[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19954. -- Resolution: Duplicate > Joining to a unioned DataFrame does not produce expected result. > > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Arun Allamsetty >Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > +---++---+++++ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---++---+++++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13369) Number of consecutive fetch failures for a stage before the job is aborted should be configurable
[ https://issues.apache.org/jira/browse/SPARK-13369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927085#comment-15927085 ] Apache Spark commented on SPARK-13369: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/17307 > Number of consecutive fetch failures for a stage before the job is aborted > should be configurable > -- > > Key: SPARK-13369 > URL: https://issues.apache.org/jira/browse/SPARK-13369 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Sital Kedia >Priority: Minor > > Currently this limit is hardcoded. We need to make it configurable because > for long-running jobs the chance of fetch failures due to machine reboots is > high, and we need a configuration parameter to bump up that number. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
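If the patch is merged in its current form, raising the limit should become a one-line configuration change. A sketch, with the caveat that the property name used here is an assumption about what the pull request will finally call it:

{code}
from pyspark import SparkConf, SparkContext

# Assumed property name (spark.stage.maxConsecutiveAttempts); until the
# SPARK-13369 change is merged, the threshold remains hardcoded in the
# DAGScheduler.
conf = (SparkConf()
        .setAppName("fetch-failure-tolerant-job")
        .set("spark.stage.maxConsecutiveAttempts", "10"))
sc = SparkContext(conf=conf)
{code}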
[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.
[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927071#comment-15927071 ] Herman van Hovell commented on SPARK-19954: --- This is reproducible on Spark 2.1. This is caused by a bug in the {{ConstantFolding}} optimizer rule, and this has been fixed in SPARK-18300. I think we will be doing a 2.1.x maintenance release soon, and that will contain this fix. > Joining to a unioned DataFrame does not produce expected result. > > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Arun Allamsetty >Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > +---++---+++++ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---++---+++++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
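For anyone stuck on 2.1.0 until that maintenance release, the description already notes that caching the unioned DataFrame yields the correct join result. A cut-down PySpark sketch of that workaround follows; whether the planner bug also reproduces through the Python API has not been verified here:

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("union-join-workaround").getOrCreate()

a = spark.createDataFrame([(1, True)], ["id", "colA"])
b = spark.createDataFrame([(2, 10)], ["id", "colB"])
x = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

union_df = (a.select("id", lit("a").alias("name"), "colA", lit(None).alias("colB"))
            .union(b.select("id", lit("b").alias("name"), lit(None).alias("colA"), "colB")))

# Workaround reported in the description: caching the unioned DataFrame before
# the join produced the complete, correct result on 2.1.0.
union_df.cache()

result = x.join(union_df, (union_df["name"] == x["name"]) & (union_df["id"] == x["id"]))
result.show()
{code}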
[jira] [Resolved] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`
[ https://issues.apache.org/jira/browse/SPARK-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-19960. - Resolution: Fixed Assignee: Jiang Xingbo Fix Version/s: 2.2.0 > Move `SparkHadoopWriter` to `internal/io/` > -- > > Key: SPARK-19960 > URL: https://issues.apache.org/jira/browse/SPARK-19960 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > We should move `SparkHadoopWriter` to `internal/io/`, that will make it > easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926992#comment-15926992 ] Brian Bruggeman commented on SPARK-19872: - Wondering if a test will be added to prevent future regressions. > UnicodeDecodeError in Pyspark on sc.textFile read with repartition > -- > > Key: SPARK-19872 > URL: https://issues.apache.org/jira/browse/SPARK-19872 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: Mac and EC2 >Reporter: Brian Bruggeman >Assignee: Hyukjin Kwon > Fix For: 2.1.1, 2.2.0 > > > I'm receiving the following traceback: > {code} > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > I created a textfile (text.txt) with standard linux newlines: > {code} > a > b > d > e > f > g > h > i > j > k > l > {code} > I think ran pyspark: > {code} > $ pyspark > Python 2.7.13 (default, Dec 18 2016, 07:03:39) > [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) > SparkSession available as 'spark'. 
> >>> sc.textFile('test.txt').collect() > [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] > >>> sc.textFile('test.txt', use_unicode=False).collect() > ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] > >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() > ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', > '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > This really looks like a bug in the `serializers.py` code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.
[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926882#comment-15926882 ] Arun Allamsetty commented on SPARK-19954: - [~maropu] and [~hyukjin.kwon] were you able to reproduce the issue on the released Spark 2.1 though? > Joining to a unioned DataFrame does not produce expected result. > > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Arun Allamsetty >Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > +---++---+++++ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---++---+++++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19964) SparkSubmitSuite fails due to Timeout
[ https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-19964: --- Attachment: SparkSubmitSuite_Stacktrace > SparkSubmitSuite fails due to Timeout > - > > Key: SPARK-19964 > URL: https://issues.apache.org/jira/browse/SPARK-19964 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 2.2.0 >Reporter: Eren Avsarogullari > Labels: flaky-test > Attachments: SparkSubmitSuite_Stacktrace > > > The following test case failed due to a TestFailedDueToTimeoutException > *Test Suite:* SparkSubmitSuite > *Test Case:* includes jars passed in through --packages > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ > *Stacktrace is also attached.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19964) SparkSubmitSuite fails due to Timeout
Eren Avsarogullari created SPARK-19964: -- Summary: SparkSubmitSuite fails due to Timeout Key: SPARK-19964 URL: https://issues.apache.org/jira/browse/SPARK-19964 Project: Spark Issue Type: Bug Components: Deploy, Tests Affects Versions: 2.2.0 Reporter: Eren Avsarogullari The following test case failed due to a TestFailedDueToTimeoutException *Test Suite:* SparkSubmitSuite *Test Case:* includes jars passed in through --packages https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ *Stacktrace is also attached.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys
[ https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-13450. --- Resolution: Fixed Assignee: Tejas Patil Fix Version/s: 2.2.0 > SortMergeJoin will OOM when join rows have lot of same keys > --- > > Key: SPARK-13450 > URL: https://issues.apache.org/jira/browse/SPARK-13450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.2, 2.1.0 >Reporter: Hong Shen >Assignee: Tejas Patil > Fix For: 2.2.0 > > Attachments: heap-dump-analysis.png > > > When I run a sql with join, task throw java.lang.OutOfMemoryError and sql > failed. I have set spark.executor.memory 4096m. > SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if > the join rows have a lot of same key, it will throw OutOfMemoryError. > {code} > /** Buffered rows from the buffered side of the join. This is empty if > there are no matches. */ > private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new > ArrayBuffer[InternalRow] > {code} > Here is the stackTrace: > {code} > org.xerial.snappy.SnappyNative.arrayCopy(Native Method) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229) > org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105) > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > org.apache.spark.scheduler.Task.run(Task.scala:89) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
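The failure mode in the description, a single join key with a very large number of matching rows on the buffered side, can be provoked with a query of roughly the following shape. This is a synthetic sketch; whether it actually runs out of memory depends on executor memory and row width:

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (SparkSession.builder
         .appName("skewed-sort-merge-join")
         .config("spark.sql.autoBroadcastJoinThreshold", "-1")  # avoid a broadcast join
         .getOrCreate())

# Every row on the right side shares one key, so the sort-merge join has to
# buffer all of that key's matches at once (the bufferedMatches growth
# described in this ticket).
left = spark.range(0, 10).withColumn("k", lit(1))
right = spark.range(0, 10000000).withColumn("k", lit(1))

left.join(right, "k").count()
{code}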
[jira] [Created] (SPARK-19963) create view from select Fails when nullif() is used
Jay Danielsen created SPARK-19963: - Summary: create view from select Fails when nullif() is used Key: SPARK-19963 URL: https://issues.apache.org/jira/browse/SPARK-19963 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Jay Danielsen Priority: Minor Test Case: Any valid query using nullif. SELECT nullif(mycol,0) from mytable; CREATE VIEW fails when nullif is used in the select. CREATE VIEW my_view as SELECT nullif(mycol,0) from mytable; Error: java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ... I can refactor with a CASE statement and create the view successfully. CREATE VIEW my_view as SELECT CASE WHEN mycol = 0 THEN NULL ELSE mycol END mycol from mytable; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
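A runnable version of the repro, assuming a session where {{mytable}} can be created as a real (non-temporary) table, for example with Hive support enabled; the table set-up is illustrative and not part of the report:

{code}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("nullif-view-repro")
         .enableHiveSupport()  # assumes a catalog that supports permanent tables/views
         .getOrCreate())

# Illustrative base table; the report only assumes "mytable" with "mycol".
spark.createDataFrame([(0,), (7,)], ["mycol"]).write.saveAsTable("mytable")

# Plain SELECT with nullif() works.
spark.sql("SELECT nullif(mycol, 0) FROM mytable").show()

# On 2.1.0 this is reported to fail while analyzing the canonicalized SQL.
spark.sql("CREATE VIEW my_view AS SELECT nullif(mycol, 0) FROM mytable")

# Workaround from the report: rewrite nullif() with CASE.
spark.sql("""
  CREATE VIEW my_view_case AS
  SELECT CASE WHEN mycol = 0 THEN NULL ELSE mycol END AS mycol FROM mytable
""")
{code}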
[jira] [Comment Edited] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926723#comment-15926723 ] yu peng edited comment on SPARK-19962 at 3/15/17 6:42 PM: -- yeah, with one indexer that performing like mutually exclusive n "indexers". i can have a sparse version of the output features fairly easy. let's say if i want to train a classifer for gender based on age, country, hobbies. with DictVectorizor, it's going to be vec = DictVectorizor(sparse=True, inputColumns=['age', 'country', 'hobbies'], outputColumn='feature') df= vec.fit_transform(df) df= Binarizer(threshold=1.0, inputCol="_", outputCol="label").fit_transform(StringIndexer(inputColumn='gender', outputCol='_').fit_transform(df['gender']) rf = RandomForestClassifier(labelCol="label", featuresCol="fetures", numTrees=10) rf.fit(df) was (Author: yupbank): yeah, with one indexer that performing like mutually exclusive n "indexers". i can have a sparse version of the output features fairly easy. let's say if i want to train a classifer for gender based on age, country, hobbies. with DictVectorizor, it's going to be vec = DictVectorizor(sparse=True) df['fetures'] = vec.fit_transform(df[['age', 'country', 'hobbies']]) df['label'] = OnehotEncoder().fit_transform(StringIndexer().fit_transform(df['gender']) rf = RandomForestClassifier(labelCol="label", featuresCol="fetures", numTrees=10) rf.fit(df) > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > df.show() > |age|gender|country|hobbies| > |1|male|cn|[sing, dance]| > |3|female|us|[sing]| > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
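For comparison, the closest approximation with the existing Pipeline stages is one {{StringIndexer}} plus {{OneHotEncoder}} per categorical column, combined by {{VectorAssembler}}. A rough sketch of the age/country/gender example; multi-valued columns such as {{hobbies}} are not handled, which is exactly the gap being argued here:

{code}
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("dictvectorizer-approx").getOrCreate()
df = spark.createDataFrame([(1, "male", "cn"), (3, "female", "us")],
                           ["age", "gender", "country"])

# One indexer + encoder per categorical input column, stitched together with
# VectorAssembler; the per-dimension feature names live in each stage's column
# metadata rather than in a single DictVectorizer-style mapping.
stages = [
    StringIndexer(inputCol="country", outputCol="country_idx"),
    OneHotEncoder(inputCol="country_idx", outputCol="country_vec"),
    StringIndexer(inputCol="gender", outputCol="label"),
    VectorAssembler(inputCols=["age", "country_vec"], outputCol="features"),
    RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10),
]

model = Pipeline(stages=stages).fit(df)
model.transform(df).select("features", "label", "prediction").show()
{code}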
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926723#comment-15926723 ] yu peng commented on SPARK-19962: - yeah, with one indexer that performing like mutually exclusive n "indexers". i can have a sparse version of the output features fairly easy. let's say if i want to train a classifer for gender based on age, country, hobbies. with DictVectorizor, it's going to be vec = DictVectorizor(sparse=True) df['fetures'] = vec.fit_transform(df[['age', 'country', 'hobbies']]) df['label'] = OnehotEncoder().fit_transform(StringIndexer().fit_transform(df['gender']) rf = RandomForestClassifier(labelCol="label", featuresCol="fetures", numTrees=10) rf.fit(df) > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > df.show() > |age|gender|country|hobbies| > |1|male|cn|[sing, dance]| > |3|female|us|[sing]| > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926680#comment-15926680 ] Sean Owen commented on SPARK-19962: --- You would have n indexers and n models for n different columns; you need n mappings no matter what. I'm missing the distinction here. Why would you need the encodings of n different columns to be mutually exclusive -- if that's the point of the example? > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > df.show() > |age|gender|country|hobbies| > |1|male|cn|[sing, dance]| > |3|female|us|[sing]| > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19872. Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > UnicodeDecodeError in Pyspark on sc.textFile read with repartition > -- > > Key: SPARK-19872 > URL: https://issues.apache.org/jira/browse/SPARK-19872 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: Mac and EC2 >Reporter: Brian Bruggeman >Assignee: Hyukjin Kwon > Fix For: 2.1.1, 2.2.0 > > > I'm receiving the following traceback: > {code} > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > I created a textfile (text.txt) with standard linux newlines: > {code} > a > b > d > e > f > g > h > i > j > k > l > {code} > I think ran pyspark: > {code} > $ pyspark > Python 2.7.13 (default, Dec 18 2016, 07:03:39) > [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) > SparkSession available as 'spark'. 
> >>> sc.textFile('test.txt').collect() > [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] > >>> sc.textFile('test.txt', use_unicode=False).collect() > ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] > >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() > ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', > '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > This really looks like a bug in the `serializers.py` code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-19872: -- Assignee: Hyukjin Kwon > UnicodeDecodeError in Pyspark on sc.textFile read with repartition > -- > > Key: SPARK-19872 > URL: https://issues.apache.org/jira/browse/SPARK-19872 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 > Environment: Mac and EC2 >Reporter: Brian Bruggeman >Assignee: Hyukjin Kwon > Fix For: 2.1.1, 2.2.0 > > > I'm receiving the following traceback: > {code} > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > I created a textfile (text.txt) with standard linux newlines: > {code} > a > b > d > e > f > g > h > i > j > k > l > {code} > I think ran pyspark: > {code} > $ pyspark > Python 2.7.13 (default, Dec 18 2016, 07:03:39) > [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Dec 18 2016 07:03:39) > SparkSession available as 'spark'. 
> >>> sc.textFile('test.txt').collect() > [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l'] > >>> sc.textFile('test.txt', use_unicode=False).collect() > ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l'] > >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect() > ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', > '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.'] > >>> sc.textFile('test.txt').repartition(10).collect() > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 810, in collect > return list(_load_from_socket(port, self._jrdd_deserializer)) > File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", > line 140, in _load_from_socket > for item in serializer.load_stream(rf): > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 539, in load_stream > yield self.loads(stream) > File > "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", > line 534, in loads > return s.decode("utf-8") if self.use_unicode else s > File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", > line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: > invalid start byte > {code} > This really looks like a bug in the `serializers.py` code. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926559#comment-15926559 ] yu peng commented on SPARK-19962: - >Well, it's maintained by the StringIndexerModel for you. no i mean the structure across different columns. say i want to get back the third column from `[1, 0, 1, 0, 1, 1, 1]` to `gender=male` StringIndexerModel only maintain one column based input mapping. yeah, any categorical should be able to converted to a string. but the idea is maintain dimension mapping across different StringIndexers. > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > df.show() > |age|gender|country|hobbies| > |1|male|cn|[sing, dance]| > |3|female|us|[sing]| > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yu peng updated SPARK-19962: Description: it's really useful to have something like sklearn.feature_extraction.DictVectorizor Since out features lives in json/data frame like format and classifier/regressors only take vector input. so there is a gap between them. something like ``` df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', hobbies=['sing']), ]) df.show() |age|gender|country|hobbies| |1|male|cn|[sing, dance]| |3|female|us|[sing]| import DictVectorizor vec = DictVectorizor() matrix = vec.fit_transform(df) matrix.show() |features| |[1, 0, 1, 0, 1, 1, 1]| |[3, 1, 0, 1, 0, 1, 1]| vec.show() |feature_name| feature_dimension| |age|0| |gender=female|1| |gender=male|2| |country=us|3| |country=cn|4| |hobbies=sing|5| |hobbies=dance|6| ``` was: it's really useful to have something like sklearn.feature_extraction.DictVectorizor Since out features lives in json/data frame like format and classifier/regressors only take vector input. so there is a gap between them. something like ``` df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', hobbies=['sing']), ]) import DictVectorizor vec = DictVectorizor() matrix = vec.fit_transform(df) matrix.show() |features| |[1, 0, 1, 0, 1, 1, 1]| |[3, 1, 0, 1, 0, 1, 1]| vec.show() |feature_name| feature_dimension| |age|0| |gender=female|1| |gender=male|2| |country=us|3| |country=cn|4| |hobbies=sing|5| |hobbies=dance|6| ``` > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > df.show() > |age|gender|country|hobbies| > |1|male|cn|[sing, dance]| > |3|female|us|[sing]| > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926543#comment-15926543 ] Sean Owen commented on SPARK-19962: --- Well, it's maintained by the StringIndexerModel for you. What's your non-integer, non-string use case that can't be converted to a string but is categorical? > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926537#comment-15926537 ] yu peng commented on SPARK-19962: - yeah.. StringIndexer does some job on single column and only string type. the relationship between columns and feature names need to be maintained by some complex data structure. ^i have updated the description again add some enumerate features. > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yu peng updated SPARK-19962: Description: it's really useful to have something like sklearn.feature_extraction.DictVectorizor Since out features lives in json/data frame like format and classifier/regressors only take vector input. so there is a gap between them. something like ``` df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', hobbies=['sing']), ]) import DictVectorizor vec = DictVectorizor() matrix = vec.fit_transform(df) matrix.show() |features| |[1, 0, 1, 0, 1, 1, 1]| |[3, 1, 0, 1, 0, 1, 1]| vec.show() |feature_name| feature_dimension| |age|0| |gender=female|1| |gender=male|2| |country=us|3| |country=cn|4| |hobbies=sing|5| |hobbies=dance|6| ``` was: it's really useful to have something like sklearn.feature_extraction.DictVectorizor Since out features lives in json/data frame like format and classifier/regressors only take vector input. so there is a gap between them. something like ``` df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn'),Row(age=3, gender='female', country='us'), ]) import DictVectorizor vec = DictVectorizor() matrix = vec.fit_transform(df) matrix.show() |features| |[1, 0, 1, 0, 1]| |[3, 1, 0, 1, 0]| vec.show() |feature_name| feature_dimension| |age|0| |gender=female|1| |gender=male|2| |country=us|3| |country=cn|4| ``` > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926518#comment-15926518 ] Sean Owen commented on SPARK-19962: --- That sounds like what https://spark.apache.org/docs/latest/ml-features.html#stringindexer does > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', > country='cn'),Row(age=3, gender='female', country='us'), ]) > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1]| > |[3, 1, 0, 1, 0]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926458#comment-15926458 ] yu peng commented on SPARK-19962: - ^ i have updated the description > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', > country='cn'),Row(age=3, gender='female', country='us'), ]) > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1]| > |[3, 1, 0, 1, 0]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yu peng updated SPARK-19962: Description: it's really useful to have something like sklearn.feature_extraction.DictVectorizor Since out features lives in json/data frame like format and classifier/regressors only take vector input. so there is a gap between them. something like ``` df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn'),Row(age=3, gender='female', country='us'), ]) import DictVectorizor vec = DictVectorizor() matrix = vec.fit_transform(df) matrix.show() |features| |[1, 0, 1, 0, 1]| |[3, 1, 0, 1, 0]| vec.show() |feature_name| feature_dimension| |age|0| |gender=female|1| |gender=male|2| |country=us|3| |country=cn|4| ``` was: it's really useful to have something like sklearn.feature_extraction.DictVectorizor Since out features lives in json/data frame like format and classifier/regressors only take vector input. so there is a gap between them. > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', > country='cn'),Row(age=3, gender='female', country='us'), ]) > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1]| > |[3, 1, 0, 1, 0]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926403#comment-15926403 ] Imran Rashid edited comment on SPARK-12297 at 3/15/17 3:53 PM: --- To expand on the original description-- Spark copied Hive's behavior for parquet, but this was inconsistent with Impala (which is the original source of putting a timestamp as an int96 in parquet, I believe), and inconsistent with other file formats. This made timestamps in parquet act more like timestamps with timezones, while in other file formats, timestamps have no time zone, they are a "floating time". The easiest way to see this issue is to write out a table with timestamps in multiple different formats from one timezone, then try to read them back in another timezone. Eg., here I write out a few timestamps to parquet and textfile hive tables, and also just as a json file, all in the "America/Los_Angeles" timezone: {code} import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val tblPrefix = args(0) val schema = new StructType().add("ts", TimestampType) val rows = sc.parallelize(Seq( "2015-12-31 23:50:59.123", "2015-12-31 22:49:59.123", "2016-01-01 00:39:59.123", "2016-01-01 01:29:59.123" ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) val rawData = spark.createDataFrame(rows, schema).toDF() rawData.show() Seq("parquet", "textfile").foreach { format => val tblName = s"${tblPrefix}_$format" spark.sql(s"DROP TABLE IF EXISTS $tblName") spark.sql( raw"""CREATE TABLE $tblName ( | ts timestamp | ) | STORED AS $format """.stripMargin) rawData.write.insertInto(tblName) } rawData.write.json(s"${tblPrefix}_json") {code} Then I start a spark-shell in "America/New_York" timezone, and read the data back from each table: {code} scala> spark.sql("select * from la_parquet").collect().foreach{println} [2016-01-01 02:50:59.123] [2016-01-01 01:49:59.123] [2016-01-01 03:39:59.123] [2016-01-01 04:29:59.123] scala> spark.sql("select * from la_textfile").collect().foreach{println} [2015-12-31 23:50:59.123] [2015-12-31 22:49:59.123] [2016-01-01 00:39:59.123] [2016-01-01 01:29:59.123] scala> spark.read.json("la_json").collect().foreach{println} [2015-12-31 23:50:59.123] [2015-12-31 22:49:59.123] [2016-01-01 00:39:59.123] [2016-01-01 01:29:59.123] scala> spark.read.json("la_json").join(spark.sql("select * from la_textfile"), "ts").show() ++ | ts| ++ |2015-12-31 23:50:...| |2015-12-31 22:49:...| |2016-01-01 00:39:...| |2016-01-01 01:29:...| ++ scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), "ts").show() +---+ | ts| +---+ +---+ {code} The textfile and json based data shows the same times, and can be joined against each other, while the times from the parquet data have changed (and obviously joins fail). This is a big problem for any organization that may try to read the same data (say in S3) with clusters in multiple timezones. It can also be a nasty surprise as an organization tries to migrate file formats. Finally, its a source of incompatibility between Hive, Impala, and Spark. HIVE-12767 aims to fix this by introducing a table property which indicates the "storage timezone" for the table. Spark should add the same to ensure consistency between file formats and with Hive & Impala. [~rdblue] unless you have any objections I'll update the bug description to reflect this. 
was (Author: irashid): To expand on the original description-- Spark copied Hive's behavior for parquet, but this was inconsistent with Impala (which is the original source of putting a timestamp as an int96 in parquet, I believe), and inconsistent with other file formats. This made timestamps in parquet act more like timestamps with timezones, while in other file formats, timestamps have no time zone, they are a "floating time". The easiest way to see this issue is to write out a table with timestamps in multiple different formats from one timezone, then try to read them back in another timezone. Eg., here I write out a few timestamps to parquet and textfile hive tables, and also just as a json file, all in the "America/Los_Angeles" timezone: {code} {code} Then I start a spark-shell in "America/New_York" timezone, and read the data back from each table: {code} scala> spark.sql("select * from la_parquet").collect().foreach{println} [2016-01-01 02:50:59.123] [2016-01-01 01:49:59.123] [2016-01-01 03:39:59.123] [2016-01-01 04:29:59.123] scala> spark.sql("select * from la_textfile").collect().foreach{println} [2015-12-31 23:50:59.123] [2015-12-31 22:49:59.123] [2016-01-01 00:39:59.123] [2016-01-01 01:29:59.123] scala> spark.read.json("la_json").collect().foreach{println} [2015-12-31 23:50:59.123] [2015-12-31 22:49:59.123] [2016-01-01
[jira] [Closed] (SPARK-19957) Inconsistent KMeans initialization mode behavior between ML and MLlib
[ https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang closed SPARK-19957. -- Resolution: Not A Problem > Inconsist KMeans initialization mode behavior between ML and MLlib > -- > > Key: SPARK-19957 > URL: https://issues.apache.org/jira/browse/SPARK-19957 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: yuhao yang >Priority: Minor > > when users set the initialization mode to "random", KMeans in ML and MLlib > has inconsistent behavior for multiple runs: > MLlib will basically use new Random for each run. > ML Kmeans however will use the default random seed, which is > {code}this.getClass.getName.hashCode.toLong{code}, and keep using the same > number among multiple fitting. > I would expect the "random" initialization mode to be literally random. > There're different solutions with different scope of impact. Adjusting the > hasSeed trait may have a broader impact(but maybe worth discussion). We can > always just set random default seed in KMeans. > Appreciate your feedback. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
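Until this is settled, a caller who wants literally random behavior per fit can pass an explicit seed themselves. A minimal sketch, where {{training}} is assumed to be a DataFrame with a "features" vector column and k is arbitrary:

{code}
import scala.util.Random
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(3)
  .setInitMode("random")
  .setSeed(Random.nextLong())  // fresh seed per fit, mimicking the MLlib behavior

val model = kmeans.fit(training)
{code}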
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926403#comment-15926403 ] Imran Rashid commented on SPARK-12297: -- To expand on the original description-- Spark copied Hive's behavior for parquet, but this was inconsistent with Impala (which is the original source of putting a timestamp as an int96 in parquet, I believe), and inconsistent with other file formats. This made timestamps in parquet act more like timestamps with timezones, while in other file formats, timestamps have no time zone, they are a "floating time". The easiest way to see this issue is to write out a table with timestamps in multiple different formats from one timezone, then try to read them back in another timezone. Eg., here I write out a few timestamps to parquet and textfile hive tables, and also just as a json file, all in the "America/Los_Angeles" timezone: {code} {code} Then I start a spark-shell in "America/New_York" timezone, and read the data back from each table: {code} scala> spark.sql("select * from la_parquet").collect().foreach{println} [2016-01-01 02:50:59.123] [2016-01-01 01:49:59.123] [2016-01-01 03:39:59.123] [2016-01-01 04:29:59.123] scala> spark.sql("select * from la_textfile").collect().foreach{println} [2015-12-31 23:50:59.123] [2015-12-31 22:49:59.123] [2016-01-01 00:39:59.123] [2016-01-01 01:29:59.123] scala> spark.read.json("la_json").collect().foreach{println} [2015-12-31 23:50:59.123] [2015-12-31 22:49:59.123] [2016-01-01 00:39:59.123] [2016-01-01 01:29:59.123] scala> spark.read.json("la_json").join(spark.sql("select * from la_textfile"), "ts").show() ++ | ts| ++ |2015-12-31 23:50:...| |2015-12-31 22:49:...| |2016-01-01 00:39:...| |2016-01-01 01:29:...| ++ scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), "ts").show() +---+ | ts| +---+ +---+ {code} The textfile and json based data shows the same times, and can be joined against each other, while the times from the parquet data have changed (and obviously joins fail). This is a big problem for any organization that may try to read the same data (say in S3) with clusters in multiple timezones. It can also be a nasty surprise as an organization tries to migrate file formats. Finally, its a source of incompatibility between Hive, Impala, and Spark. HIVE-12767 aims to fix this by introducing a table property which indicates the "storage timezone" for the table. Spark should add the same to ensure consistency between file formats and with Hive & Impala. [~rdblue] unless you have any objections I'll update the bug description to reflect this. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Hive has a bug where timestamps in Parquet data are incorrectly adjusted as > though they were in the SQL session time zone to UTC. This is incorrect > behavior because timestamp values are SQL timestamp without time zone and > should not be internally changed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19957) Inconsistent KMeans initialization mode behavior between ML and MLlib
[ https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926404#comment-15926404 ] yuhao yang commented on SPARK-19957: Thanks for the response. > Inconsist KMeans initialization mode behavior between ML and MLlib > -- > > Key: SPARK-19957 > URL: https://issues.apache.org/jira/browse/SPARK-19957 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: yuhao yang >Priority: Minor > > when users set the initialization mode to "random", KMeans in ML and MLlib > has inconsistent behavior for multiple runs: > MLlib will basically use new Random for each run. > ML Kmeans however will use the default random seed, which is > {code}this.getClass.getName.hashCode.toLong{code}, and keep using the same > number among multiple fitting. > I would expect the "random" initialization mode to be literally random. > There're different solutions with different scope of impact. Adjusting the > hasSeed trait may have a broader impact(but maybe worth discussion). We can > always just set random default seed in KMeans. > Appreciate your feedback. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926393#comment-15926393 ] Sean Owen commented on SPARK-19962: --- Can you describe what this does, and give an example of input/output? I'm not sure what this describes or whether it already exists in Spark. > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19962) add DictVectorizor for DataFrame
yu peng created SPARK-19962: --- Summary: add DictVectorizor for DataFrame Key: SPARK-19962 URL: https://issues.apache.org/jira/browse/SPARK-19962 Project: Spark Issue Type: Wish Components: ML Affects Versions: 2.1.0 Reporter: yu peng it's really useful to have something like sklearn.feature_extraction.DictVectorizor Since out features lives in json/data frame like format and classifier/regressors only take vector input. so there is a gap between them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19961) unify an exception error msg for dropdatabase
[ https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19961: -- Affects Version/s: (was: 2.2.0) 2.1.0 Priority: Trivial (was: Minor) Issue Type: Improvement (was: Bug) > unify a exception erro msg for dropdatabase > --- > > Key: SPARK-19961 > URL: https://issues.apache.org/jira/browse/SPARK-19961 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Song Jun >Priority: Trivial > > unify a exception erro msg for dropdatabase when the database still have some > tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
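For context, the behavior being discussed, as a sketch (database and table names are made up; the exact message text is what the linked change unifies, presumably across catalog implementations):

{code}
spark.sql("CREATE DATABASE db1")
spark.sql("CREATE TABLE db1.t1(id INT) USING parquet")

// Throws an AnalysisException because db1 still contains tables; the wording
// of that error is what this issue asks to make consistent.
spark.sql("DROP DATABASE db1")

// Succeeds: CASCADE drops the contained tables first.
spark.sql("DROP DATABASE db1 CASCADE")
{code}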
[jira] [Assigned] (SPARK-19961) unify an exception error msg for dropdatabase
[ https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19961: Assignee: (was: Apache Spark) > unify a exception erro msg for dropdatabase > --- > > Key: SPARK-19961 > URL: https://issues.apache.org/jira/browse/SPARK-19961 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > unify a exception erro msg for dropdatabase when the database still have some > tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19961) unify an exception error msg for dropdatabase
[ https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926304#comment-15926304 ] Apache Spark commented on SPARK-19961: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/17305 > unify a exception erro msg for dropdatabase > --- > > Key: SPARK-19961 > URL: https://issues.apache.org/jira/browse/SPARK-19961 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Priority: Minor > > unify a exception erro msg for dropdatabase when the database still have some > tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19961) unify an exception error msg for dropdatabase
[ https://issues.apache.org/jira/browse/SPARK-19961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19961: Assignee: Apache Spark > unify a exception erro msg for dropdatabase > --- > > Key: SPARK-19961 > URL: https://issues.apache.org/jira/browse/SPARK-19961 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Song Jun >Assignee: Apache Spark >Priority: Minor > > unify a exception erro msg for dropdatabase when the database still have some > tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19961) unify an exception error msg for dropdatabase
Song Jun created SPARK-19961: Summary: unify a exception erro msg for dropdatabase Key: SPARK-19961 URL: https://issues.apache.org/jira/browse/SPARK-19961 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Song Jun Priority: Minor unify a exception erro msg for dropdatabase when the database still have some tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19944) Move SQLConf from sql/core to sql/catalyst
[ https://issues.apache.org/jira/browse/SPARK-19944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-19944: -- Fix Version/s: 2.1.1 > Move SQLConf from sql/core to sql/catalyst > -- > > Key: SPARK-19944 > URL: https://issues.apache.org/jira/browse/SPARK-19944 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.1, 2.2.0 > > > It is pretty weird to have SQLConf only in sql/core and then we have to > duplicate config options that impact optimizer/analyzer in sql/catalyst using > CatalystConf. This ticket moves SQLConf into sql/catalyst. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`
[ https://issues.apache.org/jira/browse/SPARK-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19960: Assignee: (was: Apache Spark) > Move `SparkHadoopWriter` to `internal/io/` > -- > > Key: SPARK-19960 > URL: https://issues.apache.org/jira/browse/SPARK-19960 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Jiang Xingbo > > We should move `SparkHadoopWriter` to `internal/io/`, that will make it > easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`
[ https://issues.apache.org/jira/browse/SPARK-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926014#comment-15926014 ] Apache Spark commented on SPARK-19960: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/17304 > Move `SparkHadoopWriter` to `internal/io/` > -- > > Key: SPARK-19960 > URL: https://issues.apache.org/jira/browse/SPARK-19960 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Jiang Xingbo > > We should move `SparkHadoopWriter` to `internal/io/`, that will make it > easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`
[ https://issues.apache.org/jira/browse/SPARK-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19960: Assignee: Apache Spark > Move `SparkHadoopWriter` to `internal/io/` > -- > > Key: SPARK-19960 > URL: https://issues.apache.org/jira/browse/SPARK-19960 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Jiang Xingbo >Assignee: Apache Spark > > We should move `SparkHadoopWriter` to `internal/io/`, that will make it > easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19960) Move `SparkHadoopWriter` to `internal/io/`
Jiang Xingbo created SPARK-19960: Summary: Move `SparkHadoopWriter` to `internal/io/` Key: SPARK-19960 URL: https://issues.apache.org/jira/browse/SPARK-19960 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 2.2.0 Reporter: Jiang Xingbo We should move `SparkHadoopWriter` to `internal/io/`, that will make it easier to consolidate `SparkHadoopWriter` and `SparkHadoopMapReduceWriter`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19112) add codec for ZStandard
[ https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925933#comment-15925933 ] Apache Spark commented on SPARK-19112: -- User 'dongjinleekr' has created a pull request for this issue: https://github.com/apache/spark/pull/17303 > add codec for ZStandard > --- > > Key: SPARK-19112 > URL: https://issues.apache.org/jira/browse/SPARK-19112 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Thomas Graves >Priority: Minor > > ZStandard: https://github.com/facebook/zstd and > http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was > recently released. Hadoop > (https://issues.apache.org/jira/browse/HADOOP-13578) and others > (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it. > Zstd seems to give great results => Gzip level Compression with Lz4 level CPU. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
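If a zstd codec is added, usage would presumably mirror the existing codec configuration. A sketch, where both the codec short name and the level property are assumptions rather than settings that exist today:

{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.io.compression.codec", "zstd")      // assumed short name for the new codec
  .set("spark.io.compression.zstd.level", "3")    // assumed tuning knob for the compression level
{code}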
[jira] [Assigned] (SPARK-19112) add codec for ZStandard
[ https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19112: Assignee: Apache Spark > add codec for ZStandard > --- > > Key: SPARK-19112 > URL: https://issues.apache.org/jira/browse/SPARK-19112 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Thomas Graves >Assignee: Apache Spark >Priority: Minor > > ZStandard: https://github.com/facebook/zstd and > http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was > recently released. Hadoop > (https://issues.apache.org/jira/browse/HADOOP-13578) and others > (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it. > Zstd seems to give great results => Gzip level Compression with Lz4 level CPU. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19112) add codec for ZStandard
[ https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19112: Assignee: (was: Apache Spark) > add codec for ZStandard > --- > > Key: SPARK-19112 > URL: https://issues.apache.org/jira/browse/SPARK-19112 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Thomas Graves >Priority: Minor > > ZStandard: https://github.com/facebook/zstd and > http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was > recently released. Hadoop > (https://issues.apache.org/jira/browse/HADOOP-13578) and others > (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it. > Zstd seems to give great results => Gzip level Compression with Lz4 level CPU. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null
[ https://issues.apache.org/jira/browse/SPARK-19959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19959: Assignee: (was: Apache Spark) > df[java.lang.Long].collect throws NullPointerException if df includes null > -- > > Key: SPARK-19959 > URL: https://issues.apache.org/jira/browse/SPARK-19959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > The following program throws {{NullPointerException}} during the execution of > Java code generated by the wholestage codegen. > {code:java} > sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect > {code} > {code} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null
[ https://issues.apache.org/jira/browse/SPARK-19959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925873#comment-15925873 ] Apache Spark commented on SPARK-19959: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/17302 > df[java.lang.Long].collect throws NullPointerException if df includes null > -- > > Key: SPARK-19959 > URL: https://issues.apache.org/jira/browse/SPARK-19959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > The following program throws {{NullPointerException}} during the execution of > Java code generated by the wholestage codegen. > {code:java} > sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect > {code} > {code} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null
[ https://issues.apache.org/jira/browse/SPARK-19959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19959: Assignee: Apache Spark > df[java.lang.Long].collect throws NullPointerException if df includes null > -- > > Key: SPARK-19959 > URL: https://issues.apache.org/jira/browse/SPARK-19959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > The following program throws {{NullPointerException}} during the execution of > Java code generated by the wholestage codegen. > {code:java} > sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect > {code} > {code} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
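Until the fix lands, one way to sidestep the failing path is to build the DataFrame from Rows with an explicit nullable schema instead of relying on the boxed java.lang.Long encoder. A workaround sketch (not the fix itself):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(StructField("value", LongType, nullable = true)))
val rows = spark.sparkContext.parallelize(Seq(Row(0L), Row(null), Row(2L)), 1)

// Building from Rows with an explicit nullable schema should avoid the
// generated null check that throws in the repro above.
spark.createDataFrame(rows, schema).collect()
{code}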
[jira] [Assigned] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging
[ https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19946: Assignee: (was: Apache Spark) > DebugFilesystem.assertNoOpenStreams should report the open streams to help > debugging > > > Key: SPARK-19946 > URL: https://issues.apache.org/jira/browse/SPARK-19946 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Bogdan Raducanu >Priority: Minor > > In DebugFilesystem.assertNoOpenStreams if there are open streams an exception > is thrown showing the number of open streams. This doesn't help much to debug > where the open streams were leaked. > The exception should also report where the stream was leaked. This can be > done through a cause exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging
[ https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19946: Assignee: Apache Spark > DebugFilesystem.assertNoOpenStreams should report the open streams to help > debugging > > > Key: SPARK-19946 > URL: https://issues.apache.org/jira/browse/SPARK-19946 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Bogdan Raducanu >Assignee: Apache Spark >Priority: Minor > > In DebugFilesystem.assertNoOpenStreams if there are open streams an exception > is thrown showing the number of open streams. This doesn't help much to debug > where the open streams were leaked. > The exception should also report where the stream was leaked. This can be > done through a cause exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging
[ https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925856#comment-15925856 ] Apache Spark commented on SPARK-19946: -- User 'bogdanrdc' has created a pull request for this issue: https://github.com/apache/spark/pull/17292 > DebugFilesystem.assertNoOpenStreams should report the open streams to help > debugging > > > Key: SPARK-19946 > URL: https://issues.apache.org/jira/browse/SPARK-19946 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Bogdan Raducanu >Priority: Minor > > In DebugFilesystem.assertNoOpenStreams if there are open streams an exception > is thrown showing the number of open streams. This doesn't help much to debug > where the open streams were leaked. > The exception should also report where the stream was leaked. This can be > done through a cause exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
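The suggestion amounts to remembering where each stream was opened and surfacing that as the cause. A minimal sketch of the idea, independent of the actual patch in the linked pull request:

{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object OpenStreamTracker {
  // Maps each open stream to a Throwable capturing the stack trace at open time.
  private val openStreams = new ConcurrentHashMap[AnyRef, Throwable]()

  def opened(stream: AnyRef): Unit = openStreams.put(stream, new Throwable("stream opened here"))

  def closed(stream: AnyRef): Unit = openStreams.remove(stream)

  def assertNoOpenStreams(): Unit = {
    if (!openStreams.isEmpty) {
      // Attach the open-site stack trace as the cause, so the leak is debuggable.
      val cause = openStreams.values().asScala.head
      throw new IllegalStateException(
        s"There are ${openStreams.size()} possibly leaked file streams.", cause)
    }
  }
}
{code}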
[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted
[ https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925841#comment-15925841 ] Herman van Hovell commented on SPARK-18591: --- [~maropu] I don't think the current planning (of aggregates) is optimal - I think your previous work on merging partial aggregates highlights this. I think we should move back to bottom-up planning. This is also very relevant for the current CBO work. cc [~cloud_fan] We discussed something like this a while back cc [~ueshin] You have implemented the current top-down work. Is there an easy way to move back to bottom up? > Replace hash-based aggregates with sort-based ones if inputs already sorted > --- > > Key: SPARK-18591 > URL: https://issues.apache.org/jira/browse/SPARK-18591 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 >Reporter: Takeshi Yamamuro > > Spark currently uses sort-based aggregates only in limited condition; the > cases where spark cannot use partial aggregates and hash-based ones. > However, if input ordering has already satisfied the requirements of > sort-based aggregates, it seems sort-based ones are faster than the other. > {code} > ./bin/spark-shell --conf spark.sql.shuffle.partitions=1 > val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS > value").sort($"key").cache > def timer[R](block: => R): R = { > val t0 = System.nanoTime() > val result = block > val t1 = System.nanoTime() > println("Elapsed time: " + ((t1 - t0 + 0.0) / 10.0)+ "s") > result > } > timer { > df.groupBy("key").count().count > } > // codegen'd hash aggregate > Elapsed time: 7.116962977s > // non-codegen'd sort aggregarte > Elapsed time: 3.088816662s > {code} > If codegen'd sort-based aggregates are supported in SPARK-16844, this seems > to make the performance gap bigger; > {code} > - codegen'd sort aggregate > Elapsed time: 1.645234684s > {code} > Therefore, it'd be better to use sort-based ones in this case. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19959) df[java.lang.Long].collect throws NullPointerException if df includes null
Kazuaki Ishizaki created SPARK-19959: Summary: df[java.lang.Long].collect throws NullPointerException if df includes null Key: SPARK-19959 URL: https://issues.apache.org/jira/browse/SPARK-19959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kazuaki Ishizaki The following program throws {{NullPointerException}} during the execution of Java code generated by the wholestage codegen. {code:java} sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect {code} {code} Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:394) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19889) Make TaskContext callbacks synchronized
[ https://issues.apache.org/jira/browse/SPARK-19889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19889. --- Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.2.0 > Make TaskContext callbacks synchronized > --- > > Key: SPARK-19889 > URL: https://issues.apache.org/jira/browse/SPARK-19889 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.2.0 > > > In some cases you want to fork of some part of a task to a different thread. > In these cases it would be very useful if {{TaskContext}} callbacks are > synchronized and that we can synchronize on the task context. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
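To illustrate the use case: part of a task's work is handed to a helper thread, and that thread also needs to register cleanup callbacks on the same TaskContext. A sketch, where {{rdd}} and the helper's work are placeholders:

{code}
import org.apache.spark.TaskContext

val processed = rdd.mapPartitions { iter =>
  val ctx = TaskContext.get()
  val helper = new Thread(new Runnable {
    override def run(): Unit = {
      // ... background work owned by this task ...
      // Registering from a thread other than the task runner is why
      // synchronized callback registration matters.
      ctx.addTaskCompletionListener(_ => ())
    }
  })
  helper.start()
  helper.join()
  iter
}
{code}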
[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925775#comment-15925775 ] Sergey Rubtsov commented on SPARK-19228: Hi [~hyukjin.kwon], Updated pull request: https://github.com/apache/spark/pull/16735 Please, take a look. Couldn't run tests in CSVSuite locally on my Windows OS, apologize for the possible test fails > inferSchema function processed csv date column as string and "dateFormat" > DataSource option is ignored > -- > > Key: SPARK-19228 > URL: https://issues.apache.org/jira/browse/SPARK-19228 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.1.0 >Reporter: Sergey Rubtsov > Labels: easyfix > Original Estimate: 6h > Remaining Estimate: 6h > > I need to process user.csv like this: > {code} > id,project,started,ended > sergey.rubtsov,project0,12/12/2012,10/10/2015 > {code} > When I add date format options: > {code} > Dataset users = spark.read().format("csv").option("mode", > "PERMISSIVE").option("header", "true") > .option("inferSchema", > "true").option("dateFormat", > "dd/MM/").load("src/main/resources/user.csv"); > users.printSchema(); > {code} > expected scheme should be > {code} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: date (nullable = true) > |-- ended: date (nullable = true) > {code} > but the actual result is: > {code} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: string (nullable = true) > |-- ended: string (nullable = true) > {code} > This mean that date processed as string and "dateFormat" option is ignored. > If I add option > {code} > .option("timestampFormat", "dd/MM/") > {code} > result is: > {code} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: timestamp (nullable = true) > |-- ended: timestamp (nullable = true) > {code} > I think, the issue is somewhere in object CSVInferSchema, function > inferField, lines 80-97 and > method "tryParseDate" need to be added before/after "tryParseTimestamp", or > date/timestamp process logic need to be changed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
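As a workaround until schema inference honors the option, supplying an explicit schema makes the reader parse the two columns as dates. A sketch, assuming the intended pattern is dd/MM/yyyy:

{code}
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("project", StringType),
  StructField("started", DateType),
  StructField("ended", DateType)))

val users = spark.read
  .option("header", "true")
  .option("dateFormat", "dd/MM/yyyy")
  .schema(schema)                       // explicit schema instead of inferSchema
  .csv("src/main/resources/user.csv")

users.printSchema()
{code}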
[jira] [Resolved] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.
[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19954. -- Resolution: Cannot Reproduce This seems fixed in the current master. That code produces the result as below: {code} +---++---+++++ | id|name| id|name|colA|colB|colC| +---++---+++++ | 1| a| 1| a|true|null|null| | 2| b| 2| b|null| 10|null| | 3| c| 3| c|null|null|9.73| +---++---+++++ {code} I am resolving this. It would be nice if someone identifies the JIRA fixing this and backports if needed. > Joining to a unioned DataFrame does not produce expected result. > > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Arun Allamsetty >Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > +---++---+++++ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---++---+++++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19954) Joining to a unioned DataFrame does not produce expected result.
[ https://issues.apache.org/jira/browse/SPARK-19954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925754#comment-15925754 ] Takeshi Yamamuro commented on SPARK-19954: -- It seems this issue has already fixed in branch-2.1. So, I think it's okay that you just wait for the next release v2.1.1 (I'm not sure about the release plan though). > Joining to a unioned DataFrame does not produce expected result. > > > Key: SPARK-19954 > URL: https://issues.apache.org/jira/browse/SPARK-19954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Arun Allamsetty >Priority: Blocker > > I found this bug while trying to update from Spark 1.6.1 to 2.1.0. The bug is > that when we try to join two DataFrames, one of which is a result of a union > operation, the result of the join results in data as if the table was joined > only to the first table in the union. This issue is not present in Spark > 2.0.0 or 2.0.1 or 2.0.2, only in 2.1.0. Here's how to reproduce it. > {noformat} > import spark.implicits._ > import org.apache.spark.sql.functions.lit > case class A(id: Long, colA: Boolean) > case class B(id: Long, colB: Int) > case class C(id: Long, colC: Double) > case class X(id: Long, name: String) > val aData = A(1, true) :: Nil > val bData = B(2, 10) :: Nil > val cData = C(3, 9.73D) :: Nil > val xData = X(1, "a") :: X(2, "b") :: X(3, "c") :: Nil > val aDf = spark.createDataset(aData).toDF > val bDf = spark.createDataset(bData).toDF > val cDf = spark.createDataset(cData).toDF > val xDf = spark.createDataset(xData).toDF > val unionDf = > aDf.select($"id", lit("a").as("name"), $"colA", lit(null).as("colB"), > lit(null).as("colC")).union( > bDf.select($"id", lit("b").as("name"), lit(null).as("colA"), $"colB", > lit(null).as("colC"))).union( > cDf.select($"id", lit("c").as("name"), lit(null).as("colA"), > lit(null).as("colB"), $"colC")) > val result = xDf.join(unionDf, unionDf("name") === xDf("name") && > unionDf("id") === xDf("id")) > result.show > {noformat} > The result being > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > +---++---+++++ > {noformat} > Force computing {{unionDf}} using {{count}} does not help change the result > of the join. However, writing the data to disk and reading it back does give > the correct result. But it is definitely not ideal. Interestingly caching the > {{unionDf}} also gives the correct result. > {noformat} > +---++---+++++ > | id|name| id|name|colA|colB|colC| > +---++---+++++ > | 1| a| 1| a|true|null|null| > | 2| b| 2| b|null| 10|null| > | 3| c| 3| c|null|null|9.73| > +---++---+++++ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19957) Inconsistent KMeans initialization mode behavior between ML and MLlib
[ https://issues.apache.org/jira/browse/SPARK-19957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925753#comment-15925753 ] Sean Owen commented on SPARK-19957: --- Yeah I think this might be "working as intended". > Inconsist KMeans initialization mode behavior between ML and MLlib > -- > > Key: SPARK-19957 > URL: https://issues.apache.org/jira/browse/SPARK-19957 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: yuhao yang >Priority: Minor > > when users set the initialization mode to "random", KMeans in ML and MLlib > has inconsistent behavior for multiple runs: > MLlib will basically use new Random for each run. > ML Kmeans however will use the default random seed, which is > {code}this.getClass.getName.hashCode.toLong{code}, and keep using the same > number among multiple fitting. > I would expect the "random" initialization mode to be literally random. > There're different solutions with different scope of impact. Adjusting the > hasSeed trait may have a broader impact(but maybe worth discussion). We can > always just set random default seed in KMeans. > Appreciate your feedback. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19958) Support ZStandard Compression
[ https://issues.apache.org/jira/browse/SPARK-19958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lee Dongjin resolved SPARK-19958. - Resolution: Duplicate See: https://issues.apache.org/jira/browse/SPARK-19112 > Support ZStandard Compression > - > > Key: SPARK-19958 > URL: https://issues.apache.org/jira/browse/SPARK-19958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Lee Dongjin > > Hadoop[^1] & HBase[^2] started to support ZStandard Compression from their > recent releases. Supporting this compression codec also requires adding a new > configuration for default compression level, for example, > 'spark.io.compression.zstandard.level.' > [^1]: https://issues.apache.org/jira/browse/HADOOP-13578 > [^2]: https://issues.apache.org/jira/browse/HBASE-16710 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19936) Page rank example takes long time to complete
[ https://issues.apache.org/jira/browse/SPARK-19936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19936. --- Resolution: Not A Problem Set spark.blockManager.port ? > Page rank example takes long time to complete > - > > Key: SPARK-19936 > URL: https://issues.apache.org/jira/browse/SPARK-19936 > Project: Spark > Issue Type: Bug > Components: Block Manager, Mesos >Affects Versions: 2.1.0 > Environment: CentOS 7, Mesos 1.1.0 >Reporter: Stan Teresen > Attachments: 1.stderr, 2.stderr, pr.out > > > Sometimes Page Rank example takes very long time to finish on Mesos due to > exceptions on fetching remote block in RetryingBlockFetcher on executor sides. > As it is seen in the log files attached it took ~30 min for an example to > complete in AWS 3 nodes environment (1 Mesos master and 2 agents). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
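The suggested remedy, spelled out as a sketch (the port number is arbitrary; it just has to be reachable between the nodes so block fetches don't fall back to retries):

{code}
val conf = new org.apache.spark.SparkConf()
  .set("spark.blockManager.port", "31000")  // pin the block manager port instead of using a random one
{code}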
[jira] [Resolved] (SPARK-19895) Spark SQL could not output a correct result
[ https://issues.apache.org/jira/browse/SPARK-19895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19895. --- Resolution: Not A Problem > Spark SQL could not output a correct result > --- > > Key: SPARK-19895 > URL: https://issues.apache.org/jira/browse/SPARK-19895 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Bin Wu >Priority: Minor > Labels: beginner > > I'm rewriting pagerank algorithm with Spark SQL, following code can output a > correct result for large data set, but fails on the small data set as I > provided. > from pyspark.sql.functions import * > from pyspark.sql import SparkSession > spark = SparkSession \ > .builder \ > .appName("Python Spark SQL basic example") \ > .config("spark.some.config.option", "some-value") \ > .getOrCreate() > numOfIterations = 5 > > lines = spark.read.text("pagerank_data.txt") > a = lines.select(split(lines[0],' ')) > links = a.select(a[0][0].alias('src'), a[0][1].alias('dst')) > outdegrees = links.groupBy('src').count() > ranks = outdegrees.select('src', lit(1).alias('rank')) > for iteration in range(numOfIterations): > contribs = links.join(ranks, 'src').join(outdegrees, 'src').select('dst', > (ranks['rank']/outdegrees['count']).alias('contrib')) > #ranks = > contribs.groupBy('dst').sum('contrib').select(column('dst').alias('src'), > (column('sum(contrib)')*0.85+0.15).alias('rank')) > ranks = > contribs.withColumnRenamed('dst','dst').groupBy('dst').sum('contrib').select(column('dst').alias('src'), > (column('sum(contrib)')*0.85+0.15).alias('rank')) > ranks.orderBy(desc('rank')).show() > pagerank_data.txt: > 1 2 > 1 3 > 1 4 > 2 1 > 3 1 > 4 1 > expected result: > +---+--+ > |src| rank| > +---+--+ > | 1|2.326648124997| > | 3|0.557783958333| > | 2|0.557783958333| > | 4|0.557783958333| > +---+--+ > Wrong result (without "withColumnRenamed") > +---+---+ > |src| rank| > +---+---+ > | 1| 2.326648124997| > | 4|0.3142613194446| > | 3|0.3142613194446| > | 2|0.3142613194446| > | 4|0.3142613194446| > | 4|0.3142613194446| > | 3|0.3142613194446| > | 2|0.3142613194446| > | 3|0.3142613194446| > | 2|0.3142613194446| > +---+---+ > It cannot output correct rank for each node for this small graph only if I > use the "withColumnRenamed". However, on large data set, the line without > withColumnRenamed works correctly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org