[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675656 [Test build #31382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31382/consoleFull) for PR 5647 at commit [`0319821`](https://github.com/apache/spark/commit/0319821db7406f3cca359af5bc021d2f3fd92a17). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31382/ Test PASSed.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406960

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---
    @@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
           .toDF().registerTempTable("caseSensitivityTest")
         val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
    -    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
    +    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
           "The output schema did not preserve the case of the query.")
    --- End diff --

    Yes, I think that in the case-insensitive case we should normalize the table name and attribute name.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406952

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala ---
    @@ -29,12 +29,23 @@ package object analysis {
       /**
        * Resolver should return true if the first string refers to the same entity as the second string.
    -   * For example, by using case insensitive equality.
    +   * For example, by using case insensitive equality. Besides, Resolver also provides the ability
    +   * to normalize the string according to its semantic.
        */
    -  type Resolver = (String, String) => Boolean
    +  trait Resolver {
    +    def apply(a: String, b: String): Boolean
    +    def apply(a: String): String
    +  }
    +
    +  val caseInsensitiveResolution = new Resolver {
    +    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
    +    override def apply(a: String): String = a.toLowerCase // as Hive does
    --- End diff --

    I'd like to keep the first `apply` as it was, because I don't want to impact a lot of existing code. I agree we should rename the second `apply` to `normalize`.
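The shape discussed in the comment above can be sketched as follows. This is a minimal, standalone illustration and not the code in the pull request: the method name `normalize` follows the rename proposed in the review, while the `Resolvers` object and `caseSensitiveResolution` value are assumptions added here for contrast.

```scala
import java.util.Locale

// Sketch only: `normalize` is the rename suggested in the review, not a confirmed Spark API.
trait Resolver {
  /** True if both names refer to the same entity. */
  def apply(a: String, b: String): Boolean

  /** Canonical spelling of a name under this resolver's semantics. */
  def normalize(a: String): String
}

// `Resolvers` is a hypothetical holder object used only for this illustration.
object Resolvers {
  val caseInsensitiveResolution: Resolver = new Resolver {
    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
    // Lower-casing mirrors the "as Hive does" remark in the diff.
    override def normalize(a: String): String = a.toLowerCase(Locale.ROOT)
  }

  val caseSensitiveResolution: Resolver = new Resolver {
    override def apply(a: String, b: String): Boolean = a == b
    override def normalize(a: String): String = a
  }
}
```

Keeping the two-argument `apply` unchanged means existing call sites of the form `resolver(a, b)` would continue to compile, which is the concern raised in the comment.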
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406904

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -0,0 +1,55 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.sql.execution.stat.FrequentItems
    +
    +/**
    + * :: Experimental ::
    + * Statistic functions for [[DataFrame]]s.
    + */
    +@Experimental
    +final class DataFrameStatFunctions private[sql](df: DataFrame) {
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   *
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  def freqItems(cols: Seq[String], support: Double): DataFrame = {
    --- End diff --

    Also, make sure you add a test to the JavaDataFrameSuite.
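The Scaladoc above mentions a frequent element count algorithm that may return false positives. The sketch below is a single-machine, counter-based variant written for illustration under stated assumptions: it is not the `FrequentItems` implementation from the pull request, the object name `FreqItemsSketch` is hypothetical, and it operates on a plain Scala collection rather than a DataFrame column.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark.
object FreqItemsSketch {
  /**
   * Returns a set of candidates that contains every item whose relative
   * frequency exceeds `support`. It may also contain less frequent items
   * (false positives), because no second counting pass is made.
   */
  def freqItems[T](items: Iterable[T], support: Double): Set[T] = {
    require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
    // Bounded number of counters: memory is O(1 / support), independent of input size.
    val maxCounters = math.ceil(1.0 / support).toInt
    val counts = mutable.HashMap.empty[T, Long]
    for (item <- items) {
      if (counts.contains(item)) {
        counts(item) += 1L
      } else if (counts.size < maxCounters) {
        counts(item) = 1L
      } else {
        // No free counter: decrement every counter and drop those that reach zero.
        counts.keys.toList.foreach { key =>
          val decremented = counts(key) - 1L
          if (decremented == 0L) counts.remove(key) else counts(key) = decremented
        }
      }
    }
    counts.keySet.toSet
  }
}
```

For example, `FreqItemsSketch.freqItems(Seq(1, 1, 1, 1, 2, 3, 1, 4, 1, 5), 0.5)` keeps `1`, which occurs in more than half of the positions. A Java-facing test, as requested in the comment, would exercise the same guarantee through the Java API.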
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97683034 Merged build triggered.
[GitHub] spark pull request: [SPARK-6913][SQL] Fixed java.sql.SQLException...
Github user SlavikBaranov commented on the pull request: https://github.com/apache/spark/pull/5782#issuecomment-97683350 Thanks for the comments, fixed.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97683624 [Test build #31393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31393/consoleFull) for PR 5744 at commit [`c87f517`](https://github.com/apache/spark/commit/c87f51774a8e4f488557865657e8974d2c06ba4b).
[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance
GitHub user NathanHowell opened a pull request: https://github.com/apache/spark/pull/5801

    [SPARK-5938][SQL] Improve JsonRDD performance

    This patch comprises a few related pieces of work:

    * Schema inference is performed directly on the JSON token stream
    * `String => Row` conversion populates Spark SQL structures without intermediate types
    * Projection pushdown is implemented via CatalystScan for DataFrame queries

    I've run some basic queries on a 300MB/100k row dataset with a flat schema and the results are promising:

    * Before: ```INFO DAGScheduler: Job 8 finished: count at console:20, took 2.916653 s```
    * After: ```INFO DAGScheduler: Job 8 finished: count at console:20, took 2.184896 s```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NathanHowell/spark json-performance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5801.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5801

commit 1e441e23a2cfd8712720a728056e363e41538d1f
Author: Nathan Howell <nhow...@godaddy.com>
Date: 2015-04-29T05:44:19Z

    Eliminate arrow pattern, replace with pattern matches

commit 73a56927d09c670eb62317f611c47a90096fe693
Author: Nathan Howell <nhow...@godaddy.com>
Date: 2015-04-27T22:38:28Z

    Improve JSON parsing and type inference performance

commit 1abf1d6010c71cd1cffa97d7564f8fb71eb19f10
Author: Nathan Howell <nhow...@godaddy.com>
Date: 2015-04-30T02:16:33Z

    Add projection pushdown support to JsonRDD
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97692345 [Test build #31397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31397/consoleFull) for PR 5680 at commit [`3ad00d9`](https://github.com/apache/spark/commit/3ad00d9d1171cdf0563167a0e368482fb798043b).
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97692357 A different test case failed this time. Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29408785

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/InputInfoTracker.scala ---
    @@ -0,0 +1,74 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.streaming.scheduler
    +
    +import scala.collection.mutable
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.streaming.{Time, StreamingContext}
    +
    +/** To track the information of input stream at specified batch time. */
    +case class InputInfo(batchTime: Time, inputStreamId: Int, numRecords: Long)
    +
    +/**
    + * This class manages all the input streams as well as their input data statistics. The information
    + * will output to StreamingListener to better monitoring.
    + */
    +private[streaming] class InputInfoTracker(ssc: StreamingContext) extends Logging {
    +
    +  /** Track all the input streams registered in DStreamGraph */
    +  val inputStreams = ssc.graph.getInputStreams()
    --- End diff --

    Can this be private?
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29408797

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/InputInfoTracker.scala ---
    @@ -0,0 +1,74 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.streaming.scheduler
    +
    +import scala.collection.mutable
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.streaming.{Time, StreamingContext}
    +
    +/** To track the information of input stream at specified batch time. */
    +case class InputInfo(batchTime: Time, inputStreamId: Int, numRecords: Long)
    +
    +/**
    + * This class manages all the input streams as well as their input data statistics. The information
    + * will output to StreamingListener to better monitoring.
    + */
    +private[streaming] class InputInfoTracker(ssc: StreamingContext) extends Logging {
    +
    +  /** Track all the input streams registered in DStreamGraph */
    +  val inputStreams = ssc.graph.getInputStreams()
    +  /** Track all the id of input streams registered in DStreamGraph */
    +  val inputStreamIds = inputStreams.map(_.id)
    --- End diff --

    Can this be private?
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29408816

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/InputInfoTracker.scala ---
    @@ -0,0 +1,74 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.streaming.scheduler
    +
    +import scala.collection.mutable
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.streaming.{Time, StreamingContext}
    +
    +/** To track the information of input stream at specified batch time. */
    +case class InputInfo(batchTime: Time, inputStreamId: Int, numRecords: Long)
    +
    +/**
    + * This class manages all the input streams as well as their input data statistics. The information
    + * will output to StreamingListener to better monitoring.
    + */
    +private[streaming] class InputInfoTracker(ssc: StreamingContext) extends Logging {
    +
    +  /** Track all the input streams registered in DStreamGraph */
    +  val inputStreams = ssc.graph.getInputStreams()
    +  /** Track all the id of input streams registered in DStreamGraph */
    +  val inputStreamIds = inputStreams.map(_.id)
    +
    +  // Map to track all the InputInfo related to specific batch time and input stream.
    +  private val batchTimeToInputInfos = new mutable.HashMap[Time, mutable.HashMap[Int, InputInfo]]
    +
    +  /** Report the input information with batch time to the tracker */
    +  def reportInfo(batchTime: Time, inputInfo: InputInfo): Unit = synchronized {
    +    val inputInfos = batchTimeToInputInfos.getOrElseUpdate(batchTime,
    +      new mutable.HashMap[Int, InputInfo]())
    +
    +    if (inputInfos.contains(inputInfo.inputStreamId)) {
    +      throw new IllegalStateException(s"Input stream ${inputInfo.inputStreamId}} for batch" +
    +        s"$batchTime is already added into InputInfoTracker, this is a illegal state")
    +    }
    +    inputInfos += ((inputInfo.inputStreamId, inputInfo))
    +  }
    +
    +  /** Get the all the input stream's information of specified batch time */
    +  def getInfo(batchTime: Time): Map[Int, InputInfo] = synchronized {
    +    val inputInfos = batchTimeToInputInfos.get(batchTime)
    +    // Convert mutable HashMap to immutable Map for the caller
    +    inputInfos.map(_.toMap).getOrElse(Map[Int, InputInfo]())
    +  }
    +
    +  /** Get the input information of specified batch time and input stream id */
    +  def getInfoOfBatchAndStream(batchTime: Time, inputStreamId: Int
    --- End diff --

    This is not used anywhere other than tests. Is it necessary?
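To make the question above concrete, here is a simplified, Spark-free sketch of the tracker in which tests derive the per-stream lookup from `getInfo` instead of a dedicated method. It is an illustration under assumptions, not the code in the pull request: `Time` is replaced by a plain `Long` timestamp, and the class name `InputInfoTrackerSketch` is hypothetical.

```scala
import scala.collection.mutable

// Simplified stand-in: the real InputInfo uses org.apache.spark.streaming.Time.
final case class InputInfo(batchTime: Long, inputStreamId: Int, numRecords: Long)

// Hypothetical, dependency-free version of the tracker quoted in the diff.
class InputInfoTrackerSketch {
  // batch time -> (input stream id -> reported info)
  private val batchTimeToInputInfos =
    new mutable.HashMap[Long, mutable.HashMap[Int, InputInfo]]

  /** Record the info for one stream in one batch; a second report is rejected. */
  def reportInfo(batchTime: Long, inputInfo: InputInfo): Unit = synchronized {
    val inputInfos = batchTimeToInputInfos.getOrElseUpdate(
      batchTime, new mutable.HashMap[Int, InputInfo]())
    if (inputInfos.contains(inputInfo.inputStreamId)) {
      throw new IllegalStateException(
        s"Input stream ${inputInfo.inputStreamId} for batch $batchTime is already added")
    }
    inputInfos += ((inputInfo.inputStreamId, inputInfo))
  }

  /** Immutable snapshot of all stream infos for a batch (empty if none reported). */
  def getInfo(batchTime: Long): Map[Int, InputInfo] = synchronized {
    batchTimeToInputInfos.get(batchTime).map(_.toMap).getOrElse(Map.empty[Int, InputInfo])
  }
}
```

With this shape, a test can write `tracker.getInfo(batchTime).get(streamId)` to obtain an `Option[InputInfo]`, so a separate `getInfoOfBatchAndStream` may not be needed.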
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97695062 Yes, I will do this. Please take a look at the whole design, thanks a lot :)
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29410293

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala ---
    @@ -70,9 +70,8 @@ private[streaming] class StreamingJobProgressListener(ssc: StreamingContext)
         runningBatchInfos(batchStarted.batchInfo.batchTime) = batchStarted.batchInfo
         waitingBatchInfos.remove(batchStarted.batchInfo.batchTime)

    -    batchStarted.batchInfo.receivedBlockInfo.foreach { case (_, infos) =>
    -      totalReceivedRecords += infos.map(_.numRecords).sum
    -    }
    +    // TODO. this should be fixed when input stream is not receiver based stream.
    +    totalReceivedRecords += batchStarted.batchInfo.streamIdToNumRecords.values.sum
    --- End diff --

    Yes, will do. I also have one concern: if `batchStarted` is not a receiver-based batchInfo, do we need to count its records into `totalReceivedRecords` when the batch has just started?
[GitHub] spark pull request: [SPARK-6612] [MLLib] [PySpark] Python KMeans p...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5647#issuecomment-97675700 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-1406] Mllib pmml model export
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3062
[GitHub] spark pull request: SPARK-6846 [WEBUI] Stage kill URL easy to acci...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5528#issuecomment-97676514 That's fine, in the sense that the endpoint returns no data. OK, so it works except for this proxying. Hm, surely the YARN proxy can pass on a POST. We'll have to look into this. Any wisdom from YARN folks about where to look?
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406469

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---
    @@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
           .toDF().registerTempTable("caseSensitivityTest")
         val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
    -    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
    +    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
           "The output schema did not preserve the case of the query.")
    --- End diff --

    Supporting normalization is good. However, when the case is specified explicitly in the query, shouldn't we preserve the case of the query instead of normalizing it like this?
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97691954 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97691943 [Test build #31386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31386/consoleFull) for PR 5799 at commit [`8279d4d`](https://github.com/apache/spark/commit/8279d4d4cb09f78e2f8f83f9a3738101b940ed40). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97691955 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31386/ Test PASSed.
[GitHub] spark pull request: [SPARK-7196][SQL] Support precision and scale ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5777#issuecomment-97695518 @viirya apparently this doesn't fix the problem. Can you look into it more? Thanks.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29408973

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/BatchInfo.scala ---
    @@ -32,7 +33,7 @@ import org.apache.spark.streaming.Time
     @DeveloperApi
     case class BatchInfo(
         batchTime: Time,
    -    receivedBlockInfo: Map[Int, Array[ReceivedBlockInfo]],
    +    streamIdToNumRecords: Map[Int, Long],
         submissionTime: Long,
         processingStartTime: Option[Long],
         processingEndTime: Option[Long]
    --- End diff --

    Can you add a method called `numRecords` which returns the sum? This is the same approach taken by @zsxwing in #5533, so it will be easier to resolve merge conflicts later.
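The requested helper could look roughly like the sketch below. It is a standalone illustration, not the merged Spark code: `BatchInfoSketch` is a hypothetical name, and `Time` is replaced by a `Long` timestamp so the snippet compiles without Spark on the classpath.

```scala
// Hypothetical stand-in for BatchInfo; field names follow the diff above.
final case class BatchInfoSketch(
    batchTime: Long,
    streamIdToNumRecords: Map[Int, Long],
    submissionTime: Long,
    processingStartTime: Option[Long],
    processingEndTime: Option[Long]) {

  /** Total number of records received across all input streams in this batch. */
  def numRecords: Long = streamIdToNumRecords.values.sum
}
```

With such a method, a listener could accumulate `batchInfo.numRecords` directly instead of repeating `streamIdToNumRecords.values.sum` at each call site.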
[GitHub] spark pull request: [SPARK-7274][SQL] Create Column expression for...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5802#discussion_r29410114

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -363,6 +380,28 @@ object functions {
       def sqrt(e: Column): Column = Sqrt(e.expr)

       /**
    +   * Creates a new struct column. The input column must be a column in a [[DataFrame]], or
    +   * a derived column expression that is named (i.e. aliased).
    +   *
    +   * @group normal_funcs
    +   */
    +  @scala.annotation.varargs
    +  def struct(cols: Column*): Column = {
    --- End diff --

    Do we allow empty input `struct()`?
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29410045
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala ---
@@ -135,28 +132,25 @@ private[streaming] class StreamingJobProgressListener(ssc: StreamingContext)
   def receivedRecordsDistributions: Map[Int, Option[Distribution]] = synchronized {
     val latestBatchInfos = retainedBatches.reverse.take(batchInfoLimit)
-    val latestBlockInfos = latestBatchInfos.map(_.receivedBlockInfo)
-    (0 until numReceivers).map { receiverId =>
-      val blockInfoOfParticularReceiver = latestBlockInfos.map { batchInfo =>
-        batchInfo.get(receiverId).getOrElse(Array.empty)
-      }
-      val recordsOfParticularReceiver = blockInfoOfParticularReceiver.map { blockInfo =>
-        // calculate records per second for each batch
-        blockInfo.map(_.numRecords).sum.toDouble * 1000 / batchDuration
-      }
-      val distributionOption = Distribution(recordsOfParticularReceiver)
-      (receiverId, distributionOption)
+
+    // TODO. this should be fixed when receiver-less input stream is mixed into BatchInfo
--- End diff --
This is what concerns me a lot. `BatchInfo`'s `streamIdToNumRecords` will now contain the statistics of all input streams, not only the receiver-based ones, so do we need to differentiate the statistics?
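To make the question concrete, here is a hedged sketch of how per-stream counts could be split into receiver-based and other streams. The object `StreamStatsSketch`, the `receiverStreamIds` set, and both method names are hypothetical; only the records-per-second formula (`count * 1000 / batchDuration`) comes from the removed code in the diff.

```scala
object StreamStatsSketch {
  /** Records per second for one batch; batchDurationMs is in milliseconds. */
  def recordsPerSecond(numRecords: Long, batchDurationMs: Long): Double =
    numRecords.toDouble * 1000 / batchDurationMs

  /**
   * Splits per-stream record counts into (receiver-based, other), given a
   * hypothetical set of stream IDs known to be backed by receivers.
   */
  def splitByReceiver(
      streamIdToNumRecords: Map[Int, Long],
      receiverStreamIds: Set[Int]): (Map[Int, Long], Map[Int, Long]) =
    streamIdToNumRecords.partition { case (id, _) => receiverStreamIds.contains(id) }
}
```

Whether the UI should report the two groups separately is exactly the open design question in this thread; the sketch only shows that the split itself is cheap once the receiver stream IDs are known.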
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97704110 [Test build #31404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31404/consoleFull) for PR 5799 at commit [`3a5c177`](https://github.com/apache/spark/commit/3a5c177e247ddb44a38e4ee4211c57ec3cad58eb).
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97709044 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97709002 [Test build #31408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31408/consoleFull) for PR 5680 at commit [`8325787`](https://github.com/apache/spark/commit/8325787bf13bcca16a405561413f1d81b3229941).
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97709046 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31392/ Test PASSed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97709011 [Test build #31392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31392/consoleFull) for PR 5799 at commit [`482e741`](https://github.com/apache/spark/commit/482e74180445d30d0b5a769cd5f9bd0e94abfd17). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch **adds the following new dependencies:** * `jaxb-api-2.2.7.jar` * `jaxb-core-2.2.7.jar` * `jaxb-impl-2.2.7.jar` * `pmml-agent-1.1.15.jar` * `pmml-model-1.1.15.jar` * `pmml-schema-1.1.15.jar` * This patch **removes the following dependencies:** * `activation-1.1.jar` * `jaxb-api-2.2.2.jar` * `jaxb-impl-2.2.3-1.jar`
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97708818 Merged build triggered.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97708865 Merged build started.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97712927 [Test build #31411 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31411/consoleFull) for PR 5680 at commit [`8325787`](https://github.com/apache/spark/commit/8325787bf13bcca16a405561413f1d81b3229941).
[GitHub] spark pull request: [SPARK-7267][SQL]Push down Project when it's c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5797#issuecomment-97715468 Merged build finished. Test PASSed.
[GitHub] spark pull request: SPARK-7171: Added a method to retrieve metrics...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5805#issuecomment-97715782 [Test build #31412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31412/consoleFull) for PR 5805 at commit [`9ed86ca`](https://github.com/apache/spark/commit/9ed86cabbd07de338adfe3153afa0ed4b005cee7).
[GitHub] spark pull request: [SPARK-7267][SQL]Push down Project when it's c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5797#issuecomment-97715469 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31396/ Test PASSed.
[GitHub] spark pull request: [SPARK-6967] [SQL] fix date type convertion in...
Github user nadavoosh commented on the pull request: https://github.com/apache/spark/pull/5590#issuecomment-97717732 Hi @adrian-wang! I am using Spark and needed to include this fix, since I am reading from a table that has Date types. I just ran into a new problem, though: when the Date field has null values, Spark throws a `java.lang.NullPointerException` at `org.apache.spark.sql.types.DateUtils$.javaDateToDays(DateUtils.scala:39)`. Any ideas on how I can fix that?
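A hedged sketch of one way to guard a date conversion against null input. `NullSafeDateSketch` and both of its methods are hypothetical and are not Spark's `DateUtils`; the day calculation here ignores time zones, which the real implementation may handle differently.

```scala
import java.sql.Date
import java.util.concurrent.TimeUnit

object NullSafeDateSketch {
  // Hypothetical conversion: whole days since the epoch, ignoring time zones.
  // Like the reported failure, this throws a NullPointerException on null.
  def javaDateToDays(d: Date): Int =
    TimeUnit.MILLISECONDS.toDays(d.getTime).toInt

  // Wrapping the nullable value in Option turns null into None
  // instead of dereferencing it.
  def nullSafeDays(d: Date): Option[Int] = Option(d).map(javaDateToDays)
}
```

The same idea applies at the call site: check for null (or use `Option`) before the value reaches the conversion function.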
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97717662 [Test build #31398 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31398/consoleFull) for PR 5609 at commit [`8d3fc16`](https://github.com/apache/spark/commit/8d3fc16dd22c87fbf768951b64dabe7d121731ec). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch **adds the following new dependencies:** * `jaxb-api-2.2.7.jar` * `jaxb-core-2.2.7.jar` * `jaxb-impl-2.2.7.jar` * `pmml-agent-1.1.15.jar` * `pmml-model-1.1.15.jar` * `pmml-schema-1.1.15.jar` * This patch **removes the following dependencies:** * `activation-1.1.jar` * `jaxb-api-2.2.2.jar` * `jaxb-impl-2.2.3-1.jar`
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97717674 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31398/ Test PASSed.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677458 I will let @marmbrus take a look at this tomorrow. Meantime, can you add the apply method and Python getitem method? Thanks.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677730 Merged build triggered.
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user sven0726 closed the pull request at: https://github.com/apache/spark/pull/5800
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677728 @rxin , already done :)
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user sven0726 opened a pull request: https://github.com/apache/spark/pull/5800
Merge pull request #1 from apache/master
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sven0726/spark-1 master
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5800.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5800
commit 5da8c543ff542951c3fefe6e123b891f66edf4b6
Author: sven0726 sven0...@gmail.com
Date: 2015-04-27T08:21:55Z
Merge pull request #1 from apache/master
2015-04-27 first merge
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406784
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---
@@ -0,0 +1,127 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements. See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License. You may obtain a copy of the License at
+*
+*    http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.execution.stat
+
+import org.apache.spark.Logging
+import org.apache.spark.sql.{Column, DataFrame, Row}
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
+
+import scala.collection.mutable.{Map => MutableMap}
+
+private[sql] object FrequentItems extends Logging {
+
+  /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
+  private class FreqItemCounter(size: Int) extends Serializable {
+    val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
+
+    /**
+     * Add a new example to the counts if it exists, otherwise deduct the count
+     * from existing items.
+     */
+    def add(key: Any, count: Long): this.type = {
+      if (baseMap.contains(key)) {
+        baseMap(key) += count
+      } else {
+        if (baseMap.size < size) {
+          baseMap += key -> count
+        } else {
+          // TODO: Make this more efficient... A flatMap?
+          baseMap.retain((k, v) => v > count)
+          baseMap.transform((k, v) => v - count)
+        }
+      }
+      this
+    }
+
+    /**
+     * Merge two maps of counts.
+     * @param other The map containing the counts for that partition
+     */
+    def merge(other: FreqItemCounter): this.type = {
+      other.toSeq.foreach { case (k, v) =>
+        add(k, v)
+      }
+      this
+    }
+
+    def toSeq: Seq[(Any, Long)] = baseMap.toSeq
--- End diff --
You don't need this, do you? You can just operate on the map directly. I'm asking because I'm not sure whether `baseMap.toSeq` materializes a whole Seq, which might be unnecessary.
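The counter in the diff can be sketched as a standalone class to show the suggestion of iterating over the map directly in `merge`. This is an illustration under assumptions, not the code that was merged: `FreqItemCounterSketch`, its type parameter, and the public `counts` field are hypothetical, and the eviction step is written with explicit updates instead of `retain`/`transform`.

```scala
import scala.collection.mutable

// Hypothetical standalone variant of the counter from the diff above.
class FreqItemCounterSketch[K](size: Int) extends Serializable {
  val counts: mutable.Map[K, Long] = mutable.Map.empty[K, Long]

  /** Adds `count` for `key`; when the map is full, deducts `count` from every entry. */
  def add(key: K, count: Long): this.type = {
    if (counts.contains(key)) {
      counts(key) += count
    } else if (counts.size < size) {
      counts(key) = count
    } else {
      // Take a snapshot so the map is not mutated while it is being iterated.
      for ((k, v) <- counts.toList) {
        if (v > count) counts(k) = v - count else counts.remove(k)
      }
    }
    this
  }

  /** Merges another counter by iterating its map directly, without an intermediate Seq. */
  def merge(other: FreqItemCounterSketch[K]): this.type = {
    other.counts.foreach { case (k, v) => add(k, v) }
    this
  }
}
```

For example, with `size = 2`, adding `("a", 3)`, `("b", 2)` and then `("c", 1)` leaves `a -> 2` and `b -> 1`: the new key is not stored and every existing count drops by one.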
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406755
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
@@ -0,0 +1,55 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements. See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License. You may obtain a copy of the License at
+*
+*    http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.sql.execution.stat.FrequentItems
+
+/**
+ * :: Experimental ::
+ * Statistic functions for [[DataFrame]]s.
+ */
+@Experimental
+final class DataFrameStatFunctions private[sql](df: DataFrame) {
+
+  /**
+   * Finding frequent items for columns, possibly with false positives. Using the
+   * frequent element count algorithm described in
+   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
+   *
--- End diff --
Make sure you document the range of support allowed.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97683042 Merged build started.
[GitHub] spark pull request: [SPARK-7267][SQL]Push down Project when it's c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5797#issuecomment-97688351 Merged build started.
[GitHub] spark pull request: [SPARK-7267][SQL]Push down Project when it's c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5797#issuecomment-97688243 Merged build triggered.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29408488
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---
@@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "b", "a", "b", "a", "b", "a", "b"),
       "The output schema did not preserve the case of the query.")
--- End diff --
OK, I see your point. I will keep the change minimal.
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97695851 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31389/ Test PASSed.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29409065
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala ---
@@ -70,9 +70,8 @@ private[streaming] class StreamingJobProgressListener(ssc: StreamingContext)
     runningBatchInfos(batchStarted.batchInfo.batchTime) = batchStarted.batchInfo
     waitingBatchInfos.remove(batchStarted.batchInfo.batchTime)
-    batchStarted.batchInfo.receivedBlockInfo.foreach { case (_, infos) =>
-      totalReceivedRecords += infos.map(_.numRecords).sum
-    }
+    // TODO. this should be fixed when input stream is not receiver based stream.
+    totalReceivedRecords += batchStarted.batchInfo.streamIdToNumRecords.values.sum
--- End diff --
This can be replaced by `batchStarted.batchInfo.numRecords` if you implement `numRecords` as I said above.
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97695850 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7031][ThriftServer]let thrift server ta...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5609#issuecomment-97695837 [Test build #31389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31389/consoleFull) for PR 5609 at commit [`8d3fc16`](https://github.com/apache/spark/commit/8d3fc16dd22c87fbf768951b64dabe7d121731ec). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97695959 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31388/ Test PASSed.
[GitHub] spark pull request: [SPARK-7249] Updated Hadoop dependencies due t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5786#issuecomment-97698375 [Test build #31400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31400/consoleFull) for PR 5786 at commit [`7e9955d`](https://github.com/apache/spark/commit/7e9955df29b5d5c9cda950636d51da753e6d17ea).
[GitHub] spark pull request: [SPARK-7274][SQL] Create Column expression for...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5802#issuecomment-97700907 [Test build #31401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31401/consoleFull) for PR 5802 at commit [`0603a91`](https://github.com/apache/spark/commit/0603a915a75ce1429d0ceca843081602ce17c500).
[GitHub] spark pull request: [SPARK-7274][SQL] Create Column expression for...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5802#issuecomment-97700671 Merged build started.
[GitHub] spark pull request: [SPARK-6888][SQL] Make the jdbc driver handlin...
Github user rtreffer commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-97702611 Still no build :cry:
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29411128
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---
@@ -81,9 +81,11 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "B", "a", "B", "a", "b", "A", "B"),
--- End diff --
I'm not sure what we really want here. When a user runs `SELECT b FROM t` and `t` has a column `B`, which name should we use in the result schema: `b` or `B`?
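To make the question above concrete, here is a minimal sketch of the three naming policies the thread is weighing. It is an illustration only: `OutputNaming`, `Policy`, and `outputName` are hypothetical names, not Spark APIs, and the matching rule is assumed to be plain case-insensitive equality.

```scala
// Hypothetical illustration; none of these names exist in Spark.
object OutputNaming {
  sealed trait Policy
  case object PreserveQueryCase extends Policy // keep the spelling used in the SELECT list
  case object UseTableCase extends Policy      // keep the spelling of the table's column
  case object Normalize extends Policy         // lower-case the name

  /** Name reported in the result schema for `queryName`, or None if no column matches. */
  def outputName(queryName: String, tableColumns: Seq[String], policy: Policy): Option[String] =
    tableColumns.find(_.equalsIgnoreCase(queryName)).map { tableName =>
      policy match {
        case PreserveQueryCase => queryName
        case UseTableCase      => tableName
        case Normalize         => tableName.toLowerCase(java.util.Locale.ROOT)
      }
    }
}
```

For `SELECT b FROM t` where `t` has a column `B`, the first policy reports `b`, the second `B`, and the third `b`; the first and third only differ when the query itself uses upper case.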
[GitHub] spark pull request: [SPARK-6999] [SQL] Remove the infinite recursi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5804#issuecomment-97706125 Merged build started.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29412354
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala ---
@@ -81,9 +81,11 @@ class HiveResolutionSuite extends HiveComparisonTest {
       .toDF().registerTempTable("caseSensitivityTest")
     val query = sql("SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest")
-    assert(query.schema.fields.map(_.name) === Seq("a", "b", "A", "B", "a", "b", "A", "B"),
+    assert(query.schema.fields.map(_.name) === Seq("a", "B", "a", "B", "a", "b", "A", "B"),
--- End diff --
Does that matter for a case-insensitive system? We do need to keep the attribute name identical along the reference chain, though. This is a workaround for the bug; in the long term, we probably need to refactor the name `equality` of AttributeReference (or take the Resolver in?).
[GitHub] spark pull request: [SPARK-5342][YARN] Allow long running Spark ap...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4688#issuecomment-97712538 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31395/
[GitHub] spark pull request: [SPARK-4699][SQL] make caseSensitive configura...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5806#issuecomment-97712608 Merged build triggered.
[GitHub] spark pull request: [SPARK-5342][YARN] Allow long running Spark ap...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4688#issuecomment-97712526 [Test build #31395 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31395/consoleFull) for PR 4688 at commit [`36eb8a9`](https://github.com/apache/spark/commit/36eb8a956c357388e4fdf5858cb4f27236f26a9e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch **removes the following dependencies:** * `RoaringBitmap-0.4.5.jar` * `activation-1.1.jar` * `akka-actor_2.10-2.3.4-spark.jar` * `akka-remote_2.10-2.3.4-spark.jar` * `akka-slf4j_2.10-2.3.4-spark.jar` * `aopalliance-1.0.jar` * `arpack_combined_all-0.1.jar` * `avro-1.7.7.jar` * `breeze-macros_2.10-0.11.2.jar` * `breeze_2.10-0.11.2.jar` * `chill-java-0.5.0.jar` * `chill_2.10-0.5.0.jar` * `commons-beanutils-1.7.0.jar` * `commons-beanutils-core-1.8.0.jar` * `commons-cli-1.2.jar` * `commons-codec-1.10.jar` * `commons-collections-3.2.1.jar` * `commons-compress-1.4.1.jar` * `commons-configuration-1.6.jar` * `commons-digester-1.8.jar` * `commons-httpclient-3.1.jar` * `commons-io-2.1.jar` * `commons-lang-2.5.jar` * `commons-lang3-3.3.2.jar` * `commons-math-2.1.jar` * `commons-math3-3.4.1.jar` * `commons-net-2.2.jar` * `compress-lzf-1.0.0.jar` * `config-1.2.1.jar` * `core-1.1.2.jar` * `curator-client-2.4.0.jar` * `curator-framework-2.4.0.jar` * `curator-recipes-2.4.0.jar` * `gmbal-api-only-3.0.0-b023.jar` * `grizzly-framework-2.1.2.jar` * `grizzly-http-2.1.2.jar` * `grizzly-http-server-2.1.2.jar` * `grizzly-http-servlet-2.1.2.jar` * `grizzly-rcm-2.1.2.jar` * `groovy-all-2.3.7.jar` * `guava-14.0.1.jar` * `guice-3.0.jar` * `hadoop-annotations-2.2.0.jar` * `hadoop-auth-2.2.0.jar` * `hadoop-client-2.2.0.jar` * `hadoop-common-2.2.0.jar` * `hadoop-hdfs-2.2.0.jar` * `hadoop-mapreduce-client-app-2.2.0.jar` * `hadoop-mapreduce-client-common-2.2.0.jar` * `hadoop-mapreduce-client-core-2.2.0.jar` * 
`hadoop-mapreduce-client-jobclient-2.2.0.jar` * `hadoop-mapreduce-client-shuffle-2.2.0.jar` * `hadoop-yarn-api-2.2.0.jar` * `hadoop-yarn-client-2.2.0.jar` * `hadoop-yarn-common-2.2.0.jar` * `hadoop-yarn-server-common-2.2.0.jar` * `ivy-2.4.0.jar` * `jackson-annotations-2.4.0.jar` * `jackson-core-2.4.4.jar` * `jackson-core-asl-1.8.8.jar` * `jackson-databind-2.4.4.jar` * `jackson-jaxrs-1.8.8.jar` * `jackson-mapper-asl-1.8.8.jar` * `jackson-module-scala_2.10-2.4.4.jar` * `jackson-xc-1.8.8.jar` * `jansi-1.4.jar` * `javax.inject-1.jar` * `javax.servlet-3.0.0.v201112011016.jar` * `javax.servlet-3.1.jar` * `javax.servlet-api-3.0.1.jar` * `jaxb-api-2.2.2.jar` * `jaxb-impl-2.2.3-1.jar` * `jcl-over-slf4j-1.7.10.jar` * `jersey-client-1.9.jar` * `jersey-core-1.9.jar` * `jersey-grizzly2-1.9.jar` * `jersey-guice-1.9.jar` * `jersey-json-1.9.jar` * `jersey-server-1.9.jar` * `jersey-test-framework-core-1.9.jar` * `jersey-test-framework-grizzly2-1.9.jar` * `jets3t-0.7.1.jar` * `jettison-1.1.jar` * `jetty-util-6.1.26.jar` * `jline-0.9.94.jar` * `jline-2.10.4.jar` * `jodd-core-3.6.3.jar` * `json4s-ast_2.10-3.2.10.jar` * `json4s-core_2.10-3.2.10.jar` * `json4s-jackson_2.10-3.2.10.jar` * `jsr305-1.3.9.jar` * `jtransforms-2.4.0.jar` * `jul-to-slf4j-1.7.10.jar` * `kryo-2.21.jar` * `log4j-1.2.17.jar` * `lz4-1.2.0.jar` * `management-api-3.0.0-b012.jar` * `mesos-0.21.0-shaded-protobuf.jar` * `metrics-core-3.1.0.jar` * `metrics-graphite-3.1.0.jar` * `metrics-json-3.1.0.jar` * `metrics-jvm-3.1.0.jar` * `minlog-1.2.jar` * `netty-3.8.0.Final.jar` * `netty-all-4.0.23.Final.jar` * `objenesis-1.2.jar` * `opencsv-2.3.jar` * `oro-2.0.8.jar` * `paranamer-2.6.jar` * `parquet-column-1.6.0rc3.jar` * `parquet-common-1.6.0rc3.jar` * `parquet-encoding-1.6.0rc3.jar` * `parquet-format-2.2.0-rc1.jar` * `parquet-generator-1.6.0rc3.jar` * `parquet-hadoop-1.6.0rc3.jar` * `parquet-jackson-1.6.0rc3.jar` * `protobuf-java-2.4.1.jar` * `protobuf-java-2.5.0-spark.jar` * `py4j-0.8.2.1.jar` * `pyrolite-2.0.1.jar` * 
`quasiquotes_2.10-2.0.1.jar` * `reflectasm-1.07-shaded.jar` * `scala-compiler-2.10.4.jar` * `scala-library-2.10.4.jar` *
[GitHub] spark pull request: [SPARK-4699][SQL] make caseSensitive configura...
GitHub user scwf opened a pull request: https://github.com/apache/spark/pull/5806
[SPARK-4699][SQL] make caseSensitive configurable in Analyzer.scala
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/scwf/spark case
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/5806.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #5806
commit 578d167bfccdc2d1d5ce9ca06cab7b7b753bb3eb Author: Jacky Li jacky.li...@huawei.com Date: 2014-12-02T17:33:44Z make caseSensitive configurable
commit f57f15ce72652b3a04229a860a2aba22297368b8 Author: Jacky Li jacky.li...@huawei.com Date: 2014-12-03T17:48:04Z add testcase
commit 91b1b9606055211cfab409dbdecaa708aa83be34 Author: Jacky Li jacky.li...@huawei.com Date: 2014-12-20T16:38:19Z make caseSensitive configurable in Analyzer
commit e7bca31f6856a4fe2e301bc2ea608d709dcbe334 Author: Jacky Li jacky.li...@huawei.com Date: 2014-12-20T17:25:20Z make caseSensitive configuration in Analyzer and Catalog
commit fcbf0d9162574cf6f28dc703224e23d357f0aad9 Author: Jacky Li jacky.li...@huawei.com Date: 2014-12-20T17:36:44Z fix scalastyle check
commit 6332e0ffeac2180406cabfe789ef0ba697b49fa9 Author: Jacky Li jacky.li...@huawei.com Date: 2015-01-03T13:56:51Z fix bug
commit 005c56d7a4a9c0797870810efe227d3cef225b12 Author: Jacky Li jacky.li...@huawei.com Date: 2015-01-03T14:08:27Z make SQLContext caseSensitivity configurable
commit 9bf4cc7dbb069c4969c5f317590e3e9ddc4efd4f Author: Jacky Li jacky.li...@huawei.com Date: 2015-01-03T14:39:10Z fix bug in catalyst
commit 73c16b13b23e2b9e98ac6fb1864d8c98a3813dfb Author: Jacky Li jacky.li...@huawei.com Date: 2015-01-03T17:02:17Z fix bug in sql/hive
commit 05b09a3c1008869571e438c12e8593def7ecdc2c Author: Jacky Li jacky.li...@huawei.com Date: 2015-01-19T07:42:39Z fix conflict base on the latest master branch
commit dee56e9ae71ebd9c8464cf6be763895e8bcdf2e6 Author: Jacky Li jacky.li...@huawei.com Date: 2015-01-19T09:55:50Z fix test case failure
commit 39e369c67f92105956486624cbc7d937627fd141 Author: Jacky Li jacky.li...@huawei.com Date: 2015-02-03T17:42:21Z fix confilct after DataFrame PR
commit 12eca9a71d05fe74d44e4298f0587af31bf380d4 Author: Jacky Li jacky.li...@huawei.com Date: 2015-02-21T15:29:16Z solve conflict with master
commit 664d1e9e610f2bef172cc3a10de452f1752ca51b Author: Jacky Li jacky.li...@huawei.com Date: 2015-02-21T15:30:37Z Merge branch 'master' of https://github.com/apache/spark into case
commit 56034ca4baa25819b322905490cb0b75543f500c Author: wangfei wangf...@huawei.com Date: 2015-04-30T07:47:38Z fix conflicts and improve for catalystconf
commit 5472b0832213aa0d7f092c06f54095477e695c93 Author: wangfei wangf...@huawei.com Date: 2015-04-30T08:17:28Z fix compile issue
commit 69b3b708c2b78ed2e1061d69ef3e7c3b5e2d94c6 Author: wangfei wangf...@huawei.com Date: 2015-04-30T08:32:35Z fix AnalysisSuite
commit fd30e25f84e569769519282cc3ec39e58a200e87 Author: wangfei wangf...@huawei.com Date: 2015-04-30T08:34:34Z added override
commit 966e719b77e3e8e3e715e58f3a0aeed3b4aba009 Author: wangfei wangf...@huawei.com Date: 2015-04-30T08:41:13Z set CASE_SENSITIVE false in hivecontext
commit 5d7c45618bcc0ba1195e406230972a9c237016c7 Author: wangfei wangf...@huawei.com Date: 2015-04-30T08:46:35Z set CASE_SENSITIVE false in TestHive
commit 6ef31cfb5269e6298349cf97fbe28fcfa43c26ec Author: wangfei wangf...@huawei.com Date: 2015-04-30T08:53:07Z revert pom changes
commit eee75bad4d7eacb73cfc57ea733aed1dcd97ec11 Author: wangfei wangf...@huawei.com Date: 2015-04-30T08:55:34Z fix EmptyConf
commit d5a99337c86c92e705098b95f844b928e5129213 Author: wangfei wangf...@huawei.com Date: 2015-04-30T08:59:04Z fix style
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97678166 [Test build #31390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31390/consoleFull) for PR 5744 at commit [`51719b7`](https://github.com/apache/spark/commit/51719b7f612859219ba31658da4e9582c6ef2856).
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5744#issuecomment-97677743 Merged build started.
[GitHub] spark pull request: [SPARK-7133][SQL] Implement struct, array, and...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5744#discussion_r29406389
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1166,7 +1166,7 @@ def __init__(self, jc):
     # container operators
     __contains__ = _bin_op("contains")
-    __getitem__ = _bin_op("getItem")
+    __getitem__ = _bin_op("apply")
--- End diff --
Can we add a unit test? You can add it in tests.py.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97681263 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5799#discussion_r29406968
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---
@@ -0,0 +1,127 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements. See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License. You may obtain a copy of the License at
+*
+*    http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.execution.stat
+
+import org.apache.spark.Logging
+import org.apache.spark.sql.{Column, DataFrame, Row}
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
+
+import scala.collection.mutable.{Map => MutableMap}
+
+private[sql] object FrequentItems extends Logging {
+
+  /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
+  private class FreqItemCounter(size: Int) extends Serializable {
+    val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
+
+    /**
+     * Add a new example to the counts if it exists, otherwise deduct the count
+     * from existing items.
+     */
+    def add(key: Any, count: Long): this.type = {
+      if (baseMap.contains(key)) {
+        baseMap(key) += count
+      } else {
+        if (baseMap.size < size) {
+          baseMap += key -> count
+        } else {
+          // TODO: Make this more efficient... A flatMap?
+          baseMap.retain((k, v) => v > count)
+          baseMap.transform((k, v) => v - count)
+        }
+      }
+      this
+    }
+
+    /**
+     * Merge two maps of counts.
+     * @param other The map containing the counts for that partition
+     */
+    def merge(other: FreqItemCounter): this.type = {
+      other.toSeq.foreach { case (k, v) =>
+        add(k, v)
+      }
+      this
+    }
+
+    def toSeq: Seq[(Any, Long)] = baseMap.toSeq
+
+    def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f)
+
+    def freqItems: Seq[Any] = baseMap.keys.toSeq
+  }
+
+  /**
+   * Finding frequent items for columns, possibly with false positives. Using the
+   * frequent element count algorithm described in
+   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
+   * For Internal use only.
+   *
+   * @param df The input DataFrame
+   * @param cols the names of the columns to search frequent items in
+   * @param support The minimum frequency for an item to be considered `frequent`
+   * @return A Local DataFrame with the Array of frequent items for each column.
+   */
+  private[sql] def singlePassFreqItems(
+      df: DataFrame,
+      cols: Seq[String],
+      support: Double): DataFrame = {
+    if (support < 1e-6) {
--- End diff --
```scala
require(support >= 1e-6, s"support ($support) must be greater than 1e-6.")
```
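For readers following the counting step in the diff above, a standalone sketch may help. It is not the PR's code: the class name `FreqItemCounterSketch` is hypothetical, the comparison operators are assumed to be the ones the flattened diff lost (`<`, `->`, `>`), and the sketch rebuilds the map instead of calling `retain` and `transform` so that it does not depend on a particular Scala collections version.

```scala
import scala.collection.mutable

// Simplified, Spark-free sketch of the counter; `FreqItemCounterSketch` is a hypothetical name.
class FreqItemCounterSketch(size: Int) {
  val baseMap: mutable.Map[Any, Long] = mutable.Map.empty[Any, Long]

  def add(key: Any, count: Long): this.type = {
    if (baseMap.contains(key)) {
      // Known key: accumulate its count.
      baseMap(key) += count
    } else if (baseMap.size < size) {
      // Free slot: start tracking the new key.
      baseMap += key -> count
    } else {
      // No free slot: drop entries whose count does not exceed `count`,
      // then deduct `count` from the survivors. The new key is not stored.
      val survivors = baseMap.toSeq.collect { case (k, v) if v > count => k -> (v - count) }
      baseMap.clear()
      baseMap ++= survivors
    }
    this
  }

  def freqItems: Seq[Any] = baseMap.keys.toSeq
}
```

With `size = 2`, adding "a" three times and then "b" and "c" once each (all with count 1) leaves only "a", with a remaining count of 2: the arrival of "c" finds no free slot, evicts "b", and deducts 1 from "a".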
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29407261
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala ---
@@ -29,12 +29,23 @@ package object analysis {
   /**
    * Resolver should return true if the first string refers to the same entity as the second string.
-   * For example, by using case insensitive equality.
+   * For example, by using case insensitive equality. Besides, Resolver also provides the ability
+   * to normalize the string according to its semantic.
    */
-  type Resolver = (String, String) => Boolean
+  trait Resolver {
+    def apply(a: String, b: String): Boolean
+    def apply(a: String): String
+  }
+
+  val caseInsensitiveResolution = new Resolver {
+    override def apply(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
+    override def apply(a: String): String = a.toLowerCase // as Hive does
--- End diff --
If we want to add this, I think we should call it `normalize`. Maybe change the first `apply` to something else in the future. I'm not sure we need to add this, though. I will let @marmbrus comment on that.
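As a rough picture of what the naming suggestion above could look like, here is a sketch that separates the two overloaded `apply` methods. The names `NameResolver`, `resolves`, and `NameResolvers` are assumptions made for this example; only `normalize` comes from the review comment, and nothing here is claimed to match what was eventually merged.

```scala
// Hypothetical sketch; not the Resolver type that exists in Spark.
trait NameResolver {
  /** True if `a` and `b` refer to the same entity. */
  def resolves(a: String, b: String): Boolean

  /** Canonical spelling of `a` under this resolver's semantics. */
  def normalize(a: String): String
}

object NameResolvers {
  val caseInsensitive: NameResolver = new NameResolver {
    override def resolves(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
    // Lower-casing follows the "as Hive does" comment in the diff.
    override def normalize(a: String): String = a.toLowerCase(java.util.Locale.ROOT)
  }

  val caseSensitive: NameResolver = new NameResolver {
    override def resolves(a: String, b: String): Boolean = a == b
    override def normalize(a: String): String = a
  }
}
```

Giving the two operations distinct names avoids relying on overload resolution between `apply(a, b)` and `apply(a)`, at the cost of no longer being a drop-in replacement for the old function type `(String, String) => Boolean`.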
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29410424
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala ---
@@ -95,7 +95,7 @@ private[ui] class StreamingPage(parent: StreamingTab)
       "Maximum rate\n[events/sec]",
       "Last Error"
     )
-    val dataRows = (0 until listener.numReceivers).map { receiverId =>
--- End diff --
Now every input stream has a unique id (not only receiver-based input streams), so assuming that receiver ids are exactly `0 until listener.numReceivers` will lead to errors.
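The concern above can be illustrated with a small sketch. Everything in it is hypothetical (`ReceiverRows`, the `Map[Int, String]` of receiver names keyed by stream id); it only contrasts iterating over an assumed contiguous id range with iterating over the ids that are actually present.

```scala
// Hypothetical illustration; these helpers are not part of Spark Streaming.
object ReceiverRows {
  /** Old assumption: receiver ids are exactly 0 until numReceivers. */
  def rowsByRange(numReceivers: Int, receivers: Map[Int, String]): Seq[Option[String]] =
    (0 until numReceivers).map(receivers.get)

  /** Alternative: iterate over the ids that actually belong to receivers. */
  def rowsByIds(receivers: Map[Int, String]): Seq[String] =
    receivers.keys.toSeq.sorted.map(receivers)
}
```

If stream ids 0 and 2 belong to receivers while id 1 belongs to a receiver-less input stream, the range-based version looks up ids 0 and 1, finds nothing for id 1, and never visits id 2, whereas the id-based version returns both receivers.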
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97702165 Merged build triggered.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97702224 [Test build #31403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31403/consoleFull) for PR 5798 at commit [`1f0ed92`](https://github.com/apache/spark/commit/1f0ed9236527bf1071f2cc4a5815f5f705f85dc5).
[GitHub] spark pull request: SPARK-7171: Added a method to retrieve metrics...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5805#issuecomment-97713353 [Test build #31406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31406/consoleFull) for PR 5805 at commit [`92aa76f`](https://github.com/apache/spark/commit/92aa76fc559499470595fcd772d750b34d128cc6). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch **adds the following new dependencies:** * `jaxb-api-2.2.7.jar` * `jaxb-core-2.2.7.jar` * `jaxb-impl-2.2.7.jar` * `pmml-agent-1.1.15.jar` * `pmml-model-1.1.15.jar` * `pmml-schema-1.1.15.jar` * This patch **removes the following dependencies:** * `activation-1.1.jar` * `jaxb-api-2.2.2.jar` * `jaxb-impl-2.2.3-1.jar`
[GitHub] spark pull request: SPARK-7171: Added a method to retrieve metrics...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5805#issuecomment-97713362 Merged build finished. Test FAILed.
[GitHub] spark pull request: SPARK-7171: Added a method to retrieve metrics...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5805#issuecomment-97713365 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31406/
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679106 Merged build started.
[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5799#issuecomment-97679097 Merged build triggered.
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680177 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31385/
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680176 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7232] [SQL] Add a Substitution batch fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5776#issuecomment-97680167 [Test build #31385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31385/consoleFull) for PR 5776 at commit [`553005a`](https://github.com/apache/spark/commit/553005a4e9aebcbb42c712efd833118235d205dc). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29406841 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala --- @@ -81,9 +81,13 @@ class HiveResolutionSuite extends HiveComparisonTest { .toDF().registerTempTable(caseSensitivityTest) val query = sql(SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest) -assert(query.schema.fields.map(_.name) === Seq(a, b, A, B, a, b, A, B), +assert(query.schema.fields.map(_.name) === Seq(a, b, a, b, a, b, a, b), The output schema did not preserve the case of the query.) --- End diff -- In Hive ``` hive create table ddDD as select Key, valUe from src; hive desc extended ; OK key string value string Detailed Table Information Table(tableName:, dbName:default, owner:hcheng, createTime:1430368423, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:string, comment:null), FieldSchema(name:value, type:string, comment:null)], location:file:/home/hcheng/warehouse/, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1430368423, numRows=0, totalSize=5824, rawDataSize=0}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE) Time taken: 0.111 seconds, Fetched: 4 row(s) ``` You will see that both the table name and the column names are normalized (to lower case), so I think preserving the case is probably not necessary (the normalized name is what we want, isn't it?)
[GitHub] spark pull request: [SPARK-7112][Streaming] Add a DirectStreamTrac...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97692199 Build started.
[GitHub] spark pull request: [SPARK-7112][Streaming] Add a DirectStreamTrac...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97692178 Build triggered.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/5680#issuecomment-97694754 There are merge conflicts! Please merge master!
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97695946 [Test build #31388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31388/consoleFull) for PR 5392 at commit [`72304f0`](https://github.com/apache/spark/commit/72304f0150e74eb6432fc3141d3d5bc71bb93d61). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class Heartbeat(workerId: String, worker: RpcEndpointRef) extends DeployMessage` * ` case class RegisteredWorker(master: RpcEndpointRef, masterWebUiUrl: String) extends DeployMessage` * ` case class RegisterApplication(appDescription: ApplicationDescription, driver: RpcEndpointRef)` * ` case class RegisteredApplication(appId: String, master: RpcEndpointRef) extends DeployMessage` * ` case class MasterChanged(master: RpcEndpointRef, masterWebUiUrl: String)` * This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-6602][Core] Update Master, Worker, Clie...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5392#issuecomment-97695958 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29409167 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/InputInfoTracker.scala --- @@ -0,0 +1,74 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.streaming.scheduler + +import scala.collection.mutable + +import org.apache.spark.Logging +import org.apache.spark.streaming.{Time, StreamingContext} + +/** To track the information of input stream at specified batch time. */ +case class InputInfo(batchTime: Time, inputStreamId: Int, numRecords: Long) + +/** + * This class manages all the input streams as well as their input data statistics. The information + * will output to StreamingListener to better monitoring. + */ +private[streaming] class InputInfoTracker(ssc: StreamingContext) extends Logging { + + /** Track all the input streams registered in DStreamGraph */ + val inputStreams = ssc.graph.getInputStreams() + /** Track all the id of input streams registered in DStreamGraph */ + val inputStreamIds = inputStreams.map(_.id) + + // Map to track all the InputInfo related to specific batch time and input stream. 
+ private val batchTimeToInputInfos = new mutable.HashMap[Time, mutable.HashMap[Int, InputInfo]] + + /** Report the input information with batch time to the tracker */ + def reportInfo(batchTime: Time, inputInfo: InputInfo): Unit = synchronized { +val inputInfos = batchTimeToInputInfos.getOrElseUpdate(batchTime, + new mutable.HashMap[Int, InputInfo]()) + +if (inputInfos.contains(inputInfo.inputStreamId)) { + throw new IllegalStateException(sInput stream ${inputInfo.inputStreamId}} for batch + +s$batchTime is already added into InputInfoTracker, this is a illegal state) +} +inputInfos += ((inputInfo.inputStreamId, inputInfo)) + } + + /** Get the all the input stream's information of specified batch time */ + def getInfo(batchTime: Time): Map[Int, InputInfo] = synchronized { +val inputInfos = batchTimeToInputInfos.get(batchTime) +// Convert mutable HashMap to immutable Map for the caller +inputInfos.map(_.toMap).getOrElse(Map[Int, InputInfo]()) + } + + /** Get the input information of specified batch time and input stream id */ + def getInfoOfBatchAndStream(batchTime: Time, inputStreamId: Int --- End diff -- Yes, it is only used for tests; I can remove it if necessary.
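The test-only lookup discussed in this comment could be expressed through `getInfo` instead of living on the tracker itself. A minimal, self-contained sketch, assuming a simplified tracker in which the batch time is a plain `Long` rather than Spark's `Time`; `SimpleInputInfoTracker`, `TrackerTestHelpers` and `infoOf` are hypothetical names and are not part of the PR:

```scala
import scala.collection.mutable

// Simplified stand-in for the PR's InputInfo; batchTime is a Long here (assumption).
case class InputInfo(batchTime: Long, inputStreamId: Int, numRecords: Long)

// Hypothetical, stripped-down tracker that mirrors reportInfo/getInfo from the diff.
class SimpleInputInfoTracker {
  private val batchTimeToInputInfos =
    new mutable.HashMap[Long, mutable.HashMap[Int, InputInfo]]

  /** Record the info for one stream in one batch; a second report for the same pair is rejected. */
  def reportInfo(batchTime: Long, inputInfo: InputInfo): Unit = synchronized {
    val inputInfos = batchTimeToInputInfos.getOrElseUpdate(
      batchTime, new mutable.HashMap[Int, InputInfo]())
    if (inputInfos.contains(inputInfo.inputStreamId)) {
      throw new IllegalStateException(
        s"Input stream ${inputInfo.inputStreamId} for batch $batchTime was already reported")
    }
    inputInfos += ((inputInfo.inputStreamId, inputInfo))
  }

  /** Immutable snapshot of all stream infos for a batch (empty if the batch is unknown). */
  def getInfo(batchTime: Long): Map[Int, InputInfo] = synchronized {
    batchTimeToInputInfos.get(batchTime).map(_.toMap).getOrElse(Map.empty[Int, InputInfo])
  }
}

object TrackerTestHelpers {
  // Hypothetical test-side helper: the per-stream lookup derived from getInfo.
  def infoOf(tracker: SimpleInputInfoTracker, batchTime: Long, streamId: Int): Option[InputInfo] =
    tracker.getInfo(batchTime).get(streamId)
}
```

With a helper like this kept in the test sources, the production class would only need `reportInfo` and `getInfo`; whether that trade-off suits the PR is for the reviewers to decide.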
[GitHub] spark pull request: [SPARK-7196][SQL] Support precision and scale ...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/5777#issuecomment-97696066 @rxin ok.
[GitHub] spark pull request: [SPARK-7112][Streaming][WIP] Add a InputInfoTr...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/5680#discussion_r29409159 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala --- @@ -135,28 +132,25 @@ private[streaming] class StreamingJobProgressListener(ssc: StreamingContext) def receivedRecordsDistributions: Map[Int, Option[Distribution]] = synchronized { val latestBatchInfos = retainedBatches.reverse.take(batchInfoLimit) -val latestBlockInfos = latestBatchInfos.map(_.receivedBlockInfo) -(0 until numReceivers).map { receiverId = - val blockInfoOfParticularReceiver = latestBlockInfos.map { batchInfo = -batchInfo.get(receiverId).getOrElse(Array.empty) - } - val recordsOfParticularReceiver = blockInfoOfParticularReceiver.map { blockInfo = - // calculate records per second for each batch -blockInfo.map(_.numRecords).sum.toDouble * 1000 / batchDuration - } - val distributionOption = Distribution(recordsOfParticularReceiver) - (receiverId, distributionOption) + +// TODO. this should be fixed when receiver-less input stream is mixed into BatchInfo --- End diff -- What does this TODO mean?
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97701616 Thank you for the comments. I've updated the code to preserve the attribute name. Attribute name normalization still seems to require some discussion, so let's leave it as a future improvement.
[GitHub] spark pull request: SPARK-7171: Added a method to retrieve metrics...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5805#issuecomment-97706740 Merged build triggered.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/5798#discussion_r29411449 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveResolutionSuite.scala --- @@ -81,9 +81,11 @@ class HiveResolutionSuite extends HiveComparisonTest { .toDF().registerTempTable(caseSensitivityTest) val query = sql(SELECT a, b, A, B, n.a, n.b, n.A, n.B FROM caseSensitivityTest) -assert(query.schema.fields.map(_.name) === Seq(a, b, A, B, a, b, A, B), +assert(query.schema.fields.map(_.name) === Seq(a, B, a, B, a, b, A, B), --- End diff -- I'm not sure what we really want here. When a user runs `SELECT b FROM t` and `t` has a column `B`, which name should we use in the result schema: `b` or `B`? cc @marmbrus
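The choice being debated can be made concrete with a small sketch. It assumes a toy resolver (not Spark's analyzer) that matches a requested column name against a table's columns case-insensitively and then picks one of three candidate spellings for the output schema: the one typed in the query, the one declared by the table, or a lower-cased normalization like the Hive output quoted earlier in the thread. `CaseResolutionSketch`, `NamePolicy`, `resolve` and the policy names are all hypothetical:

```scala
object CaseResolutionSketch {
  sealed trait NamePolicy
  case object AsQueried extends NamePolicy   // keep the spelling used in the query
  case object AsDeclared extends NamePolicy  // keep the spelling declared by the table
  case object Lowercased extends NamePolicy  // normalize to lower case

  /** Case-insensitive lookup; returns the output name, or None if no column matches. */
  def resolve(
      requested: String,
      tableColumns: Seq[String],
      policy: NamePolicy): Option[String] =
    tableColumns.find(_.equalsIgnoreCase(requested)).map { declared =>
      policy match {
        case AsQueried  => requested
        case AsDeclared => declared
        case Lowercased => declared.toLowerCase(java.util.Locale.ROOT)
      }
    }
}
```

For `SELECT b FROM t` where `t` declares a column `B`, the first policy yields `b`, the second `B`, and the third `b`; the thread has not settled on which behaviour is wanted.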
[GitHub] spark pull request: SPARK-7171: Added a method to retrieve metrics...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5805#issuecomment-97706828 [Test build #31406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31406/consoleFull) for PR 5805 at commit [`92aa76f`](https://github.com/apache/spark/commit/92aa76fc559499470595fcd772d750b34d128cc6).
[GitHub] spark pull request: [SPARK-6913][SQL] Fixed java.sql.SQLException...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5782#issuecomment-97715156 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7269] [SQL] Incorrect analysis for aggr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5798#issuecomment-97679300 Merged build finished. Test PASSed.