[GitHub] spark pull request: [Build] Uploads HiveCompatibilitySuite logs
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/2993 [Build] Uploads HiveCompatibilitySuite logs In addition to unit-tests.log files, also upload failure output files generated by `HiveCompatibilitySuite` to Jenkins master. These files can be very helpful to debug Hive compatibility test failures. /cc @pwendell @marmbrus You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark upload-hive-compat-logs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2993.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2993 commit 8e6247fd410ee55560e76a4387671a48d68edcd5 Author: Cheng Lian l...@databricks.com Date: 2014-10-29T03:47:31Z Uploads HiveCompatibilitySuite logs --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2944#issuecomment-60877916 [Test build #22430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22430/consoleFull) for PR 2944 at commit [`f4ac1c1`](https://github.com/apache/spark/commit/f4ac1c1099019cc43525508545452e22eb677a70). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2944#issuecomment-60877993 I've rewritten this patch so that thread dumps are triggered on-demand using a new driver-to-executor RPC channel. There are a few hacks involved in setting this up, mostly to work around some limitations of our current executor registration mechanisms without resorting to heavy refactoring. Please take a look and let me know what you think. I'm going to leave a couple of comments on the diff to help explain the hackier parts of these changes. Also, we still might want to enable compression of the stacktrace RPCs, but I'll leave that to a separate PR (this would be a good starter JIRA task, BTW).
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2944#issuecomment-60878081 [Test build #22430 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22430/consoleFull) for PR 2944 at commit [`f4ac1c1`](https://github.com/apache/spark/commit/f4ac1c1099019cc43525508545452e22eb677a70). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster` * `class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump")`
[GitHub] spark pull request: [Build] Uploads HiveCompatibilitySuite logs
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2993#issuecomment-60878064 [Test build #22431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22431/consoleFull) for PR 2993 at commit [`8e6247f`](https://github.com/apache/spark/commit/8e6247fd410ee55560e76a4387671a48d68edcd5). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2944#issuecomment-60878083 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22430/ Test FAILed.
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2944#discussion_r19521370 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala --- @@ -412,6 +415,17 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus Seq.empty } } + + private def getActorSystemHostPortForExecutor(executorId: String): Option[(String, Int)] = { --- End diff -- This verbosely-named method is the big hack for enabling driver-to-executor RPC. Basically, I needed to have a way to address the remote `ExecutorActor` from the driver. Here, we rely on the fact that every executor registers a BlockManager actor with the BlockManagerMasterActor and that there is only one actor system per executor / driver.
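The addressing trick described in the comment above can be sketched outside Spark. This is a hypothetical illustration; the object name, the `sparkExecutor` actor-system name, and the `ExecutorActor` registration name are assumptions for the sketch, not confirmed Spark internals. The key idea is that, with one actor system per executor, a host/port pair recovered from the BlockManager registration is enough to build a remote actor URL:

```scala
// Hypothetical sketch: build the Akka-style URL used to address a remote
// executor's actor, given the host/port recovered from its BlockManager
// registration. All names here are illustrative assumptions.
object ExecutorActorAddress {
  // One actor system per executor, so host:port identifies it uniquely;
  // the thread-dump actor is assumed registered under a well-known name.
  def actorUrl(host: String, port: Int): String =
    s"akka.tcp://sparkExecutor@$host:$port/user/ExecutorActor"
}
```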
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2992#issuecomment-60878201 [Test build #22432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22432/consoleFull) for PR 2992 at commit [`2b5e882`](https://github.com/apache/spark/commit/2b5e8828a6db72adee10cfbdc71f07d372f43f90). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/2841#discussion_r19521440 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala --- @@ -423,10 +436,8 @@ private[parquet] class FilteringParquetRowInputFormat configuration: Configuration, footers: JList[Footer]): JList[ParquetInputSplit] = { -import FilteringParquetRowInputFormat.blockLocationCache - -val cacheMetadata = configuration.getBoolean(SQLConf.PARQUET_CACHE_METADATA, false) - +// Use task side strategy by default +val taskSideMetaData = configuration.getBoolean(ParquetInputFormat.TASK_SIDE_METADATA, true) --- End diff -- Yes, in Parquet the client-side metadata strategy is the default one, but as I mentioned earlier, don't we want the task-side strategy by default due to the inherent advantage of lower memory usage on the client side?
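The default-selection logic being debated reduces to a one-line config read. A minimal sketch, using a plain `Map` in place of a Hadoop `Configuration`; the key string mirrors parquet-mr's `parquet.task.side.metadata` but should be treated as an assumption here:

```scala
// Sketch: choose task-side metadata unless the user explicitly disables it.
// Defaulting to true gives the lower client-side memory usage argued for above.
object MetadataStrategy {
  def useTaskSideMetadata(conf: Map[String, String]): Boolean =
    conf.get("parquet.task.side.metadata").map(_.toBoolean).getOrElse(true)
}
```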
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2944#discussion_r19521449 --- Diff: core/src/main/scala/org/apache/spark/ui/exec/ThreadDumpPage.scala --- @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ui.exec + +import javax.servlet.http.HttpServletRequest + +import scala.util.Try +import scala.xml.{Text, Node} + +import org.apache.spark.ui.{UIUtils, WebUIPage} + +class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump") { + + private val sc = parent.sc + + def render(request: HttpServletRequest): Seq[Node] = { +val executorId = Option(request.getParameter("executorId")).getOrElse { + return Text(s"Missing executorId parameter") +} +val time = System.currentTimeMillis() +val maybeThreadDump = Try(sc.get.getExecutorThreadDump(executorId)) --- End diff -- This is a blocking call. From a high-performance HTTP server standpoint, this is probably a bad idea; it might improve throughput to handle this in some sort of future / continuation so that we don't starve the request handling threadpool while waiting on a remote RPC.
On the other hand, I don't think that we should over-engineer things right now; we can always move towards a fancier request handling strategy if we discover that we need it.
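The future/continuation idea mentioned above can be sketched as follows, with a stand-in `fetchDump` function in place of the real `getExecutorThreadDump` RPC. This only offloads the blocking call to a separate pool and bounds the wait; it is not a fully asynchronous handler:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object NonBlockingRender {
  // Run the blocking RPC off the request-handling thread and bound the wait,
  // so a slow executor cannot hold a servlet thread indefinitely.
  def render(fetchDump: () => String, timeout: FiniteDuration): Option[String] = {
    val dump = Future(fetchDump())
    try Some(Await.result(dump, timeout))
    catch { case _: Exception => None }
  }
}
```

A truly non-blocking design would complete the HTTP response from the future's callback instead of awaiting, which is the "fancier request handling strategy" the comment defers.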
[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2753#issuecomment-60878626 [Test build #22427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22427/consoleFull) for PR 2753 at commit [`cadfd28`](https://github.com/apache/spark/commit/cadfd28f116f0dbca11e580a23caf82060bcf922). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2753#issuecomment-60878629 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22427/ Test FAILed.
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2944#issuecomment-60878689 [Test build #22433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22433/consoleFull) for PR 2944 at commit [`bc1e675`](https://github.com/apache/spark/commit/bc1e675f0204e6f111cf7ec49cdb318150ecef54). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4109][CORE] Correctly deserialize Task....
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2971#discussion_r19521525 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala --- @@ -128,7 +128,7 @@ private[spark] class ShuffleMapTask( } override def readExternal(in: ObjectInput) { --- End diff -- I guess I missed this case in my PR.
[GitHub] spark pull request: [SPARK-4109][CORE] Correctly deserialize Task....
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2971#discussion_r19521513 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala --- @@ -128,7 +128,7 @@ private[spark] class ShuffleMapTask( } override def readExternal(in: ObjectInput) { --- End diff -- As long as you're modifying this code, mind tossing a `Utils.tryOrIOException` here so that any errors that occur here are reported properly? See #2932 for explanation / context.
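For context, a helper like the `Utils.tryOrIOException` requested above can be sketched as follows (a simplified illustration of the idea, not copied from Spark): `readExternal` is declared to throw `IOException`, so wrapping other throwables lets Java serialization surface them instead of obscuring them.

```scala
import java.io.IOException

object SerializationUtil {
  // Rethrow IOExceptions as-is; wrap anything else so callers of
  // readExternal/writeExternal see a properly reported IOException.
  def tryOrIOException[T](block: => T): T =
    try block
    catch {
      case e: IOException => throw e
      case e: Throwable   => throw new IOException(e)
    }
}
```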
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
GitHub user harishreedharan opened a pull request: https://github.com/apache/spark/pull/2994 [SPARK-4122][STREAMING] Add a library that can write data back to Kafka ... ...from Spark Streaming. This adds a library that can write dstreams to Kafka. An implicit has also been added so users can call dstream.writeToKafka(..) You can merge this pull request into a Git repository by running: $ git pull https://github.com/harishreedharan/spark Kafka-output Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2994.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2994 commit f61d82f3152f8e6cf758fa349c8198289e0deae8 Author: Hari Shreedharan hshreedha...@apache.org Date: 2014-10-29T06:06:33Z [SPARK-4122][STREAMING] Add a library that can write data back to Kafka from Spark Streaming. This adds a library that can write dstreams to Kafka. An implicit has also been added so users can call dstream.writeToKafka(..)
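The `dstream.writeToKafka(..)` call described above is presumably added via Scala's implicit-enrichment pattern. A self-contained sketch of that pattern, with a toy `Record` type and a `send` function standing in for DStream and a Kafka producer; none of these names come from the PR itself:

```scala
object KafkaWriteSketch {
  final case class Record(value: String)

  // Enrich Seq[Record] with writeToKafka without modifying the type itself;
  // the real PR does the analogous enrichment for DStream.
  implicit class KafkaWriterOps(records: Seq[Record]) {
    def writeToKafka(send: String => Unit): Unit =
      records.foreach(r => send(r.value))
  }
}
```

Bringing `KafkaWriteSketch._` into scope makes `someRecords.writeToKafka(...)` compile as if the method were defined on the sequence, which is how a library can bolt output operations onto an existing stream type.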
[GitHub] spark pull request: [SPARK-4062][Streaming]Add ReliableKafkaReceiv...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2991#issuecomment-60878843 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22425/ Test FAILed.
[GitHub] spark pull request: [SPARK-4062][Streaming]Add ReliableKafkaReceiv...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2991#issuecomment-60878841 [Test build #22425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22425/consoleFull) for PR 2991 at commit [`5cc4cb1`](https://github.com/apache/spark/commit/5cc4cb198662cb35008d9a2e46e320f75ce35a71). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2944#issuecomment-60878899 [Test build #22433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22433/consoleFull) for PR 2944 at commit [`bc1e675`](https://github.com/apache/spark/commit/bc1e675f0204e6f111cf7ec49cdb318150ecef54). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class GetActorSystemHostPortForExecutor(executorId: String) extends ToBlockManagerMaster` * `class ThreadDumpPage(parent: ExecutorsTab) extends WebUIPage("threadDump")`
[GitHub] spark pull request: [SPARK-611] [WIP] Display executor thread dump...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2944#issuecomment-60878901 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22433/ Test FAILed.
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2994#issuecomment-60879163 [Test build #22434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22434/consoleFull) for PR 2994 at commit [`f61d82f`](https://github.com/apache/spark/commit/f61d82f3152f8e6cf758fa349c8198289e0deae8). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2994#issuecomment-60879456 [Test build #22435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22435/consoleFull) for PR 2994 at commit [`372c749`](https://github.com/apache/spark/commit/372c749458e22ba1a9acd2badbfad51e4dda3968). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/2841#discussion_r19521851 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala --- @@ -460,29 +515,85 @@ private[parquet] class FilteringParquetRowInputFormat val status = fileStatuses.getOrElse(file, fs.getFileStatus(file)) val parquetMetaData = footer.getParquetMetadata val blocks = parquetMetaData.getBlocks - var blockLocations: Array[BlockLocation] = null - if (!cacheMetadata) { -blockLocations = fs.getFileBlockLocations(status, 0, status.getLen) - } else { -blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] { - def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen) -}) - } + totalRowGroups = totalRowGroups + blocks.size + val filteredBlocks = RowGroupFilter.filterRowGroups( +filter, +blocks, +parquetMetaData.getFileMetaData.getSchema) + rowGroupsDropped = rowGroupsDropped + (blocks.size - filteredBlocks.size) + + if (!filteredBlocks.isEmpty){ + var blockLocations: Array[BlockLocation] = null + if (!cacheMetadata) { +blockLocations = fs.getFileBlockLocations(status, 0, status.getLen) + } else { +blockLocations = blockLocationCache.get(status, new Callable[Array[BlockLocation]] { + def call(): Array[BlockLocation] = fs.getFileBlockLocations(status, 0, status.getLen) +}) + } + splits.addAll( +generateSplits.invoke( + null, + filteredBlocks, + blockLocations, + status, + readContext.getRequestedSchema.toString, + readContext.getReadSupportMetadata, + minSplitSize, + maxSplitSize).asInstanceOf[JList[ParquetInputSplit]]) +} +} +
+if (rowGroupsDropped > 0 && totalRowGroups > 0){ + val percentDropped = ((rowGroupsDropped/totalRowGroups.toDouble) * 100).toInt + logInfo(s"Dropping $rowGroupsDropped row groups that do not pass filter predicate " + s"($percentDropped %) !") +} +else { + logInfo("There were no row groups that could be dropped due to filter predicates") +} +splits + + } + + def getTaskSideSplits( +configuration: Configuration, +footers: JList[Footer], +maxSplitSize: JLong, +minSplitSize: JLong, +readContext: ReadContext): JList[ParquetInputSplit] = { + +val splits = mutable.ArrayBuffer.empty[ParquetInputSplit] + +// Ugly hack, stuck with it until PR: +// https://github.com/apache/incubator-parquet-mr/pull/17 +// is resolved +val generateSplits = + Class.forName("parquet.hadoop.TaskSideMetadataSplitStrategy") + .getDeclaredMethods.find(_.getName == "generateTaskSideMDSplits").getOrElse( + sys.error( + "Failed to reflectively invoke TaskSideMetadataSplitStrategy.generateTaskSideMDSplits")) +generateSplits.setAccessible(true) + +for (footer <- footers) { + val file = footer.getFile + val fs = file.getFileSystem(configuration) + val status = fileStatuses.getOrElse(file, fs.getFileStatus(file)) + val blockLocations = fs.getFileBlockLocations(status, 0, status.getLen) splits.addAll( --- End diff -- I am not sure I follow here; if globalMetadata is non-null, it means there is data, and hence splits would be generated by the generateSplits function, which takes the HDFS blocks to process as an argument? The generated splits are added to the splits to be returned later. splits.addAll( generateSplits.invoke( null, filteredBlocks, blockLocations, status, readContext.getRequestedSchema.toString, readContext.getReadSupportMetadata, minSplitSize, maxSplitSize).asInstanceOf[JList[ParquetInputSplit]]) }
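The reflective workaround in the diff above follows a general pattern: look up a method by name at runtime and force accessibility before invoking it. A generic sketch of that pattern, demonstrated here against `java.lang.String` rather than the Parquet class, purely for illustration:

```scala
object ReflectiveInvoke {
  // Find a method by name on a class and make it callable even if not public.
  // Fails loudly (like the sys.error in the diff) when the method is absent.
  def findMethod(className: String, methodName: String): java.lang.reflect.Method = {
    val m = Class.forName(className).getDeclaredMethods
      .find(_.getName == methodName)
      .getOrElse(sys.error(s"No method $methodName on $className"))
    m.setAccessible(true)
    m
  }
}
```

The diff pays this reflection cost only because the upstream parquet-mr method was not yet publicly accessible; once the linked PR lands, a direct call replaces it.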
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/2841#issuecomment-60880619 Hi @mateiz, thanks for the suggestions, just a few points: 1. Need to know which strategy to keep as the default (currently we use a different one than the default in the parquet library). 2. This PR adds support for using the filter2 api from the parquet library, which supports row group filtering. Do we need to add tests to ensure that? Such test cases already exist in the parquet library: https://github.com/Parquet/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/test/java/parquet/filter2/compat/TestRowGroupFilter.java
[GitHub] spark pull request: [SPARK-3795] Heuristics for dynamically scalin...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2746#issuecomment-60880622 LGTM
[GitHub] spark pull request: [SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60880659 [Test build #22436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22436/consoleFull) for PR 2615 at commit [`95c2e8e`](https://github.com/apache/spark/commit/95c2e8e86f69f3b1d11aad04f6ad14b0ede1950a). * This patch merges cleanly.
[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60880665 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22428/ Test PASSed.
[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60880661 [Test build #22428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22428/consoleFull) for PR 2942 at commit [`9f7aea9`](https://github.com/apache/spark/commit/9f7aea9eac3c64f646d1783909e0e2d155663399). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class StreamingKMeansModel(` * `class StreamingKMeans(`
[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2753#issuecomment-60880960 [Test build #22437 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22437/consoleFull) for PR 2753 at commit [`cadfd28`](https://github.com/apache/spark/commit/cadfd28f116f0dbca11e580a23caf82060bcf922). * This patch merges cleanly.
[GitHub] spark pull request: [WIP][SQL][SPARK-3839] Reimplement Left/Right ...
Github user Ishiihara commented on the pull request: https://github.com/apache/spark/pull/2723#issuecomment-60881119 @marmbrus All test failures have the same pattern: `select * from a right outer join b on condition1 join c on condition2`. With the extra join operation, the iterator of `a join b` somehow does not output all rows of the join result. However, `select * from a right outer join b on a.key = b.key` on its own returns the correct result. I am currently investigating the cause of the failures; please let me know if you have any ideas. Thanks!
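For reference, the semantics the failing tests check can be sketched with a naive right outer join (a hypothetical standalone sketch, not Spark SQL's hash-join implementation): every right-side row must appear at least once, whether or not a further join consumes the result.

```python
def right_outer_join(left, right, left_key, right_key):
    """Naive right outer join over lists of tuples.

    `left_key`/`right_key` extract the join keys; unmatched right-side
    rows pair with None instead of being dropped.
    """
    out = []
    for r in right:
        matches = [l for l in left if left_key(l) == right_key(r)]
        if matches:
            out.extend((l, r) for l in matches)
        else:
            out.append((None, r))
    return out

# a RIGHT OUTER JOIN b: the unmatched b-row (2, "bb") must survive,
# even when a second join consumes this iterator afterwards.
a = [(1, "a")]
b = [(1, "b"), (2, "bb")]
joined = right_outer_join(a, b, lambda l: l[0], lambda r: r[0])
```

The bug report is exactly that the chained-join case loses some of these rows, while the standalone right outer join does not.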
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2992#issuecomment-60881601 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22429/ Test PASSed.
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2992#issuecomment-60881653 [Test build #22432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22432/consoleFull) for PR 2992 at commit [`2b5e882`](https://github.com/apache/spark/commit/2b5e8828a6db72adee10cfbdc71f07d372f43f90). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DeferredObjectAdapter(oi: ObjectInspector) extends DeferredObject`
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2992#issuecomment-60881594 [Test build #22429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22429/consoleFull) for PR 2992 at commit [`ebe3e74`](https://github.com/apache/spark/commit/ebe3e74df70eb424aecc3170fc55008cfb6a76ec). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2992#issuecomment-60881656 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22432/ Test FAILed.
[GitHub] spark pull request: [SPARK-4062][Streaming]Add ReliableKafkaReceiv...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2991#issuecomment-60881849 [Test build #22438 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22438/consoleFull) for PR 2991 at commit [`09d57c5`](https://github.com/apache/spark/commit/09d57c5270cf876e52d49db702e2330c2b6a6e10). * This patch merges cleanly.
[GitHub] spark pull request: [Build] Uploads HiveCompatibilitySuite logs
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2993#issuecomment-60882504 [Test build #22431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22431/consoleFull) for PR 2993 at commit [`8e6247f`](https://github.com/apache/spark/commit/8e6247fd410ee55560e76a4387671a48d68edcd5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [Build] Uploads HiveCompatibilitySuite logs
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2993#issuecomment-60882509 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22431/ Test PASSed.
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on a diff in the pull request: https://github.com/apache/spark/pull/2841#discussion_r19522625

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala ---
@@ -209,25 +221,25 @@ private[sql] object ParquetFilters {
        case _ => None
      }
    }
-    case p @ EqualTo(left: Literal, right: NamedExpression) if !right.nullable =>
+    case p @ EqualTo(left: Literal, right: NamedExpression) =>
      Some(createEqualityFilter(right.name, left, p))
-    case p @ EqualTo(left: NamedExpression, right: Literal) if !left.nullable =>
+    case p @ EqualTo(left: NamedExpression, right: Literal) =>
      Some(createEqualityFilter(left.name, right, p))
-    case p @ LessThan(left: Literal, right: NamedExpression) if !right.nullable =>
+    case p @ LessThan(left: Literal, right: NamedExpression) =>
      Some(createLessThanFilter(right.name, left, p))
-    case p @ LessThan(left: NamedExpression, right: Literal) if !left.nullable =>
+    case p @ LessThan(left: NamedExpression, right: Literal) =>
      Some(createLessThanFilter(left.name, right, p))
-    case p @ LessThanOrEqual(left: Literal, right: NamedExpression) if !right.nullable =>
+    case p @ LessThanOrEqual(left: Literal, right: NamedExpression) =>
      Some(createLessThanOrEqualFilter(right.name, left, p))
-    case p @ LessThanOrEqual(left: NamedExpression, right: Literal) if !left.nullable =>
+    case p @ LessThanOrEqual(left: NamedExpression, right: Literal) =>
      Some(createLessThanOrEqualFilter(left.name, right, p))
-    case p @ GreaterThan(left: Literal, right: NamedExpression) if !right.nullable =>
+    case p @ GreaterThan(left: Literal, right: NamedExpression) =>
      Some(createGreaterThanFilter(right.name, left, p))
-    case p @ GreaterThan(left: NamedExpression, right: Literal) if !left.nullable =>
+    case p @ GreaterThan(left: NamedExpression, right: Literal) =>
      Some(createGreaterThanFilter(left.name, right, p))
-    case p @ GreaterThanOrEqual(left: Literal, right: NamedExpression) if !right.nullable =>
+    case p @ GreaterThanOrEqual(left: Literal, right: NamedExpression) =>
      Some(createGreaterThanOrEqualFilter(right.name, left, p))
-    case p @ GreaterThanOrEqual(left: NamedExpression, right: Literal) if !left.nullable =>
+    case p @ GreaterThanOrEqual(left: NamedExpression, right: Literal) =>
--- End diff --

The nullable flag is set when the field is optional, so I'm adding tests for those cases.
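The diff above drops the `!nullable` guards, so comparisons on optional columns are also pushed down as Parquet filters. The shape of that translation (a hypothetical, simplified sketch in Python; not the actual Catalyst or parquet-mr API) is a case analysis on which side of the comparison holds the named column:

```python
from dataclasses import dataclass

@dataclass
class Column:
    """Stand-in for a named expression; `nullable` marks optional fields."""
    name: str
    nullable: bool = False

@dataclass
class EqualTo:
    """Stand-in for an equality predicate between a column and a literal."""
    left: object
    right: object

def to_filter(pred):
    """Translate a predicate into an (op, column, literal) filter triple.

    As in the patch, there is no `not column.nullable` guard here, so
    predicates on optional (nullable) columns are translated as well.
    """
    if isinstance(pred, EqualTo):
        if isinstance(pred.left, Column):
            return ("eq", pred.left.name, pred.right)
        if isinstance(pred.right, Column):
            return ("eq", pred.right.name, pred.left)
    return None  # unsupported: leave the predicate to be evaluated after the scan
```

Returning `None` for unsupported predicates mirrors the `case _ => None` fallback in the Scala code: anything not pushed down is simply evaluated after the scan.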
[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2995 [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API Create several helper functions that call the MLlib Java API, converting arguments to Java types and return values to Python objects automatically; this greatly simplifies serialization in the MLlib Python API. After this change, the MLlib Python API no longer needs to deal with serialization details, which makes it easier to add new APIs. cc @mengxr You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2995.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2995 commit 731331fdafe9ce6e4bf24dc1e6667942e1e59587 Author: Davies Liu dav...@databricks.com Date: 2014-10-29T07:19:33Z simplify serialization in MLlib Python API
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2992#issuecomment-60883389 [Test build #22440 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22440/consoleFull) for PR 2992 at commit [`b99db6c`](https://github.com/apache/spark/commit/b99db6caa0a5f2d6e69d5940b5c37e88914c5e36). * This patch merges cleanly.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60883548 [Test build #22441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22441/consoleFull) for PR 2615 at commit [`df2b19e`](https://github.com/apache/spark/commit/df2b19e1ef5bd462c02681d59d4fa4422c944ce4). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2995#issuecomment-60883555 [Test build #22439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22439/consoleFull) for PR 2995 at commit [`731331f`](https://github.com/apache/spark/commit/731331fdafe9ce6e4bf24dc1e6667942e1e59587). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2994#issuecomment-60883762 [Test build #22434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22434/consoleFull) for PR 2994 at commit [`f61d82f`](https://github.com/apache/spark/commit/f61d82f3152f8e6cf758fa349c8198289e0deae8). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KafkaWriter[T: ClassTag](@transient dstream: DStream[T]) `
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2994#issuecomment-60883766 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22434/ Test PASSed.
[GitHub] spark pull request: SPARK-4111 [MLlib] add regression metrics
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/2978#issuecomment-60883826 Renamed parameters and functions to be consistent with Spark naming rules. Deleted unused columns and made prediction the first column. Added an explanation and reference for r2Score and explained variance. Also made other code style fixes.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60883856 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22441/ Test FAILed.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60883853 [Test build #22441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22441/consoleFull) for PR 2615 at commit [`df2b19e`](https://github.com/apache/spark/commit/df2b19e1ef5bd462c02681d59d4fa4422c944ce4). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/2983#issuecomment-60884047 Jenkins, test this please
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2994#issuecomment-60884111 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22435/ Test PASSed.
[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...
GitHub user jegonzal opened a pull request: https://github.com/apache/spark/pull/2996 [SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace This simple patch filters out extra whitespace entries. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jegonzal/spark loadLibSVM Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2996.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2996 commit e028e8443ff38e3617a1bdd0a2a3f5ec9b42d980 Author: Joseph E. Gonzalez joseph.e.gonza...@gmail.com Date: 2014-10-29T07:13:56Z fixing whitespace bug in loadLibSVMFile when parsing libSVM files
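The bug being fixed: splitting a libSVM line on single spaces yields empty tokens whenever the line contains runs of whitespace, and those empty strings then fail to parse as `index:value` pairs. A minimal sketch of the filtering (a hypothetical standalone parser, not the actual `loadLibSVMFile` code):

```python
def parse_libsvm_line(line):
    """Parse one "label idx:val idx:val ..." line, tolerating extra whitespace.

    Filtering out empty tokens mirrors the patch: without it,
    "1  2:0.5" would yield an empty string between the label and "2:0.5".
    """
    tokens = [t for t in line.strip().split(" ") if t]
    label = float(tokens[0])
    indices, values = [], []
    for pair in tokens[1:]:
        idx, val = pair.split(":")
        indices.append(int(idx) - 1)  # libSVM indices are 1-based
        values.append(float(val))
    return label, indices, values
```

With the filter in place, extra spaces between features are harmless; without it, `float("")` or `"".split(":")` raises on the empty tokens.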
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2994#issuecomment-60884103 [Test build #22435 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22435/consoleFull) for PR 2994 at commit [`372c749`](https://github.com/apache/spark/commit/372c749458e22ba1a9acd2badbfad51e4dda3968). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KafkaWriter[T: ClassTag](@transient dstream: DStream[T]) `
[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2996#issuecomment-60884323 [Test build #22442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22442/consoleFull) for PR 2996 at commit [`e028e84`](https://github.com/apache/spark/commit/e028e8443ff38e3617a1bdd0a2a3f5ec9b42d980). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2983#issuecomment-60884309 [Test build #22443 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22443/consoleFull) for PR 2983 at commit [`4ca62cd`](https://github.com/apache/spark/commit/4ca62cd306d96890a7a56da04710ae5548715c4a). * This patch merges cleanly.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60884698 [Test build #22444 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22444/consoleFull) for PR 2615 at commit [`472bbcf`](https://github.com/apache/spark/commit/472bbcfe5082920ac97bc2e29faeae78764141c7). * This patch merges cleanly.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60885003 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22444/ Test FAILed.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60885001 [Test build #22444 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22444/consoleFull) for PR 2615 at commit [`472bbcf`](https://github.com/apache/spark/commit/472bbcfe5082920ac97bc2e29faeae78764141c7). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60885128 Jenkins, retest this please. (I might get lucky this time; the compilation failure looks random.)
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/2994#discussion_r19523356 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala --- @@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kafka
+
+import java.util.Properties
+
+import scala.reflect.ClassTag
+
+import kafka.producer.{ProducerConfig, KeyedMessage, Producer}
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.streaming.dstream.DStream
+
+/**
+ * Import this object in this form:
+ * {{{
+ *   import org.apache.spark.streaming.kafka.KafkaWriter._
+ * }}}
+ *
+ * Once imported, the `writeToKafka` can be called on any [[DStream]] object in this form:
+ * {{{
+ *   dstream.writeToKafka(producerConfig, f)
+ * }}}
+ */
+object KafkaWriter {
+  import scala.language.implicitConversions
+  /**
+   * This implicit method allows the user to call dstream.writeToKafka(..)
+   * @param dstream - DStream to write to Kafka
+   * @tparam T - The type of the DStream
+   * @tparam K - The type of the key to serialize to
+   * @tparam V - The type of the value to serialize to
+   * @return
+   */
+  implicit def createKafkaOutputWriter[T: ClassTag, K, V](dstream: DStream[T]): KafkaWriter[T] = {
+    new KafkaWriter[T](dstream)
+  }
+}
+
+/**
+ *
+ * This class can be used to write data to Kafka from Spark Streaming. To write data to Kafka
+ * simply `import org.apache.spark.streaming.kafka.KafkaWriter._` in your application and call
+ * `dstream.writeToKafka(producerConf, func)`
+ *
+ * Here is an example:
+ * {{{
+ * // Adding this line allows the user to call dstream.writeDStreamToKafka(..)
+ * import org.apache.spark.streaming.kafka.KafkaWriter._
+ *
+ * class ExampleWriter {
+ *   val instream = ssc.queueStream(toBe)
+ *   val producerConf = new Properties()
+ *   producerConf.put("serializer.class", "kafka.serializer.DefaultEncoder")
+ *   producerConf.put("key.serializer.class", "kafka.serializer.StringEncoder")
+ *   producerConf.put("metadata.broker.list", "kafka.example.com:5545")
+ *   producerConf.put("request.required.acks", "1")
+ *   instream.writeToKafka(producerConf,
+ *     (x: String) => new KeyedMessage[String, String]("default", null, x))
+ *   ssc.start()
+ * }
+ *
+ * }}}
+ * @param dstream - The [[DStream]] to be written to Kafka
+ *
+ */
+class KafkaWriter[T: ClassTag](@transient dstream: DStream[T]) {
+
+  /**
+   * To write data from a DStream to Kafka, call this function after creating the DStream. Once
+   * the DStream is passed into this function, all data coming from the DStream is written out to
+   * Kafka. The properties instance takes the configuration required to connect to the Kafka
+   * brokers in the standard Kafka format. The serializerFunc is a function that converts each
+   * element of the RDD to a Kafka [[KeyedMessage]]. This closure should be serializable - so it
+   * should use only instances of Serializables.
+   * @param producerConfig The configuration that can be used to connect to Kafka
+   * @param serializerFunc The function to convert the data from the stream into Kafka
+   *                       [[KeyedMessage]]s.
+   * @tparam K The type of the key
+   * @tparam V The type of the value
+   *
+   */
+  def writeToKafka[K, V](producerConfig: Properties,
+      serializerFunc: T => KeyedMessage[K, V]): Unit = {
+
+    // Broadcast the producer to avoid sending it every time.
+    val broadcastedConfig = dstream.ssc.sc.broadcast(producerConfig)
+
+    def func = (rdd: RDD[T]) => {
+      rdd.foreachPartition(events => {
+        // The ForEachDStream runs the function locally on the driver. So the
+        // ProducerCache from the driver is likely to get serialized and
+        //
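The scaladoc in the patch above relies on the "enrichment" pattern: importing `KafkaWriter._` brings an implicit conversion into scope, which is what lets `writeToKafka` be called directly on a `DStream`. The same mechanism can be shown with a self-contained sketch; the names `SinkWriter`/`writeToSink`, and the use of `Seq` in place of `DStream`, are illustrative stand-ins and not part of the patch.

```scala
// Minimal sketch of the implicit-conversion enrichment pattern used by the
// patch: an implicit def wraps an existing type so a new method appears on it.
object SinkWriter {
  import scala.language.implicitConversions

  // Mirrors createKafkaOutputWriter: wraps any Seq[T] in a SinkWriter[T].
  implicit def createSinkWriter[T](data: Seq[T]): SinkWriter[T] =
    new SinkWriter[T](data)
}

class SinkWriter[T](data: Seq[T]) {
  // Mirrors writeToKafka: serialize each element and hand it to a "sink"
  // (here we simply return the serialized values).
  def writeToSink[M](serializerFunc: T => M): Seq[M] = data.map(serializerFunc)
}

object Demo {
  import SinkWriter._

  def main(args: Array[String]): Unit = {
    // After the import, writeToSink can be called directly on a Seq,
    // just as writeToKafka is called on a DStream in the patch.
    val out = Seq(1, 2, 3).writeToSink(x => s"msg-$x")
    println(out.mkString(","))  // msg-1,msg-2,msg-3
  }
}
```

The one-import usage mirrors how `dstream.writeToKafka(producerConf, func)` becomes available only after `import org.apache.spark.streaming.kafka.KafkaWriter._`.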
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60885269 [Test build #22445 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22445/consoleFull) for PR 2615 at commit [`472bbcf`](https://github.com/apache/spark/commit/472bbcfe5082920ac97bc2e29faeae78764141c7). * This patch merges cleanly.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60885565 [Test build #22445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22445/consoleFull) for PR 2615 at commit [`472bbcf`](https://github.com/apache/spark/commit/472bbcfe5082920ac97bc2e29faeae78764141c7). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60885568 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22445/ Test FAILed.
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/2994#discussion_r19523702 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala --- (quotes the same `KafkaOutputWriter.scala` diff as the r19523356 comment above; the comment body itself is truncated in this digest)
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/2994#discussion_r19523728 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala --- @@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
--- End diff -- An empty line after the Apache header :).
[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2753#issuecomment-60886384 [Test build #22437 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22437/consoleFull) for PR 2753 at commit [`cadfd28`](https://github.com/apache/spark/commit/cadfd28f116f0dbca11e580a23caf82060bcf922). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2753#issuecomment-60886387 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22437/ Test PASSed.
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2994#discussion_r19523818 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ProducerCache.scala --- @@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kafka
+
+object ProducerCache {
+
+  private var producerOpt: Option[Any] = None
--- End diff -- Isn't this going to share one Producer across the entire JVM, and potentially unrelated applications?
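The concern raised above can be made concrete: a single `Option[Any]` slot is shared by every caller in the JVM, so the first configuration to arrive wins and later callers with different settings silently reuse its producer. One conventional alternative is to key the cache by configuration. The sketch below is a stdlib-only illustration, assuming a `FakeProducer` stand-in for the Kafka producer; none of these names come from the patch.

```scala
import scala.collection.concurrent.TrieMap

// Illustrative stand-in for a producer: remembers which config created it.
final class FakeProducer(val config: Map[String, String])

// JVM-wide single slot (the pattern questioned above): whichever config
// arrives first wins, and every later caller reuses that instance.
object SingleSlotCache {
  private var producerOpt: Option[FakeProducer] = None
  def getOrCreate(config: Map[String, String]): FakeProducer = synchronized {
    producerOpt.getOrElse {
      val p = new FakeProducer(config)
      producerOpt = Some(p)
      p
    }
  }
}

// Keying the cache by configuration gives each distinct config its own
// producer, so unrelated applications in one JVM no longer collide.
object KeyedCache {
  private val producers = TrieMap.empty[Map[String, String], FakeProducer]
  def getOrCreate(config: Map[String, String]): FakeProducer =
    producers.getOrElseUpdate(config, new FakeProducer(config))
}
```

With `SingleSlotCache`, a second caller passing a different broker list still gets the producer built from the first config; with `KeyedCache`, each config maps to its own instance.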
[GitHub] spark pull request: [SPARK-4062][Streaming]Add ReliableKafkaReceiv...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2991#issuecomment-60887196 [Test build #22438 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22438/consoleFull) for PR 2991 at commit [`09d57c5`](https://github.com/apache/spark/commit/09d57c5270cf876e52d49db702e2330c2b6a6e10). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4062][Streaming]Add ReliableKafkaReceiv...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2991#issuecomment-60887199 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22438/ Test PASSed.
[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2983#issuecomment-60887244 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22443/ Test FAILed.
[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2983#issuecomment-60887240 [Test build #22443 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22443/consoleFull) for PR 2983 at commit [`4ca62cd`](https://github.com/apache/spark/commit/4ca62cd306d96890a7a56da04710ae5548715c4a). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class UnscaledValue(child: Expression) extends UnaryExpression ` * `case class MakeDecimal(child: Expression, precision: Int, scale: Int) extends UnaryExpression ` * `case class MutableLiteral(var value: Any, dataType: DataType, nullable: Boolean = true)` * `case class PrecisionInfo(precision: Int, scale: Int)` * `case class DecimalType(precisionInfo: Option[PrecisionInfo]) extends FractionalType ` * `final class Decimal extends Ordered[Decimal] with Serializable ` * ` trait DecimalIsConflicted extends Numeric[Decimal] `
[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2995#issuecomment-60888203 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22439/ Test FAILed.
[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2995#issuecomment-60888200 [Test build #22439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22439/consoleFull) for PR 2995 at commit [`731331f`](https://github.com/apache/spark/commit/731331fdafe9ce6e4bf24dc1e6667942e1e59587). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class JavaModelWrapper(object):` * `class JavaVectorTransformer(JavaModelWrapper, VectorTransformer):` * `class StandardScalerModel(JavaVectorTransformer):` * `class IDFModel(JavaVectorTransformer):` * `class Word2VecModel(JavaVectorTransformer):` * `class MatrixFactorizationModel(JavaModelWrapper):` * `class MultivariateStatisticalSummary(JavaModelWrapper):` * `class DecisionTreeModel(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2996#issuecomment-60888681 [Test build #22442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22442/consoleFull) for PR 2996 at commit [`e028e84`](https://github.com/apache/spark/commit/e028e8443ff38e3617a1bdd0a2a3f5ec9b42d980). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4130][MLlib] Fixing libSVM parser bug w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2996#issuecomment-60888689 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22442/ Test FAILed.
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2992#issuecomment-60889264 [Test build #22440 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22440/consoleFull) for PR 2992 at commit [`b99db6c`](https://github.com/apache/spark/commit/b99db6caa0a5f2d6e69d5940b5c37e88914c5e36). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4129][MLlib] Performance tuning in Mult...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2992#issuecomment-60889271 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22440/ Test PASSed.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60889710 **[Test build #22436 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22436/consoleFull)** for PR 2615 at commit [`95c2e8e`](https://github.com/apache/spark/commit/95c2e8e86f69f3b1d11aad04f6ad14b0ede1950a) after a configured wait of `120m`.
[GitHub] spark pull request: [ SPARK-1812] Adjust build system and tests to...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2615#issuecomment-60889716 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22436/ Test FAILed.
[GitHub] spark pull request: Branch 1.1
Github user huozhanfeng commented on the pull request: https://github.com/apache/spark/pull/1824#issuecomment-60890031 @rxin I'm sorry, this was a mistaken operation; thanks for your help closing it.
[GitHub] spark pull request: [WIP][SPARK-4131][SQL] Writing data into the f...
GitHub user wangxiaojing opened a pull request: https://github.com/apache/spark/pull/2997 [WIP][SPARK-4131][SQL] Writing data into the filesystem from queries You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangxiaojing/spark SPARK-4131 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2997.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2997 commit f406f49749d590f230c953194a8c36e760fc9460 Author: wangxiaojing u9j...@gmail.com Date: 2014-10-29T08:58:19Z add Token
[GitHub] spark pull request: [SPARK-4062][Streaming]Add ReliableKafkaReceiv...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2991#issuecomment-60892857 [Test build #22446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22446/consoleFull) for PR 2991 at commit [`bb57d05`](https://github.com/apache/spark/commit/bb57d05b2e3579c9c3e59429918082937a99e87f). * This patch merges cleanly.
[GitHub] spark pull request: [WIP][SPARK-4131][SQL] Writing data into the f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2997#issuecomment-60893115 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527141 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -845,6 +858,198 @@ private[hive] object HiveQl {
     throw new NotImplementedError(s"No parse rules for:\n ${dumpTree(a).toString} ")
   }
+  // store the window def of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowDefMap = new ConcurrentHashMap[Long,Map[String, Seq[ASTNode]]]()
+
+  // store the window spec of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
--- End diff -- Space after //
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527153 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -845,6 +858,198 @@ private[hive] object HiveQl {
     throw new NotImplementedError(s"No parse rules for:\n ${dumpTree(a).toString} ")
   }
+  // store the window def of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
--- End diff -- Space after //
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527190

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -845,6 +858,198 @@ private[hive] object HiveQl {
       throw new NotImplementedError(s"No parse rules for:\n ${dumpTree(a).toString} ")
   }
+  // store the window def of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowDefMap = new ConcurrentHashMap[Long, Map[String, Seq[ASTNode]]]()
+
+  // store the window spec of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowPartitionsMap = new ConcurrentHashMap[Long, ArrayBuffer[Node]]()
+
+  protected def initWindow() = {
+    windowDefMap.put(Thread.currentThread().getId, Map[String, Seq[ASTNode]]())
+    windowPartitionsMap.put(Thread.currentThread().getId, new ArrayBuffer[Node]())
+  }
+
+  protected def checkWindowDef(windowClause: Option[Node]) = {
+    var winDefs = windowDefMap.get(Thread.currentThread().getId)
+
+    windowClause match {
+      case Some(window) => window.getChildren.foreach {
+        case Token("TOK_WINDOWDEF", Token(alias, Nil) :: Token("TOK_WINDOWSPEC", ws) :: Nil) => {
+          winDefs += alias -> ws
+        }
+      }
+      case None => //do nothing
+    }
+
+    windowDefMap.put(Thread.currentThread().getId, winDefs)
+  }
+
+  protected def translateWindowSpec(windowSpec: Seq[ASTNode]): Seq[ASTNode] = {
+    windowSpec match {
+      case Token(alias, Nil) :: Nil => translateWindowSpec(getWindowSpec(alias))
+      case Token(alias, Nil) :: range => {
+        val (partitionClause :: rowsRange :: valueRange :: Nil) = getClauses(
+          Seq(
+            "TOK_PARTITIONINGSPEC",
+            "TOK_WINDOWRANGE",
+            "TOK_WINDOWVALUES"),
+          translateWindowSpec(getWindowSpec(alias)))
+        partitionClause match {
+          case Some(partition) => partition.asInstanceOf[ASTNode] :: range
+          case None => range
+        }
+      }
+      case e => e
+    }
+  }
+
+  protected def getWindowSpec(alias: String): Seq[ASTNode] = {
+    windowDefMap.get(Thread.currentThread().getId).getOrElse(
+      alias, sys.error("no window def for " + alias))
+  }
+
+  protected def addWindowPartitions(partition: Node) = {
+    var winPartitions = windowPartitionsMap.get(Thread.currentThread().getId)
+    winPartitions += partition
+    windowPartitionsMap.put(Thread.currentThread().getId, winPartitions)
+  }
+
+  protected def getWindowPartitions(): Seq[Node] = {
+    windowPartitionsMap.get(Thread.currentThread().getId).toSeq
+  }
+
+  protected def checkWindowPartitions(): Option[Seq[ASTNode]] = {
+    val partitionUnits = new ArrayBuffer[Seq[ASTNode]]()
+
+    getWindowPartitions.map {
+      case Token("TOK_PARTITIONINGSPEC", partition) => Some(partition)
+      case _ => None
+    }.foreach {
+      case Some(partition) => {
+        if (partitionUnits.isEmpty) partitionUnits += partition
+        else {
+          //only add different window partitions
+          try {
+            partition zip partitionUnits.head foreach {
+              case (l, r) => l checkEquals r
+            }
+          } catch {
+            case re: RuntimeException => partitionUnits += partition
+          }
+        }
+      }
+      case None => //do nothing
--- End diff --

Space after //
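As context for the thread-id bookkeeping discussed above, here is a stripped-down, runnable sketch of the pattern the diff uses. `ParserState` is a hypothetical name, and the specs are plain `String`s rather than the `Seq[ASTNode]` the real code stores:

```scala
import java.util.concurrent.ConcurrentHashMap

// Illustrative stand-in for the patch's per-thread parser state: window
// definitions are kept in a shared map keyed by the current thread's id, so
// concurrent parses never see each other's aliases.
object ParserState {
  private val windowDefMap = new ConcurrentHashMap[Long, Map[String, String]]()

  private def tid: Long = Thread.currentThread().getId

  // Reset this thread's state before parsing a new statement.
  def init(): Unit = windowDefMap.put(tid, Map.empty)

  // Record a named window definition for the current thread only.
  def addDef(alias: String, spec: String): Unit =
    windowDefMap.put(tid, windowDefMap.get(tid) + (alias -> spec))

  // Resolve an alias, failing loudly if it was never defined on this thread.
  def lookup(alias: String): String =
    windowDefMap.get(tid).getOrElse(alias, sys.error("no window def for " + alias))
}
```

A `ThreadLocal[Map[String, String]]` would express the same per-thread isolation without an explicit thread-id key, and without the risk of leaking entries for threads that never clean up.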
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527163

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -845,6 +858,198 @@ private[hive] object HiveQl {
       throw new NotImplementedError(s"No parse rules for:\n ${dumpTree(a).toString} ")
   }
+  // store the window def of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowDefMap = new ConcurrentHashMap[Long, Map[String, Seq[ASTNode]]]()
+
+  // store the window spec of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowPartitionsMap = new ConcurrentHashMap[Long, ArrayBuffer[Node]]()
+
+  protected def initWindow() = {
+    windowDefMap.put(Thread.currentThread().getId, Map[String, Seq[ASTNode]]())
+    windowPartitionsMap.put(Thread.currentThread().getId, new ArrayBuffer[Node]())
+  }
+
+  protected def checkWindowDef(windowClause: Option[Node]) = {
+    var winDefs = windowDefMap.get(Thread.currentThread().getId)
+
+    windowClause match {
+      case Some(window) => window.getChildren.foreach {
+        case Token("TOK_WINDOWDEF", Token(alias, Nil) :: Token("TOK_WINDOWSPEC", ws) :: Nil) => {
+          winDefs += alias -> ws
+        }
+      }
+      case None => //do nothing
--- End diff --

Space after //
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527181

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -845,6 +858,198 @@ private[hive] object HiveQl {
       throw new NotImplementedError(s"No parse rules for:\n ${dumpTree(a).toString} ")
   }
+  // store the window def of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowDefMap = new ConcurrentHashMap[Long, Map[String, Seq[ASTNode]]]()
+
+  // store the window spec of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowPartitionsMap = new ConcurrentHashMap[Long, ArrayBuffer[Node]]()
+
+  protected def initWindow() = {
+    windowDefMap.put(Thread.currentThread().getId, Map[String, Seq[ASTNode]]())
+    windowPartitionsMap.put(Thread.currentThread().getId, new ArrayBuffer[Node]())
+  }
+
+  protected def checkWindowDef(windowClause: Option[Node]) = {
+    var winDefs = windowDefMap.get(Thread.currentThread().getId)
+
+    windowClause match {
+      case Some(window) => window.getChildren.foreach {
+        case Token("TOK_WINDOWDEF", Token(alias, Nil) :: Token("TOK_WINDOWSPEC", ws) :: Nil) => {
+          winDefs += alias -> ws
+        }
+      }
+      case None => //do nothing
+    }
+
+    windowDefMap.put(Thread.currentThread().getId, winDefs)
+  }
+
+  protected def translateWindowSpec(windowSpec: Seq[ASTNode]): Seq[ASTNode] = {
+    windowSpec match {
+      case Token(alias, Nil) :: Nil => translateWindowSpec(getWindowSpec(alias))
+      case Token(alias, Nil) :: range => {
+        val (partitionClause :: rowsRange :: valueRange :: Nil) = getClauses(
+          Seq(
+            "TOK_PARTITIONINGSPEC",
+            "TOK_WINDOWRANGE",
+            "TOK_WINDOWVALUES"),
+          translateWindowSpec(getWindowSpec(alias)))
+        partitionClause match {
+          case Some(partition) => partition.asInstanceOf[ASTNode] :: range
+          case None => range
+        }
+      }
+      case e => e
+    }
+  }
+
+  protected def getWindowSpec(alias: String): Seq[ASTNode] = {
+    windowDefMap.get(Thread.currentThread().getId).getOrElse(
+      alias, sys.error("no window def for " + alias))
+  }
+
+  protected def addWindowPartitions(partition: Node) = {
+    var winPartitions = windowPartitionsMap.get(Thread.currentThread().getId)
+    winPartitions += partition
+    windowPartitionsMap.put(Thread.currentThread().getId, winPartitions)
+  }
+
+  protected def getWindowPartitions(): Seq[Node] = {
+    windowPartitionsMap.get(Thread.currentThread().getId).toSeq
+  }
+
+  protected def checkWindowPartitions(): Option[Seq[ASTNode]] = {
+    val partitionUnits = new ArrayBuffer[Seq[ASTNode]]()
+
+    getWindowPartitions.map {
+      case Token("TOK_PARTITIONINGSPEC", partition) => Some(partition)
+      case _ => None
+    }.foreach {
+      case Some(partition) => {
+        if (partitionUnits.isEmpty) partitionUnits += partition
+        else {
+          //only add different window partitions
--- End diff --

Space after //
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527196

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -845,6 +858,198 @@ private[hive] object HiveQl {
       throw new NotImplementedError(s"No parse rules for:\n ${dumpTree(a).toString} ")
   }
+  // store the window def of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowDefMap = new ConcurrentHashMap[Long, Map[String, Seq[ASTNode]]]()
+
+  // store the window spec of current sql
+  //use thread id as key to avoid mistake when muti sqls parse at the same time
+  protected val windowPartitionsMap = new ConcurrentHashMap[Long, ArrayBuffer[Node]]()
+
+  protected def initWindow() = {
+    windowDefMap.put(Thread.currentThread().getId, Map[String, Seq[ASTNode]]())
+    windowPartitionsMap.put(Thread.currentThread().getId, new ArrayBuffer[Node]())
+  }
+
+  protected def checkWindowDef(windowClause: Option[Node]) = {
+    var winDefs = windowDefMap.get(Thread.currentThread().getId)
+
+    windowClause match {
+      case Some(window) => window.getChildren.foreach {
+        case Token("TOK_WINDOWDEF", Token(alias, Nil) :: Token("TOK_WINDOWSPEC", ws) :: Nil) => {
+          winDefs += alias -> ws
+        }
+      }
+      case None => //do nothing
+    }
+
+    windowDefMap.put(Thread.currentThread().getId, winDefs)
+  }
+
+  protected def translateWindowSpec(windowSpec: Seq[ASTNode]): Seq[ASTNode] = {
+    windowSpec match {
+      case Token(alias, Nil) :: Nil => translateWindowSpec(getWindowSpec(alias))
+      case Token(alias, Nil) :: range => {
+        val (partitionClause :: rowsRange :: valueRange :: Nil) = getClauses(
+          Seq(
+            "TOK_PARTITIONINGSPEC",
+            "TOK_WINDOWRANGE",
+            "TOK_WINDOWVALUES"),
+          translateWindowSpec(getWindowSpec(alias)))
+        partitionClause match {
+          case Some(partition) => partition.asInstanceOf[ASTNode] :: range
+          case None => range
+        }
+      }
+      case e => e
+    }
+  }
+
+  protected def getWindowSpec(alias: String): Seq[ASTNode] = {
+    windowDefMap.get(Thread.currentThread().getId).getOrElse(
+      alias, sys.error("no window def for " + alias))
+  }
+
+  protected def addWindowPartitions(partition: Node) = {
+    var winPartitions = windowPartitionsMap.get(Thread.currentThread().getId)
+    winPartitions += partition
+    windowPartitionsMap.put(Thread.currentThread().getId, winPartitions)
+  }
+
+  protected def getWindowPartitions(): Seq[Node] = {
+    windowPartitionsMap.get(Thread.currentThread().getId).toSeq
+  }
+
+  protected def checkWindowPartitions(): Option[Seq[ASTNode]] = {
+    val partitionUnits = new ArrayBuffer[Seq[ASTNode]]()
+
+    getWindowPartitions.map {
+      case Token("TOK_PARTITIONINGSPEC", partition) => Some(partition)
+      case _ => None
+    }.foreach {
+      case Some(partition) => {
+        if (partitionUnits.isEmpty) partitionUnits += partition
+        else {
+          //only add different window partitions
+          try {
+            partition zip partitionUnits.head foreach {
+              case (l, r) => l checkEquals r
+            }
+          } catch {
+            case re: RuntimeException => partitionUnits += partition
+          }
+        }
+      }
+      case None => //do nothing
+    }
+
+    //check whether all window partitions are same, we just support same window partition now
--- End diff --

Space after //
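The `checkWindowPartitions` logic quoted above (collect the distinct `PARTITION BY` specs, then require that they all agree) can be sketched with plain string sequences standing in for `ASTNode`s. `checkSinglePartition` and `distinctPartitions` are hypothetical names, not from the patch:

```scala
// Collect the distinct partition specs seen so far; the patch only supports
// queries where every window function shares one PARTITION BY clause.
def distinctPartitions(specs: Seq[Seq[String]]): Seq[Seq[String]] =
  specs.foldLeft(Seq.empty[Seq[String]]) { (units, p) =>
    if (units.contains(p)) units else units :+ p
  }

// Mirror of the check: None when there are no windows, the single shared spec
// when all windows agree, and an error otherwise.
def checkSinglePartition(specs: Seq[Seq[String]]): Option[Seq[String]] =
  distinctPartitions(specs) match {
    case Seq()       => None
    case Seq(single) => Some(single)
    case _           => sys.error("only identical window partitions are supported")
  }
```

Equality on `String` replaces the `checkEquals`-and-catch dance in the real code, which compares AST subtrees.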
[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/2685#issuecomment-60896714

A summary of what I've found with local testing:

- For master branch:
  - Check out a fresh copy, build the assembly jar first, then run `HashShuffleSuite`: pass
  - Check out a fresh copy, run `HashShuffleSuite` directly without building the assembly jar: fail
- For this PR: both approaches fail.

So I guess the problem should be related to Maven configurations.
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527811

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WindowFunction.scala ---
@@ -0,0 +1,353 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import java.util.HashMap
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.physical.AllTuples
+import org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution
+import org.apache.spark.sql.catalyst.errors._
+import scala.collection.mutable.ArrayBuffer
+import org.apache.spark.util.collection.CompactBuffer
+import org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection
+import org.apache.spark.sql.catalyst.expressions.Alias
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.SortPartitions
+
+/**
+ * :: DeveloperApi ::
+ * Groups input data by `partitionExpressions` and computes the `computeExpressions` for each
+ * group.
+ * @param partitionExpressions expressions that are evaluated to determine partition.
+ * @param functionExpressions expressions that are computed for each partition.
+ * @param child the input data source.
+ */
+@DeveloperApi
+case class WindowFunction(
+    partitionExpressions: Seq[Expression],
+    functionExpressions: Seq[NamedExpression],
+    child: SparkPlan)
+  extends UnaryNode {
+
+  override def requiredChildDistribution =
+    if (partitionExpressions == Nil) {
+      AllTuples :: Nil
+    } else {
+      ClusteredDistribution(partitionExpressions) :: Nil
+    }
+
+  // HACK: Generators don't correctly preserve their output through serializations so we grab
+  // out child's output attributes statically here.
+  private[this] val childOutput = child.output
+
+  override def output = functionExpressions.map(_.toAttribute)
+
+  /** A list of functions that need to be computed for each partition. */
+  private[this] val computeExpressions = new ArrayBuffer[AggregateExpression]
+
+  private[this] val otherExpressions = new ArrayBuffer[NamedExpression]
+
+  functionExpressions.foreach { sel =>
+    sel.collect {
+      case func: AggregateExpression => computeExpressions += func
+      case other: NamedExpression if (!other.isInstanceOf[Alias]) => otherExpressions += other
+    }
+  }
+
+  private[this] val functionAttributes = computeExpressions.map { func =>
+    func -> AttributeReference(s"funcResult:$func", func.dataType, func.nullable)() }
+
+  /** The schema of the result of all evaluations */
+  private[this] val resultAttributes =
+    otherExpressions.map(_.toAttribute) ++ functionAttributes.map(_._2)
+
+  private[this] val resultMap =
+    (otherExpressions.map { other => other -> other.toAttribute } ++ functionAttributes
+    ).toMap
+
+  private[this] val resultExpressions = functionExpressions.map { sel =>
+    sel.transform {
+      case e: Expression if resultMap.contains(e) => resultMap(e)
+    }
+  }
+
+  private[this] val sortExpressions =
+    if (child.isInstanceOf[SortPartitions]) {
+      child.asInstanceOf[SortPartitions].sortExpressions
+    } else if (child.isInstanceOf[Sort]) {
+      child.asInstanceOf[Sort].sortOrder
+    } else null
+
+  /** Creates a new function buffer for a partition. */
+  private[this] def newFunctionBuffer(): Array[AggregateFunction] = {
+    val buffer = new
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527805

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WindowFunction.scala ---
@@ -0,0 +1,353 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import java.util.HashMap
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.physical.AllTuples
+import org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution
+import org.apache.spark.sql.catalyst.errors._
+import scala.collection.mutable.ArrayBuffer
+import org.apache.spark.util.collection.CompactBuffer
+import org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection
+import org.apache.spark.sql.catalyst.expressions.Alias
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.SortPartitions
+
+/**
+ * :: DeveloperApi ::
+ * Groups input data by `partitionExpressions` and computes the `computeExpressions` for each
+ * group.
+ * @param partitionExpressions expressions that are evaluated to determine partition.
+ * @param functionExpressions expressions that are computed for each partition.
+ * @param child the input data source.
+ */
+@DeveloperApi
+case class WindowFunction(
+    partitionExpressions: Seq[Expression],
+    functionExpressions: Seq[NamedExpression],
+    child: SparkPlan)
+  extends UnaryNode {
+
+  override def requiredChildDistribution =
+    if (partitionExpressions == Nil) {
+      AllTuples :: Nil
+    } else {
+      ClusteredDistribution(partitionExpressions) :: Nil
+    }
+
+  // HACK: Generators don't correctly preserve their output through serializations so we grab
+  // out child's output attributes statically here.
+  private[this] val childOutput = child.output
+
+  override def output = functionExpressions.map(_.toAttribute)
+
+  /** A list of functions that need to be computed for each partition. */
+  private[this] val computeExpressions = new ArrayBuffer[AggregateExpression]
+
+  private[this] val otherExpressions = new ArrayBuffer[NamedExpression]
+
+  functionExpressions.foreach { sel =>
+    sel.collect {
+      case func: AggregateExpression => computeExpressions += func
+      case other: NamedExpression if (!other.isInstanceOf[Alias]) => otherExpressions += other
+    }
+  }
+
+  private[this] val functionAttributes = computeExpressions.map { func =>
+    func -> AttributeReference(s"funcResult:$func", func.dataType, func.nullable)() }
+
+  /** The schema of the result of all evaluations */
+  private[this] val resultAttributes =
+    otherExpressions.map(_.toAttribute) ++ functionAttributes.map(_._2)
+
+  private[this] val resultMap =
+    (otherExpressions.map { other => other -> other.toAttribute } ++ functionAttributes
+    ).toMap
+
+  private[this] val resultExpressions = functionExpressions.map { sel =>
+    sel.transform {
+      case e: Expression if resultMap.contains(e) => resultMap(e)
+    }
+  }
+
+  private[this] val sortExpressions =
+    if (child.isInstanceOf[SortPartitions]) {
+      child.asInstanceOf[SortPartitions].sortExpressions
+    } else if (child.isInstanceOf[Sort]) {
+      child.asInstanceOf[Sort].sortOrder
+    } else null
+
+  /** Creates a new function buffer for a partition. */
+  private[this] def newFunctionBuffer(): Array[AggregateFunction] = {
+    val buffer = new
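To make the operator's contract concrete, the following toy sketch (not code from the PR) evaluates one aggregate per partition and attaches the result to every input row, which is the essence of what `WindowFunction` does for an expression like `sum(x) OVER (PARTITION BY k)`. `InputRow` and `windowSum` are illustrative names:

```scala
// Toy model of a windowed aggregate: rows are grouped by the partition key,
// the aggregate is computed once per group, and every row in the group is
// paired with that group's result.
case class InputRow(key: String, value: Int)

def windowSum(rows: Seq[InputRow]): Seq[(InputRow, Int)] =
  rows.groupBy(_.key).valuesIterator.flatMap { part =>
    val total = part.map(_.value).sum  // the per-partition "compute expression"
    part.map(r => r -> total)          // same result repeated for each row
  }.toSeq
```

The real operator generalizes this by running an array of `AggregateFunction` buffers per partition and projecting the buffered results back through `resultExpressions`.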
[GitHub] spark pull request: [SPARK-1442] [SQL] window function implement
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/2953#discussion_r19527873

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveWindowFunctionSuite.scala ---
@@ -0,0 +1,314 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.spark.sql.hive._
+import org.apache.spark.sql.hive.test.TestHive
+import org.apache.spark.sql.hive.test.TestHive._
+import org.apache.spark.sql.{Row, SchemaRDD}
+
+class HiveWindowFunctionSuite extends HiveComparisonTest {
+
+  override def beforeAll() {
+    sql("DROP TABLE IF EXISTS part").collect()
+
+    sql(
+      """
+        |CREATE TABLE part(
+        |p_partkey INT,
+        |p_name STRING,
+        |p_mfgr STRING,
+        |p_brand STRING,
+        |p_type STRING,
+        |p_size INT,
+        |p_container STRING,
+        |p_retailprice DOUBLE,
+        |p_comment STRING
+        |)
+      """.stripMargin).collect()
+
+    //remove duplicate data in part_tiny.txt for hive bug
--- End diff --

Space after //
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/2841#issuecomment-60897965

Added more tests for filtering on nullable columns
[GitHub] spark pull request: [SPARK-3611] Show number of cores for each exe...
Github user devldevelopment commented on the pull request: https://github.com/apache/spark/pull/2980#issuecomment-60898133

Thanks for the feedback guys, updated with changes. Removed the WIP as well.
[GitHub] spark pull request: [SPARK-3611] Show number of cores for each exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2980#issuecomment-60898397

[Test build #22447 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22447/consoleFull) for PR 2980 at commit [`50a1592`](https://github.com/apache/spark/commit/50a15921576862ae99df5542f2e4d3bb36253d1b).

* This patch merges cleanly.
[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2841#issuecomment-60898409

[Test build #22448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22448/consoleFull) for PR 2841 at commit [`8282ba0`](https://github.com/apache/spark/commit/8282ba0752951fc9d7a7593c6f6c89815bb92b3a).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4003] [SQL] add 3 types for java SQL co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2850#issuecomment-60900756

[Test build #22449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22449/consoleFull) for PR 2850 at commit [`4c4292c`](https://github.com/apache/spark/commit/4c4292ccbd23cac1cb511ca9581bb9ac17ac037f).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4062][Streaming]Add ReliableKafkaReceiv...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2991#issuecomment-60901164

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22446/
[GitHub] spark pull request: [SPARK-4062][Streaming]Add ReliableKafkaReceiv...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2991#issuecomment-60901153

[Test build #22446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22446/consoleFull) for PR 2991 at commit [`bb57d05`](https://github.com/apache/spark/commit/bb57d05b2e3579c9c3e59429918082937a99e87f).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4003] [SQL] add 3 types for java SQL co...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2850#issuecomment-60906435

[Test build #22449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22449/consoleFull) for PR 2850 at commit [`4c4292c`](https://github.com/apache/spark/commit/4c4292ccbd23cac1cb511ca9581bb9ac17ac037f).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.