[GitHub] spark pull request: [SPARK-13920][BUILD] MIMA checks should apply ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/11751#issuecomment-197164762 LGTM assuming that MiMa checks pass. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56284643 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala --- @@ -81,8 +81,9 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul taskSetManager, tid, TaskState.FINISHED, TaskResultLost) return } + // TODO(josh): assumption that there is only one chunk here is a hack val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]]( -serializedTaskResult.get) +serializedTaskResult.get.getChunks().head) --- End diff -- This is a messy temporary hack caused by some interface mismatching; I'll see about fixing this so that we don't make any assumptions about there only being one chunk here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13281][CORE] Switch broadcast of RDD to...
Github user breakdawn commented on the pull request: https://github.com/apache/spark/pull/11735#issuecomment-197163774 Is it ok to test, or is there any action i can follow up? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13316][Streaming]add check to avoid reg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11753#issuecomment-197163615 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53275/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13316][Streaming]add check to avoid reg...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11753#issuecomment-197163614 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13316][Streaming]add check to avoid reg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11753#issuecomment-197163501 **[Test build #53275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53275/consoleFull)** for PR 11753 at commit [`51a0399`](https://github.com/apache/spark/commit/51a03994ae3f66fa80b41a488f27b3fe94abe0ff). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56284364 --- Diff: core/src/test/scala/org/apache/spark/io/ChunkedByteBufferSuite.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.io + +import java.nio.ByteBuffer + +import com.google.common.io.ByteStreams + +import org.apache.spark.SparkFunSuite +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.util.io.ChunkedByteBuffer + +class ChunkedByteBufferSuite extends SparkFunSuite { + + test("must have at least one chunk") { --- End diff -- I think it marginally simplified some initialization code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56284334 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.io + +import java.io.InputStream +import java.nio.ByteBuffer +import java.nio.channels.WritableByteChannel + +import com.google.common.primitives.UnsignedBytes +import io.netty.buffer.{ByteBuf, Unpooled} + +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.storage.BlockManager + +private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) { + require(chunks != null, "chunks must not be null") + require(chunks.nonEmpty, "Cannot create a ChunkedByteBuffer with no chunks") + require(chunks.forall(_.limit() > 0), "chunks must be non-empty") + require(chunks.forall(_.position() == 0), "chunks' positions must be 0") + + val limit: Long = chunks.map(_.limit().asInstanceOf[Long]).sum + + def this(byteBuffer: ByteBuffer) = { +this(Array(byteBuffer)) + } + + def writeFully(channel: WritableByteChannel): Unit = { +for (bytes <- getChunks()) { + while (bytes.remaining > 0) { +channel.write(bytes) + } +} + } + + def toNetty: ByteBuf = { +Unpooled.wrappedBuffer(getChunks(): _*) + } + + def toArray: Array[Byte] = { +if (limit >= Integer.MAX_VALUE) { + throw new UnsupportedOperationException( +s"cannot call toArray because buffer size ($limit bytes) exceeds maximum array size") +} +val byteChannel = new ByteArrayWritableChannel(limit.toInt) +writeFully(byteChannel) +byteChannel.close() +byteChannel.getData + } + + def toByteBuffer: ByteBuffer = { +if (chunks.length == 1) { + chunks.head +} else { + ByteBuffer.wrap(toArray) +} + } + + def toInputStream(dispose: Boolean = false): InputStream = { +new ChunkedByteBufferInputStream(this, dispose) + } + + def getChunks(): Array[ByteBuffer] = { +chunks.map(_.duplicate()) + } + + def copy(): ChunkedByteBuffer = { +val copiedChunks = getChunks().map { chunk => + // TODO: accept an allocator in this copy method to integrate with mem. accounting systems + val newChunk = ByteBuffer.allocate(chunk.limit()) + newChunk.put(chunk) + newChunk.flip() + newChunk +} +new ChunkedByteBuffer(copiedChunks) + } + + /** + * Attempt to clean up a ByteBuffer if it is memory-mapped. This uses an *unsafe* Sun API that + * might cause errors if one attempts to read from the unmapped buffer, but it's better than + * waiting for the GC to find it because that could lead to huge numbers of open files. There's + * unfortunately no standard API to do this. + */ + def dispose(): Unit = { +chunks.foreach(BlockManager.dispose) + } +} + +/** + * Reads data from a ChunkedByteBuffer, and optionally cleans it up using BlockManager.dispose() + * at the end of the stream (e.g. to close a memory-mapped file). + */ +private class ChunkedByteBufferInputStream( +var chunkedByteBuffer: ChunkedByteBuffer, +dispose: Boolean) --- End diff -- Yep, will do. The semantics of `dispose()` are a bit confusing given some other weird semantics surrounding its use elsewhere in the BlockManager. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To
[GitHub] spark pull request: [SPARK-13920][BUILD] MIMA checks should apply ...
Github user dongjoon-hyun commented on the pull request: https://github.com/apache/spark/pull/11751#issuecomment-197162819 @JoshRosen , I removed the legacy code and related comments now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56284289 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala --- @@ -81,8 +81,9 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul taskSetManager, tid, TaskState.FINISHED, TaskResultLost) return } + // TODO(josh): assumption that there is only one chunk here is a hack val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]]( -serializedTaskResult.get) +serializedTaskResult.get.getChunks().head) --- End diff -- i guess it is because on the write side we enforced it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56284273 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala --- @@ -81,8 +81,9 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul taskSetManager, tid, TaskState.FINISHED, TaskResultLost) return } + // TODO(josh): assumption that there is only one chunk here is a hack val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]]( -serializedTaskResult.get) +serializedTaskResult.get.getChunks().head) --- End diff -- this is not safe to do is it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56284277 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.io + +import java.io.InputStream +import java.nio.ByteBuffer +import java.nio.channels.WritableByteChannel + +import com.google.common.primitives.UnsignedBytes +import io.netty.buffer.{ByteBuf, Unpooled} + +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.storage.BlockManager + +private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) { + require(chunks != null, "chunks must not be null") + require(chunks.nonEmpty, "Cannot create a ChunkedByteBuffer with no chunks") + require(chunks.forall(_.limit() > 0), "chunks must be non-empty") + require(chunks.forall(_.position() == 0), "chunks' positions must be 0") + + val limit: Long = chunks.map(_.limit().asInstanceOf[Long]).sum + + def this(byteBuffer: ByteBuffer) = { +this(Array(byteBuffer)) + } + + def writeFully(channel: WritableByteChannel): Unit = { +for (bytes <- getChunks()) { + while (bytes.remaining > 0) { +channel.write(bytes) + } +} + } + + def toNetty: ByteBuf = { +Unpooled.wrappedBuffer(getChunks(): _*) + } + + def toArray: Array[Byte] = { +if (limit >= Integer.MAX_VALUE) { + throw new UnsupportedOperationException( +s"cannot call toArray because buffer size ($limit bytes) exceeds maximum array size") +} +val byteChannel = new ByteArrayWritableChannel(limit.toInt) +writeFully(byteChannel) +byteChannel.close() +byteChannel.getData + } + + def toByteBuffer: ByteBuffer = { +if (chunks.length == 1) { + chunks.head +} else { + ByteBuffer.wrap(toArray) +} + } + + def toInputStream(dispose: Boolean = false): InputStream = { +new ChunkedByteBufferInputStream(this, dispose) + } + + def getChunks(): Array[ByteBuffer] = { +chunks.map(_.duplicate()) + } + + def copy(): ChunkedByteBuffer = { +val copiedChunks = getChunks().map { chunk => + // TODO: accept an allocator in this copy method to integrate with mem. accounting systems + val newChunk = ByteBuffer.allocate(chunk.limit()) + newChunk.put(chunk) + newChunk.flip() + newChunk +} +new ChunkedByteBuffer(copiedChunks) + } + + /** + * Attempt to clean up a ByteBuffer if it is memory-mapped. This uses an *unsafe* Sun API that + * might cause errors if one attempts to read from the unmapped buffer, but it's better than + * waiting for the GC to find it because that could lead to huge numbers of open files. There's + * unfortunately no standard API to do this. + */ + def dispose(): Unit = { +chunks.foreach(BlockManager.dispose) --- End diff -- Yeah, we should definitely move it into a different companion object or utilities class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13885][YARN] Fix attempt id regression ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/11721#discussion_r56284175 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -374,12 +374,6 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli throw new SparkException("An application name must be set in your configuration") } -// System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster -if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) { --- End diff -- I see, removing this will still get other error messages if trying to run cluster mode with such way, but maybe a little odd as you mentioned. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13920][BUILD] MIMA checks should apply ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11751#issuecomment-197161777 **[Test build #53279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53279/consoleFull)** for PR 11751 at commit [`16121b0`](https://github.com/apache/spark/commit/16121b04daff0e4db86bcd2ff2f1b904b067f23d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56284073 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.io + +import java.io.InputStream +import java.nio.ByteBuffer +import java.nio.channels.WritableByteChannel + +import com.google.common.primitives.UnsignedBytes +import io.netty.buffer.{ByteBuf, Unpooled} + +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.storage.BlockManager + +private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) { + require(chunks != null, "chunks must not be null") + require(chunks.nonEmpty, "Cannot create a ChunkedByteBuffer with no chunks") + require(chunks.forall(_.limit() > 0), "chunks must be non-empty") + require(chunks.forall(_.position() == 0), "chunks' positions must be 0") + + val limit: Long = chunks.map(_.limit().asInstanceOf[Long]).sum + + def this(byteBuffer: ByteBuffer) = { +this(Array(byteBuffer)) + } + + def writeFully(channel: WritableByteChannel): Unit = { +for (bytes <- getChunks()) { + while (bytes.remaining > 0) { +channel.write(bytes) + } +} + } + + def toNetty: ByteBuf = { +Unpooled.wrappedBuffer(getChunks(): _*) + } + + def toArray: Array[Byte] = { +if (limit >= Integer.MAX_VALUE) { + throw new UnsupportedOperationException( +s"cannot call toArray because buffer size ($limit bytes) exceeds maximum array size") +} +val byteChannel = new ByteArrayWritableChannel(limit.toInt) +writeFully(byteChannel) +byteChannel.close() +byteChannel.getData + } + + def toByteBuffer: ByteBuffer = { +if (chunks.length == 1) { + chunks.head +} else { + ByteBuffer.wrap(toArray) +} + } + + def toInputStream(dispose: Boolean = false): InputStream = { +new ChunkedByteBufferInputStream(this, dispose) + } + + def getChunks(): Array[ByteBuffer] = { +chunks.map(_.duplicate()) + } + + def copy(): ChunkedByteBuffer = { +val copiedChunks = getChunks().map { chunk => + // TODO: accept an allocator in this copy method to integrate with mem. accounting systems + val newChunk = ByteBuffer.allocate(chunk.limit()) + newChunk.put(chunk) + newChunk.flip() + newChunk +} +new ChunkedByteBuffer(copiedChunks) + } + + /** + * Attempt to clean up a ByteBuffer if it is memory-mapped. This uses an *unsafe* Sun API that + * might cause errors if one attempts to read from the unmapped buffer, but it's better than + * waiting for the GC to find it because that could lead to huge numbers of open files. There's + * unfortunately no standard API to do this. + */ + def dispose(): Unit = { +chunks.foreach(BlockManager.dispose) --- End diff -- maybe move BlockManager.dispose into an util and call it there? we reduces some coupling this way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56284010 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.io + +import java.io.InputStream +import java.nio.ByteBuffer +import java.nio.channels.WritableByteChannel + +import com.google.common.primitives.UnsignedBytes +import io.netty.buffer.{ByteBuf, Unpooled} + +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.storage.BlockManager + +private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) { + require(chunks != null, "chunks must not be null") + require(chunks.nonEmpty, "Cannot create a ChunkedByteBuffer with no chunks") + require(chunks.forall(_.limit() > 0), "chunks must be non-empty") + require(chunks.forall(_.position() == 0), "chunks' positions must be 0") + + val limit: Long = chunks.map(_.limit().asInstanceOf[Long]).sum + + def this(byteBuffer: ByteBuffer) = { +this(Array(byteBuffer)) + } + + def writeFully(channel: WritableByteChannel): Unit = { +for (bytes <- getChunks()) { + while (bytes.remaining > 0) { +channel.write(bytes) + } +} + } + + def toNetty: ByteBuf = { +Unpooled.wrappedBuffer(getChunks(): _*) + } + + def toArray: Array[Byte] = { +if (limit >= Integer.MAX_VALUE) { + throw new UnsupportedOperationException( +s"cannot call toArray because buffer size ($limit bytes) exceeds maximum array size") +} +val byteChannel = new ByteArrayWritableChannel(limit.toInt) +writeFully(byteChannel) +byteChannel.close() +byteChannel.getData + } + + def toByteBuffer: ByteBuffer = { +if (chunks.length == 1) { + chunks.head +} else { + ByteBuffer.wrap(toArray) +} + } + + def toInputStream(dispose: Boolean = false): InputStream = { +new ChunkedByteBufferInputStream(this, dispose) + } + + def getChunks(): Array[ByteBuffer] = { +chunks.map(_.duplicate()) + } + + def copy(): ChunkedByteBuffer = { +val copiedChunks = getChunks().map { chunk => + // TODO: accept an allocator in this copy method to integrate with mem. accounting systems + val newChunk = ByteBuffer.allocate(chunk.limit()) + newChunk.put(chunk) + newChunk.flip() + newChunk +} +new ChunkedByteBuffer(copiedChunks) + } + + /** + * Attempt to clean up a ByteBuffer if it is memory-mapped. This uses an *unsafe* Sun API that + * might cause errors if one attempts to read from the unmapped buffer, but it's better than + * waiting for the GC to find it because that could lead to huge numbers of open files. There's + * unfortunately no standard API to do this. + */ + def dispose(): Unit = { +chunks.foreach(BlockManager.dispose) + } +} + +/** + * Reads data from a ChunkedByteBuffer, and optionally cleans it up using BlockManager.dispose() + * at the end of the stream (e.g. to close a memory-mapped file). + */ +private class ChunkedByteBufferInputStream( +var chunkedByteBuffer: ChunkedByteBuffer, +dispose: Boolean) --- End diff -- never mind the close part makes sense. just document dispose. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands,
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56283999 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.io + +import java.io.InputStream +import java.nio.ByteBuffer +import java.nio.channels.WritableByteChannel + +import com.google.common.primitives.UnsignedBytes +import io.netty.buffer.{ByteBuf, Unpooled} + +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.storage.BlockManager + +private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) { + require(chunks != null, "chunks must not be null") + require(chunks.nonEmpty, "Cannot create a ChunkedByteBuffer with no chunks") + require(chunks.forall(_.limit() > 0), "chunks must be non-empty") + require(chunks.forall(_.position() == 0), "chunks' positions must be 0") + + val limit: Long = chunks.map(_.limit().asInstanceOf[Long]).sum + + def this(byteBuffer: ByteBuffer) = { +this(Array(byteBuffer)) + } + + def writeFully(channel: WritableByteChannel): Unit = { +for (bytes <- getChunks()) { + while (bytes.remaining > 0) { +channel.write(bytes) + } +} + } + + def toNetty: ByteBuf = { +Unpooled.wrappedBuffer(getChunks(): _*) + } + + def toArray: Array[Byte] = { --- End diff -- same for toByteBuffer --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56283991 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.io + +import java.io.InputStream +import java.nio.ByteBuffer +import java.nio.channels.WritableByteChannel + +import com.google.common.primitives.UnsignedBytes +import io.netty.buffer.{ByteBuf, Unpooled} + +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.storage.BlockManager + +private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) { + require(chunks != null, "chunks must not be null") + require(chunks.nonEmpty, "Cannot create a ChunkedByteBuffer with no chunks") + require(chunks.forall(_.limit() > 0), "chunks must be non-empty") + require(chunks.forall(_.position() == 0), "chunks' positions must be 0") + + val limit: Long = chunks.map(_.limit().asInstanceOf[Long]).sum + + def this(byteBuffer: ByteBuffer) = { +this(Array(byteBuffer)) + } + + def writeFully(channel: WritableByteChannel): Unit = { +for (bytes <- getChunks()) { + while (bytes.remaining > 0) { +channel.write(bytes) + } +} + } + + def toNetty: ByteBuf = { +Unpooled.wrappedBuffer(getChunks(): _*) + } + + def toArray: Array[Byte] = { --- End diff -- should document this throws exceptions if size doesn't fit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13894][SQL] SqlContext.range return typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11730#issuecomment-197161220 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53266/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56283895 --- Diff: core/src/test/scala/org/apache/spark/io/ChunkedByteBufferSuite.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.io + +import java.nio.ByteBuffer + +import com.google.common.io.ByteStreams + +import org.apache.spark.SparkFunSuite +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.util.io.ChunkedByteBuffer + +class ChunkedByteBufferSuite extends SparkFunSuite { + + test("must have at least one chunk") { --- End diff -- any reason we want to enforce this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13894][SQL] SqlContext.range return typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11730#issuecomment-197161216 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13921] Store serialized blocks as multi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11748#discussion_r56283941 --- Diff: core/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala --- @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.io + +import java.io.InputStream +import java.nio.ByteBuffer +import java.nio.channels.WritableByteChannel + +import com.google.common.primitives.UnsignedBytes +import io.netty.buffer.{ByteBuf, Unpooled} + +import org.apache.spark.network.util.ByteArrayWritableChannel +import org.apache.spark.storage.BlockManager + +private[spark] class ChunkedByteBuffer(var chunks: Array[ByteBuffer]) { + require(chunks != null, "chunks must not be null") + require(chunks.nonEmpty, "Cannot create a ChunkedByteBuffer with no chunks") + require(chunks.forall(_.limit() > 0), "chunks must be non-empty") + require(chunks.forall(_.position() == 0), "chunks' positions must be 0") + + val limit: Long = chunks.map(_.limit().asInstanceOf[Long]).sum + + def this(byteBuffer: ByteBuffer) = { +this(Array(byteBuffer)) + } + + def writeFully(channel: WritableByteChannel): Unit = { +for (bytes <- getChunks()) { + while (bytes.remaining > 0) { +channel.write(bytes) + } +} + } + + def toNetty: ByteBuf = { +Unpooled.wrappedBuffer(getChunks(): _*) + } + + def toArray: Array[Byte] = { +if (limit >= Integer.MAX_VALUE) { + throw new UnsupportedOperationException( +s"cannot call toArray because buffer size ($limit bytes) exceeds maximum array size") +} +val byteChannel = new ByteArrayWritableChannel(limit.toInt) +writeFully(byteChannel) +byteChannel.close() +byteChannel.getData + } + + def toByteBuffer: ByteBuffer = { +if (chunks.length == 1) { + chunks.head +} else { + ByteBuffer.wrap(toArray) +} + } + + def toInputStream(dispose: Boolean = false): InputStream = { +new ChunkedByteBufferInputStream(this, dispose) + } + + def getChunks(): Array[ByteBuffer] = { +chunks.map(_.duplicate()) + } + + def copy(): ChunkedByteBuffer = { +val copiedChunks = getChunks().map { chunk => + // TODO: accept an allocator in this copy method to integrate with mem. accounting systems + val newChunk = ByteBuffer.allocate(chunk.limit()) + newChunk.put(chunk) + newChunk.flip() + newChunk +} +new ChunkedByteBuffer(copiedChunks) + } + + /** + * Attempt to clean up a ByteBuffer if it is memory-mapped. This uses an *unsafe* Sun API that + * might cause errors if one attempts to read from the unmapped buffer, but it's better than + * waiting for the GC to find it because that could lead to huge numbers of open files. There's + * unfortunately no standard API to do this. + */ + def dispose(): Unit = { +chunks.foreach(BlockManager.dispose) + } +} + +/** + * Reads data from a ChunkedByteBuffer, and optionally cleans it up using BlockManager.dispose() + * at the end of the stream (e.g. to close a memory-mapped file). + */ +private class ChunkedByteBufferInputStream( +var chunkedByteBuffer: ChunkedByteBuffer, +dispose: Boolean) --- End diff -- we should document what dispose means. does it mean releasing the buffer upon close? why would we ever want to do that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail:
[GitHub] spark pull request: [SPARK-13894][SQL] SqlContext.range return typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11730#issuecomment-197160978 **[Test build #53266 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53266/consoleFull)** for PR 11730 at commit [`31967ac`](https://github.com/apache/spark/commit/31967ac4a97174c11706930e09a509b08cdc7d09). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/11694 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13904][Scheduler]Add support for plugga...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11723#issuecomment-197159617 Hmm I'm not sure if it makes sense to create a public API for this at this point. This is directly exposing something that is very internal to the current implementation of Spark (SchedulerBackend), and these are by no means stable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-197159610 LGTM. Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13920][BUILD] MIMA checks should apply ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/11751#discussion_r56283381 --- Diff: tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala --- @@ -39,12 +39,14 @@ import org.clapper.classutil.ClassFinder object GenerateMIMAIgnore { private val classLoader = Thread.currentThread().getContextClassLoader private val mirror = runtimeMirror(classLoader) + // SPARK-13920: MIMA checks should apply to @Experimental and @DeveloperAPI APIs + private val ignoreExpDevApi = false --- End diff -- Oh, I see. I'll remove all then. Thank you for fast review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13809][SQL] State store for stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11645#issuecomment-197157889 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13809][SQL] State store for stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11645#issuecomment-197157899 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53268/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13917] [SQL] generate broadcast semi jo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/11742 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13809][SQL] State store for stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11645#issuecomment-197157762 **[Test build #53268 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53268/consoleFull)** for PR 11645 at commit [`8123818`](https://github.com/apache/spark/commit/81238189f560f5df846b36e5cb5e50532f6f82a3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13917] [SQL] generate broadcast semi jo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11742#issuecomment-197157622 **[Test build #53278 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53278/consoleFull)** for PR 11742 at commit [`f1ca6f2`](https://github.com/apache/spark/commit/f1ca6f250d85ee9ba9fcdbb308486588a527eb87). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/11743#issuecomment-197157236 Will do that when implement codegen support. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13920][BUILD] MIMA checks should apply ...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/11751#discussion_r56282812 --- Diff: tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala --- @@ -39,12 +39,14 @@ import org.clapper.classutil.ClassFinder object GenerateMIMAIgnore { private val classLoader = Thread.currentThread().getContextClassLoader private val mirror = runtimeMirror(classLoader) + // SPARK-13920: MIMA checks should apply to @Experimental and @DeveloperAPI APIs + private val ignoreExpDevApi = false --- End diff -- Rather than adding an on/off switch, I'd just go ahead and completely remove these methods. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13885][YARN] Fix attempt id regression ...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11721#issuecomment-197156547 I think `SchedulerExtensionService` can still be worked, since I still maintain the full attempt id in `YarnScheduler`, only change to the simple counter for Spark scheduler. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-529] [core] [yarn] Add type-safe config...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10205#issuecomment-197156120 @vanzin I finally had some time to look over this change. I like the direction this is going to apply more semantics for configs, but I find the pre-existing SQLConf much easier to understand as its class structure is much simpler. A few questions/comments: - What's the difference between OptionalConfigEntry and something that has default value null? I only find OptionalConfigEntry used in the setter, but nothing specific to a getter. - Why is the builder pattern better than just a ctor with default argument values? - Function naming is inconsistent. E.g. "withDefault" and "doc" vs "withDoc" - The convention is for state mutation functions (e.g. internal) to have parentheses. Right now they look like some variable that's being accessed. - Do we really need all these classes? Even if we use the builder pattern, I'd imagine we can live with only two classes, a builder and a config. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13903][SQL] Modify output nullability w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11722#issuecomment-197155618 **[Test build #53277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53277/consoleFull)** for PR 11722 at commit [`93a73b7`](https://github.com/apache/spark/commit/93a73b79fbeee9c1bd722f5aa66257409d3c2512). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13904][Scheduler]Add support for plugga...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11723#issuecomment-197154731 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13904][Scheduler]Add support for plugga...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11723#issuecomment-197154732 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53264/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13904][Scheduler]Add support for plugga...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11723#issuecomment-197154650 **[Test build #53264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53264/consoleFull)** for PR 11723 at commit [`800834f`](https://github.com/apache/spark/commit/800834f24ad1f0c4a68d8d49f600db6570d100ef). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait ExternalClusterManager ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11720#issuecomment-197154174 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11720#issuecomment-197154175 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53265/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13902][SCHEDULER] Make DAGScheduler.get...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11720#issuecomment-197154079 **[Test build #53265 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53265/consoleFull)** for PR 11720 at commit [`0ea3fc8`](https://github.com/apache/spark/commit/0ea3fc838f689729794b6ea3aaf0b88a339ec20c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13852][YARN]handle the InterruptedExcep...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11692#issuecomment-197154066 I see your point, so the real issue should only be the log issue. But marking the state as `FINISHED` with `SUCCEED` should be open to question, since here we don't know the real state of application, if it is finished with failure, say job is aborted due to continuous stage failure, should we still mark this as `SUCCEED`? So here I'm conservative to this change, because: 1. This issue is happened rarely and does no harm to the result of application. 2. We cannot get the real exit state of application, so marking as `SUCCEED` should be open to question. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13924][SQL] officially support multi-in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11754#issuecomment-197153176 **[Test build #53274 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53274/consoleFull)** for PR 11754 at commit [`1ff3dec`](https://github.com/apache/spark/commit/1ff3dec5f73e53aa5a4fb5ab86a9d55e62edc202). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13889][YARN] Fix integer overflow when ...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/11713#issuecomment-197153189 LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13903][SQL] Modify output nullability w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11722#issuecomment-197153188 **[Test build #53276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53276/consoleFull)** for PR 11722 at commit [`5bf4b4b`](https://github.com/apache/spark/commit/5bf4b4b544ef2aa25d93c974e94f8314a6626ef7). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13316][Streaming]add check to avoid reg...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11753#issuecomment-197153177 **[Test build #53275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53275/consoleFull)** for PR 11753 at commit [`51a0399`](https://github.com/apache/spark/commit/51a03994ae3f66fa80b41a488f27b3fe94abe0ff). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13924][SQL] officially support multi-in...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/11754#issuecomment-197152664 cc @hvanhovell --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3308][SQL][FOLLOW-UP] Parse JSON rows h...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/11752#issuecomment-197152647 cc @yhuai (Since the JIRA is pretty old one, I was confused if I should make a follow-up like this but I just made this since it is a follow-up for the support of rows wrapped with an array). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13924][SQL] officially support multi-in...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/11754 [SPARK-13924][SQL] officially support multi-insert ## What changes were proposed in this pull request? There is a feature of hive SQL called multi-insert. For example: ``` FROM src INSERT OVERWRITE TABLE dest1 SELECT key + 1 INSERT OVERWRITE TABLE dest2 SELECT key WHERE key > 2 INSERT OVERWRITE TABLE dest3 SELECT col EXPLODE(arr) exp AS col ... ``` We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE. This PR removes these limitations and make us fully support multi-insert. ## How was this patch tested? new tests in `SQLQuerySuite` You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark lateral-view Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11754.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11754 commit 1ff3dec5f73e53aa5a4fb5ab86a9d55e62edc202 Author: Wenchen FanDate: 2016-03-16T04:35:10Z officially support multi-insert --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3308][SQL][FOLLOW-UP] Parse JSON rows h...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/11752#issuecomment-197152653 Just to make sure, I am doing this partly due to [SPARK-13764](https://issues.apache.org/jira/browse/SPARK-13764), which deals with parse modes just like in CSV data sources. So, the behaviour for failed records should be consistent. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-197152626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53269/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-197152624 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-197152557 **[Test build #53269 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53269/consoleFull)** for PR 11694 at commit [`f89cdf0`](https://github.com/apache/spark/commit/f89cdf01d94faab7d6f9372df35033afe695f8aa). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13889][YARN] Fix integer overflow when ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11713#issuecomment-197152507 cc @andrewor14 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3308][SQL][FOLLOW-UP] Parse JSON rows h...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/11752#issuecomment-197152532 Just to make sure, I am doing this partly due to [SPARK-13764](https://issues.apache.org/jira/browse/SPARK-13764), which deals with parse modes just like in CSV data sources. So, the behaviour for failed records should be consistent. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13316][Streaming]add check to avoid reg...
GitHub user mwws opened a pull request: https://github.com/apache/spark/pull/11753 [SPARK-13316][Streaming]add check to avoid registering new DStream when recovering from CP When creating a recoverable streaming job, it must be no new DStream registered after a StreamingContext has been recreated from checkpoint. Or you will see following exception: ``` org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ConstantInputDStream@724797ab has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:311) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:89) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:332) at scala.Option.orElse(Option.scala:289) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:329) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) ``` This PR is to add meaningful error message to make it obvious at first. I manually tested the PR with following repo code ```scala def createStreamingContext(): StreamingContext = { val ssc = new StreamingContext(sparkConf, Duration(1000)) ssc.checkpoint(checkpointDir) ssc } val ssc = StreamingContext.getOrCreate(checkpointDir), createStreamingContext) val socketStream = ssc.socketTextStream(...) socketStream.checkpoint(Seconds(1)) socketStream.foreachRDD(...) You can merge this pull request into a Git repository by running: $ git pull https://github.com/mwws/spark SPARK-13316 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11753.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11753 commit 51a03994ae3f66fa80b41a488f27b3fe94abe0ff Author: mwwsDate: 2016-03-16T04:20:43Z add check to avoid registering new DStream when recovering from Checkpoint --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3308][SQL][FOLLOW-UP] Parse JSON rows h...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/11752#issuecomment-197152181 cc @yhuai (Since the JIRA is pretty old one, I was confused if I should make a follow-up like this but I just made this since it is a follow-up for the support of rows wrapped with an array). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3308][SQL][FOLLOW-UP] Parse JSON rows h...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11752#issuecomment-197152249 **[Test build #53273 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53273/consoleFull)** for PR 11752 at commit [`4ab1e80`](https://github.com/apache/spark/commit/4ab1e80e8bca868fc2387cf74004cc44b6e4664a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][TEST][SQL] Remove wrong "expected" par...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/11718 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3308][SQL][FOLLOW-UP] Parse JSON rows h...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/11752 [SPARK-3308][SQL][FOLLOW-UP] Parse JSON rows having an array type and a struct type in the same fieild ## What changes were proposed in this pull request? This https://github.com/apache/spark/pull/2400 added the support to parse JSON rows wrapped with an array. However, when the given data contains array data and struct data in the same field as below: ```json {"a": {"b": 1}} {"a": []} ``` and the schema is given as below: ```scala val schema = StructType( StructField("a", StructType( StructField("b", StringType) :: Nil )) :: Nil) ``` This throws an exception below: ```java Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) ... ``` For other data types, in this case it converts the given values are `null` but only this case emits an exception. This PR makes the support for wrapped rows applied only at the top level. ## How was this patch tested? Unit tests were used and `./dev/run_tests` for code style tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark SPARK-3308-follow-up Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11752.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11752 commit 4ab1e80e8bca868fc2387cf74004cc44b6e4664a Author: hyukjinkwonDate: 2016-03-16T04:36:56Z Parse JSON rows having an array type and a struct type in the same field. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][TEST][SQL] Remove wrong "expected" par...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11718#issuecomment-197151890 Thanks - merging in master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][TEST][SQL] Remove wrong "expected" par...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11718#issuecomment-197151328 **[Test build #2643 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2643/consoleFull)** for PR 11718 at commit [`0bd1bbc`](https://github.com/apache/spark/commit/0bd1bbc9fa267e3c506d1f51495efa58a1b88796). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13898][SQL] Merge DatasetHolder and Dat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11737#issuecomment-197151199 **[Test build #53272 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53272/consoleFull)** for PR 11737 at commit [`e422f52`](https://github.com/apache/spark/commit/e422f529e94506905563d0b487851a891b146605). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13903][SQL] Modify output nullability w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11722#issuecomment-197149789 **[Test build #53271 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53271/consoleFull)** for PR 11722 at commit [`aef73d5`](https://github.com/apache/spark/commit/aef73d5e71b9c997ac98a7172036c7c79f9e9b1c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13920][BUILD] MIMA checks should apply ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11751#issuecomment-197149790 **[Test build #53270 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53270/consoleFull)** for PR 11751 at commit [`b5e24a4`](https://github.com/apache/spark/commit/b5e24a41cd2b2bf021ca1edf115f26ad8b467ca6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13920][BUILD] MIMA checks should apply ...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/11751 [SPARK-13920][BUILD] MIMA checks should apply to @Experimental and @DeveloperAPI APIs ## What changes were proposed in this pull request? We are able to change `@Experimental` and `@DeveloperAPI` API freely but also should monitor and manage those API carefully. This PR for [SPARK-13920](https://issues.apache.org/jira/browse/SPARK-13920) enables MiMa check for by the followings. - Adds a on/off switch flag code for `@Experimental` and `@Developer` API. - Adds MiMa filters for `@Experimental` and `@Developer` API. Note that this PR does not delete the existing functionality. We can turn off this feature if needed. ## How was this patch tested? Pass the Jenkins tests (including MiMa). You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjoon-hyun/spark SPARK-13920 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11751.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11751 commit b5e24a41cd2b2bf021ca1edf115f26ad8b467ca6 Author: Dongjoon HyunDate: 2016-03-16T04:19:11Z [SPARK-13920][BUILD] MIMA checks should apply to @Experimental and @DeveloperAPI APIs --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13801][SQL] DataFrame.col should return...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11632#issuecomment-197144342 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53261/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13801][SQL] DataFrame.col should return...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11632#issuecomment-197144341 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13801][SQL] DataFrame.col should return...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11632#issuecomment-197144257 **[Test build #53261 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53261/consoleFull)** for PR 11632 at commit [`6faf40e`](https://github.com/apache/spark/commit/6faf40e554d28ef625d3d818328254fe16849a52). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9837] [ML] R-like summary statistics fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11694#issuecomment-197143000 **[Test build #53269 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53269/consoleFull)** for PR 11694 at commit [`f89cdf0`](https://github.com/apache/spark/commit/f89cdf01d94faab7d6f9372df35033afe695f8aa). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-529] [sql] Modify SQLConf to use new co...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11570#issuecomment-197142803 @vanzin I don't think anybody gets a ping message unless you explicitly cc them. Will take a look. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13600] [MLlib] Use approxQuantile from ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11553#issuecomment-197140007 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13600] [MLlib] Use approxQuantile from ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11553#issuecomment-197140010 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53267/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13600] [MLlib] Use approxQuantile from ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11553#issuecomment-197139782 **[Test build #53267 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53267/consoleFull)** for PR 11553 at commit [`18a3ec6`](https://github.com/apache/spark/commit/18a3ec6b8e9357c0055340cc4a4860c450999366). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13809][SQL] State store for stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11645#issuecomment-197138848 **[Test build #53268 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53268/consoleFull)** for PR 11645 at commit [`8123818`](https://github.com/apache/spark/commit/81238189f560f5df846b36e5cb5e50532f6f82a3). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13922][SQL] Filter rows with null attri...
Github user nongli commented on the pull request: https://github.com/apache/spark/pull/11749#issuecomment-197135193 Can you add some test cases to columnarbatchsuite that exercises this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13852][YARN]handle the InterruptedExcep...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/11692#issuecomment-197135484 the application is finished successfully(RM UI also show success state) but log shows it failed, that's the problem i think. yeah you're right sleep method can throw InterruptedException. this pr is trying to fix the problem we find in RM HA switching. what i am trying to say is that interrupting a monitor thread should not print the failed message in log. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13922][SQL] Filter rows with null attri...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11749#discussion_r56277661 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java --- @@ -58,6 +58,9 @@ // True if the row is filtered. private final boolean[] filteredRows; + // Column indices that cannot have null values. + private final Vector nullFilteredColumns; --- End diff -- change this to a set. I think it's more accurate and makes this easier to use. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13883][SQL] Parquet Implementation...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11709#issuecomment-197134877 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13883][SQL] Parquet Implementation...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11709#issuecomment-197134881 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53262/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13883][SQL] Parquet Implementation...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11709#issuecomment-197134624 **[Test build #53262 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53262/consoleFull)** for PR 11709 at commit [`122f572`](https://github.com/apache/spark/commit/122f572a9bfb7ebdda66fae0fd70d92c9d4d6524). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13922][SQL] Filter rows with null attri...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11749#discussion_r56277607 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java --- @@ -284,11 +287,23 @@ public void reset() { } /** - * Sets the number of rows that are valid. + * Sets the number of rows that are valid. Additionally, marks all rows as "filtered" if one or + * more of their attributes are part of a non-nullable column. */ public void setNumRows(int numRows) { -assert(numRows <= this.capacity); +assert (numRows <= this.capacity); --- End diff -- did you mean to add a space? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13922][SQL] Filter rows with null attri...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11749#discussion_r56277494 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala --- @@ -299,10 +299,112 @@ object ParquetReadBenchmark { } } + def stringWithNullsScanBenchmark(values: Int, fractionOfNulls: Double): Unit = { +withTempPath { dir => + withTempTable("t1", "tempTable") { +sqlContext.range(values).registerTempTable("t1") +sqlContext.sql(s"select IF(rand(1) < $fractionOfNulls, NULL, cast(id as STRING)) as c1, " + + s"IF(rand(2) < $fractionOfNulls, NULL, cast(id as STRING)) as c2 from t1") + .write.parquet(dir.getCanonicalPath) + sqlContext.read.parquet(dir.getCanonicalPath).registerTempTable("tempTable") + +val benchmark = new Benchmark("String with Nulls Scan", values) + +benchmark.addCase("SQL Parquet Vectorized") { iter => + sqlContext.sql("select sum(length(c2)) from tempTable where c1 is " + +"not NULL and c2 is not NULL").collect() +} + +val files = SpecificParquetRecordReaderBase.listDirectory(dir).toArray +benchmark.addCase("PR Vectorized") { num => + var sum = 0 + files.map(_.asInstanceOf[String]).foreach { p => +val reader = new UnsafeRowParquetRecordReader +try { + reader.initialize(p, ("c1" :: "c2" :: Nil).asJava) + val batch = reader.resultBatch() + while (reader.nextBatch()) { +val rowIterator = batch.rowIterator() +while (rowIterator.hasNext) { + val row = rowIterator.next() + val value = row.getUTF8String(0) + if (!row.isNullAt(0) && !row.isNullAt(1)) sum += value.numBytes() +} + } +} finally { + reader.close() +} + } +} + +benchmark.addCase("PR Vectorized (Null Filtering)") { num => + var sum = 0L + files.map(_.asInstanceOf[String]).foreach { p => +val reader = new UnsafeRowParquetRecordReader +try { + reader.initialize(p, ("c1" :: "c2" :: Nil).asJava) + val batch = reader.resultBatch() + batch.filterNullsInColumn(0) + batch.filterNullsInColumn(1) + while (reader.nextBatch()) { +val rowIterator = batch.rowIterator() +while (rowIterator.hasNext) { + sum += rowIterator.next().getUTF8String(0).numBytes() +} + } +} finally { + reader.close() +} + } +} + +/* +=== +Fraction of NULLs: 0 +=== + +Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz +String with Nulls Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + --- +SQL Parquet Vectorized 1164 / 1333 9.0 111.0 1.0X +PR Vectorized 809 / 882 13.0 77.1 1.4X +PR Vectorized (Null Filtering)723 / 800 14.5 69.0 1.6X + +=== +Fraction of NULLs: 0.5 +=== + +Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz +String with Nulls Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + --- +SQL Parquet Vectorized983 / 1001 10.7 93.8 1.0X +PR Vectorized 699 / 728 15.0 66.7 1.4X +PR Vectorized (Null Filtering)722 / 746 14.5 68.9 1.4X + +=== --- End diff -- Instead of commenting it this way, can you put the fraction in the benchmark name? e.g. String with Nulls Scan (95%) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-13917] [SQL] generate broadcast semi jo...
Github user nongli commented on the pull request: https://github.com/apache/spark/pull/11742#issuecomment-197133107 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13917] [SQL] generate broadcast semi jo...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11742#discussion_r56277373 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala --- @@ -322,4 +327,70 @@ case class BroadcastHashJoin( """.stripMargin } } + + /** + * Generates the code for left semi join. + */ + private def codegenSemi(ctx: CodegenContext, input: Seq[ExprCode]): String = { +val (broadcastRelation, relationTerm) = prepareBroadcast(ctx) +val (keyEv, anyNull) = genStreamSideJoinKey(ctx, input) +val matched = ctx.freshName("matched") +val buildVars = genBuildSideVars(ctx, matched) +val numOutput = metricTerm(ctx, "numOutputRows") + +val checkCondition = if (condition.isDefined) { + val expr = condition.get + // evaluate the variables from build side that used by condition + val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references) + // filter the output via condition + ctx.currentVars = input ++ buildVars + val ev = BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).gen(ctx) + s""" + |$eval + |${ev.code} + |if (${ev.isNull} || !${ev.value}) continue; + """.stripMargin +} else { + "" +} + +if (broadcastRelation.value.isInstanceOf[UniqueHashedRelation]) { + s""" + |// generate join key for stream side + |${keyEv.code} + |// find matches from HashedRelation + |UnsafeRow $matched = $anyNull ? null: (UnsafeRow)$relationTerm.getValue(${keyEv.value}); + |if ($matched == null) continue; + |$checkCondition + |$numOutput.add(1); + |${consume(ctx, input)} + """.stripMargin + +} else { + val matches = ctx.freshName("matches") + val bufferType = classOf[CompactBuffer[UnsafeRow]].getName + val i = ctx.freshName("i") + val size = ctx.freshName("size") + val found = ctx.freshName("found") + s""" + |// generate join key for stream side + |${keyEv.code} + |// find matches from HashRelation + |$bufferType $matches = $anyNull ? null : ($bufferType)$relationTerm.get(${keyEv.value}); + |if ($matches == null) continue; + |int $size = $matches.size(); + |boolean $found = false; + |for (int $i = 0; $i < $size; $i++) { + | UnsafeRow $matched = (UnsafeRow) $matches.apply($i); + | $checkCondition + | $found = true; + | break; + |} + |if (!$found) continue; + |$numOutput.add(1); + |${consume(ctx, input)} + """.stripMargin + --- End diff -- and here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13917] [SQL] generate broadcast semi jo...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11742#discussion_r56277324 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala --- @@ -322,4 +327,70 @@ case class BroadcastHashJoin( """.stripMargin } } + + /** + * Generates the code for left semi join. + */ + private def codegenSemi(ctx: CodegenContext, input: Seq[ExprCode]): String = { +val (broadcastRelation, relationTerm) = prepareBroadcast(ctx) +val (keyEv, anyNull) = genStreamSideJoinKey(ctx, input) +val matched = ctx.freshName("matched") +val buildVars = genBuildSideVars(ctx, matched) +val numOutput = metricTerm(ctx, "numOutputRows") + +val checkCondition = if (condition.isDefined) { + val expr = condition.get + // evaluate the variables from build side that used by condition + val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references) + // filter the output via condition + ctx.currentVars = input ++ buildVars + val ev = BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).gen(ctx) + s""" + |$eval + |${ev.code} + |if (${ev.isNull} || !${ev.value}) continue; + """.stripMargin +} else { + "" +} + +if (broadcastRelation.value.isInstanceOf[UniqueHashedRelation]) { + s""" + |// generate join key for stream side + |${keyEv.code} + |// find matches from HashedRelation + |UnsafeRow $matched = $anyNull ? null: (UnsafeRow)$relationTerm.getValue(${keyEv.value}); + |if ($matched == null) continue; + |$checkCondition + |$numOutput.add(1); + |${consume(ctx, input)} + """.stripMargin + --- End diff -- extra new line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13923] [SQL] Implement SessionCatalog
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11750#issuecomment-197131127 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53263/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13923] [SQL] Implement SessionCatalog
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11750#issuecomment-197131125 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13923] [SQL] Implement SessionCatalog
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11750#issuecomment-197131108 **[Test build #53263 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53263/consoleFull)** for PR 11750 at commit [`3b2e48a`](https://github.com/apache/spark/commit/3b2e48a4efa81bb886331ce5818e2d44c7c2f7da). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13889][YARN] Fix integer overflow when ...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/11713#issuecomment-197131091 cc @rxin @JoshRosen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13923] [SQL] Implement SessionCatalog
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11750#discussion_r56276611 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -0,0 +1,484 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.catalog + +import java.util.concurrent.ConcurrentHashMap + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias} + + +/** + * An internal catalog that is used by a Spark Session. This internal catalog serves as a + * proxy to the underlying metastore (e.g. Hive Metastore) and it also manages temporary + * tables and functions of the Spark Session that it belongs to. + */ +class SessionCatalog(externalCatalog: ExternalCatalog) { + import ExternalCatalog._ + + private[this] val tempTables = new ConcurrentHashMap[String, LogicalPlan] + + private[this] val tempFunctions = new ConcurrentHashMap[String, CatalogFunction] + + // + // Databases + // + // All methods in this category interact directly with the underlying catalog. + // + + def createDatabase(dbDefinition: CatalogDatabase, ignoreIfExists: Boolean): Unit = { +externalCatalog.createDatabase(dbDefinition, ignoreIfExists) + } + + def dropDatabase(db: String, ignoreIfNotExists: Boolean, cascade: Boolean): Unit = { +externalCatalog.dropDatabase(db, ignoreIfNotExists, cascade) + } + + def alterDatabase(dbDefinition: CatalogDatabase): Unit = { +externalCatalog.alterDatabase(dbDefinition) + } + + def getDatabase(db: String): CatalogDatabase = { +externalCatalog.getDatabase(db) + } + + def databaseExists(db: String): Boolean = { +externalCatalog.databaseExists(db) + } + + def listDatabases(): Seq[String] = { +externalCatalog.listDatabases() + } + + def listDatabases(pattern: String): Seq[String] = { +externalCatalog.listDatabases(pattern) + } + + // + // Tables + // + // There are two kinds of tables, temporary tables and metastore tables. + // Temporary tables are isolated across sessions and do not belong to any + // particular database. Metastore tables can be used across multiple + // sessions as their metadata is persisted in the underlying catalog. + // + + // + // | Methods that interact with metastore tables only | + // + + /** + * Create a metastore table in the database specified in `tableDefinition`. + * If no such database is specified, create it in the current database. + */ + def createTable( + currentDb: String, --- End diff -- it is somewhat strange to pass in currentDb and then rely on some table definition's database. Have you thought about just figuring out that part in the caller outside the session catalog? i.e. the catalog itself doesn't need to handle currentDb. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at
[GitHub] spark pull request: [SPARK-13852][YARN]handle the InterruptedExcep...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11692#issuecomment-197129360 But from my understanding, this exception does no harm to your application, since your application is about to finish itself, also this may happen occasionally. Also does it relate to RM HA, from my understanding, this `InterruptedException` will be thrown in any case where the code run into `sleep`, no matter in HA or not. Is there any special thing for RM HA? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13894][SQL] SqlContext.range return typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11730#issuecomment-197128907 **[Test build #53266 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53266/consoleFull)** for PR 11730 at commit [`31967ac`](https://github.com/apache/spark/commit/31967ac4a97174c11706930e09a509b08cdc7d09). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13600] [MLlib] Use approxQuantile from ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11553#issuecomment-197128917 **[Test build #53267 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53267/consoleFull)** for PR 11553 at commit [`18a3ec6`](https://github.com/apache/spark/commit/18a3ec6b8e9357c0055340cc4a4860c450999366). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13809][SQL] State store for stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11645#issuecomment-197128812 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MINOR][TEST][SQL] Remove wrong "expected" par...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11718#issuecomment-197128775 **[Test build #2643 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2643/consoleFull)** for PR 11718 at commit [`0bd1bbc`](https://github.com/apache/spark/commit/0bd1bbc9fa267e3c506d1f51495efa58a1b88796). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13809][SQL] State store for stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11645#issuecomment-197128814 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53258/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-13809][SQL] State store for stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11645#issuecomment-197128647 **[Test build #53258 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53258/consoleFull)** for PR 11645 at commit [`76dd988`](https://github.com/apache/spark/commit/76dd988fcac508609be3f32754284bc07524e481). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org