[GitHub] spark issue #20667: [SPARK-23508][CORE] Use timeStampedHashMap for Blockmana...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20667 @cloud-fan I just found the commit log below: `Modified StorageLevel and BlockManagerId to cache common objects and use cached object while deserializing.` I can't figure out why we need the cache, since I think the cache miss rate may be high? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
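The cache that commit log refers to is an interning map: deserialization returns one canonical instance per distinct id, so many equal `BlockManagerId`s arriving in block status messages share a single object. A minimal sketch of the pattern (the `Id` record and method names here are illustrative stand-ins, not Spark's API):

```java
import java.util.concurrent.ConcurrentHashMap;

public class InternDemo {
    // Hypothetical stand-in for BlockManagerId: a small, frequently
    // deserialized value object with value-based equality.
    record Id(String host, int port) {}

    static final ConcurrentHashMap<Id, Id> CACHE = new ConcurrentHashMap<>();

    // Called at deserialization sites: returns the cached canonical
    // instance if one exists, otherwise registers this one.
    static Id intern(Id id) {
        Id prev = CACHE.putIfAbsent(id, id);
        return prev == null ? id : prev;
    }

    public static void main(String[] args) {
        Id a = intern(new Id("host1", 7077));
        Id b = intern(new Id("host1", 7077)); // equal, but a fresh object
        System.out.println(a == b); // true: both refer to one canonical instance
    }
}
```

On the miss-rate question: only the first sighting of each distinct id misses; every later deserialization of the same id is a hit, which is the common case when many blocks live on a small number of executors.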
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20675 Merged build finished. Test FAILed.
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20382 **[Test build #87664 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87664/testReport)** for PR 20382 at commit [`fd890ad`](https://github.com/apache/spark/commit/fd890ad837bb7068c70a27921d67af1c3fe65350). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20675 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87665/ Test FAILed.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20675 **[Test build #87665 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87665/testReport)** for PR 20675 at commit [`21f574e`](https://github.com/apache/spark/commit/21f574e2a3ad3c8e68b92776d2a141d7fcb90502). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20666 **[Test build #87663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87663/testReport)** for PR 20666 at commit [`1d03d3b`](https://github.com/apache/spark/commit/1d03d3b248821a05dfd2751eeb0c8b657ebc9073). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20666 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87663/ Test FAILed.
[GitHub] spark issue #20666: [SPARK-23448][SQL] Clarify JSON and CSV parser behavior ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20666 Merged build finished. Test FAILed.
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20382 Merged build finished. Test FAILed.
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20382 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87664/ Test FAILed.
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/20382 Jenkins, retest this please.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20675 retest this please
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20382 Merged build finished. Test PASSed.
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20382 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1055/ Test PASSed.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20675 Merged build finished. Test PASSed.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20675 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1054/ Test PASSed.
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Use timeStampedHashMap for Bl...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/20667#discussion_r170516775 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala --- @@ -132,10 +133,15 @@ private[spark] object BlockManagerId { getCachedBlockManagerId(obj) } - val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]() + val blockManagerIdCache = new TimeStampedHashMap[BlockManagerId, BlockManagerId](true) - def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId = { + def getCachedBlockManagerId(id: BlockManagerId, clearOldValues: Boolean = false): BlockManagerId = + { blockManagerIdCache.putIfAbsent(id, id) -blockManagerIdCache.get(id) +val blockManagerId = blockManagerIdCache.get(id) +if (clearOldValues) { + blockManagerIdCache.clearOldValues(System.currentTimeMillis - Utils.timeStringAsMs("10d")) --- End diff -- @Ngone51 Thanks. I also thought about removing entries when we delete a block. In this case, it is history replaying that triggers the problem, and we do not actually delete any block. Maybe using a `weakreference` would be better, as @jiangxb1987 mentioned? WDYT? Thanks again!
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20382 **[Test build #87667 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87667/testReport)** for PR 20382 at commit [`fd890ad`](https://github.com/apache/spark/commit/fd890ad837bb7068c70a27921d67af1c3fe65350).
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20675 **[Test build #87666 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87666/testReport)** for PR 20675 at commit [`21f574e`](https://github.com/apache/spark/commit/21f574e2a3ad3c8e68b92776d2a141d7fcb90502).
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r170519540 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/UnsafeMemoryAllocator.java --- @@ -19,15 +19,24 @@ import org.apache.spark.unsafe.Platform; +import java.lang.reflect.InvocationTargetException; +import java.lang.reflect.Method; +import java.nio.ByteBuffer; + +import sun.nio.ch.DirectBuffer; + /** * A simple {@link MemoryAllocator} that uses {@code Unsafe} to allocate off-heap memory. */ public class UnsafeMemoryAllocator implements MemoryAllocator { @Override - public MemoryBlock allocate(long size) throws OutOfMemoryError { + public OffHeapMemoryBlock allocate(long size) throws OutOfMemoryError { +// No usage of DirectByteBuffer.allocateDirect is current design --- End diff -- Since the previous implementation used `DirectByteBuffer.allocateDirect`, I left a reference to record my decision. I will remove this.
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r170519752 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java --- @@ -45,38 +45,135 @@ */ public static final int FREED_IN_ALLOCATOR_PAGE_NUMBER = -3; - private final long length; + @Nullable + protected Object obj; + + protected long offset; + + protected long length; /** * Optional page number; used when this MemoryBlock represents a page allocated by a - * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager, - * which lives in a different package. + * TaskMemoryManager. This field can be updated using setPageNumber method so that + * this can be modified by the TaskMemoryManager, which lives in a different package. */ - public int pageNumber = NO_PAGE_NUMBER; + private int pageNumber = NO_PAGE_NUMBER; public MemoryBlock(@Nullable Object obj, long offset, long length) { -super(obj, offset); +this.obj = obj; +this.offset = offset; this.length = length; } + public MemoryBlock() { +this(null, 0, 0); + } + + public final Object getBaseObject() { +return obj; + } + + public final long getBaseOffset() { +return offset; + } + + public void resetObjAndOffset() { +this.obj = null; +this.offset = 0; + } + /** * Returns the size of the memory block. */ - public long size() { + public final long size() { return length; } - /** - * Creates a memory block pointing to the memory used by the long array. - */ - public static MemoryBlock fromLongArray(final long[] array) { -return new MemoryBlock(array, Platform.LONG_ARRAY_OFFSET, array.length * 8L); + public final void setPageNumber(int pageNum) { +pageNumber = pageNum; + } + + public final int getPageNumber() { +return pageNumber; } /** * Fills the memory block with the specified byte value. */ - public void fill(byte value) { Platform.setMemory(obj, offset, length, value); } + + /** + * Instantiate MemoryBlock for given object type with new offset + */ + public final static MemoryBlock allocateFromObject(Object obj, long offset, long length) { +MemoryBlock mb = null; +if (obj instanceof byte[]) { + byte[] array = (byte[])obj; + mb = new ByteArrayMemoryBlock(array, offset, length); +} else if (obj instanceof long[]) { + long[] array = (long[])obj; + mb = new OnHeapMemoryBlock(array, offset, length); +} else if (obj == null) { + // we assume that to pass null pointer means off-heap + mb = new OffHeapMemoryBlock(offset, length); +} else { + throw new UnsupportedOperationException(obj.getClass() + " is not supported now"); +} +return mb; + } + + /** + * Instantiate the same type of MemoryBlock with new offset and size + */ + public abstract MemoryBlock allocate(long offset, long size); + + + public abstract int getInt(long offset); + + public abstract void putInt(long offset, int value); + + public abstract boolean getBoolean(long offset); + + public abstract void putBoolean(long offset, boolean value); + + public abstract byte getByte(long offset); + + public abstract void putByte(long offset, byte value); + + public abstract short getShort(long offset); + + public abstract void putShort(long offset, short value); + + public abstract long getLong(long offset); + + public abstract void putLong(long offset, long value); + + public abstract float getFloat(long offset); + + public abstract void putFloat(long offset, float value); + + public abstract double getDouble(long offset); + + public abstract void putDouble(long offset, double value); + + public static final void copyMemory( + MemoryBlock src, long srcOffset, MemoryBlock dst, long dstOffset, long length) { +Platform.copyMemory(src.getBaseObject(), src.getBaseOffset() + srcOffset, +dst.getBaseObject(), dst.getBaseOffset() + dstOffset, length); + } + + public static final void copyMemory(MemoryBlock src, MemoryBlock dst, long length) { +Platform.copyMemory(src.getBaseObject(), src.getBaseOffset(), + dst.getBaseObject(), dst.getBaseOffset(), length); + } + + public final void copyFrom(Object src, long srcOffset, long dstOffset, long length)
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r170520696 --- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java --- @@ -50,12 +52,11 @@ // These are only updated by readExternal() or read() @Nonnull - private Object base; - private long offset; + private MemoryBlock base; private int numBytes; --- End diff -- Yeah, since I am adopting `MemoryBlock` incrementally, some fields still remain duplicated. I think that `UTF8String.length` can be projected onto `MemoryBlock.length`, since `UTF8String` seems immutable.
[GitHub] spark pull request #20667: [SPARK-23508][CORE] Use timeStampedHashMap for Bl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20667#discussion_r170520957 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala --- @@ -132,10 +133,15 @@ private[spark] object BlockManagerId { getCachedBlockManagerId(obj) } - val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]() + val blockManagerIdCache = new TimeStampedHashMap[BlockManagerId, BlockManagerId](true) - def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId = { + def getCachedBlockManagerId(id: BlockManagerId, clearOldValues: Boolean = false): BlockManagerId = + { blockManagerIdCache.putIfAbsent(id, id) -blockManagerIdCache.get(id) +val blockManagerId = blockManagerIdCache.get(id) +if (clearOldValues) { + blockManagerIdCache.clearOldValues(System.currentTimeMillis - Utils.timeStringAsMs("10d")) --- End diff -- Can we simply use `com.google.common.cache.Cache`, which has a size limit, so we don't need to worry about OOM?
[GitHub] spark issue #20667: [SPARK-23508][CORE] Use softreference for BlockmanagerId...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20667 Had an offline chat with @cloud-fan and we feel `com.google.common.cache.Cache` should be used here. You can find an example at `CodeGenerator.cache`.
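The `CodeGenerator.cache` example uses guava's `CacheBuilder.newBuilder().maximumSize(...)`, which bounds the map by evicting old entries. A stdlib-only sketch of that bounded-cache idea (not guava's implementation, and without guava's concurrency features), using an access-order `LinkedHashMap` as an LRU:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A tiny LRU cache capped at maxSize entries: once the cap is exceeded,
// the least-recently-used entry is evicted, so the cache can never grow
// without bound the way an uncapped ConcurrentHashMap can.
public class BoundedCache<K, V> {
    private final int maxSize;
    private final Map<K, V> map;

    public BoundedCache(int maxSize) {
        this.maxSize = maxSize;
        this.map = new LinkedHashMap<>(16, 0.75f, /* accessOrder = */ true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > BoundedCache.this.maxSize; // evict LRU entry
            }
        };
    }

    public synchronized V getOrPut(K key, V value) {
        V existing = map.get(key);
        if (existing != null) return existing; // cache hit: reuse instance
        map.put(key, value);
        return value;
    }

    public synchronized int size() { return map.size(); }

    public static void main(String[] args) {
        BoundedCache<Integer, String> c = new BoundedCache<>(2);
        c.getOrPut(1, "a"); c.getOrPut(2, "b"); c.getOrPut(3, "c");
        System.out.println(c.size()); // prints 2: oldest entry was evicted
    }
}
```

Guava's `Cache` additionally offers time-based expiry (`expireAfterAccess`), which covers the history-replay case without a manual `clearOldValues` call.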
[GitHub] spark issue #20667: [SPARK-23508][CORE] Use softreference for BlockmanagerId...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20667 Nice @jiangxb1987 @cloud-fan I will modify it later. Thanks!
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20611 **[Test build #87668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87668/testReport)** for PR 20611 at commit [`5e9b6a5`](https://github.com/apache/spark/commit/5e9b6a5cb9568b49cef9508e0b2dc216e0d6c1ef).
[GitHub] spark pull request #20670: [SPARK-23405] Add constranits
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20670#discussion_r170529019 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala --- @@ -29,12 +29,26 @@ trait QueryPlanConstraints { self: LogicalPlan => */ lazy val constraints: ExpressionSet = { if (conf.constraintPropagationEnabled) { + var relevantOutPutSet: AttributeSet = outputSet + constraints.foreach { +case eq @ EqualTo(l: Attribute, r: Attribute) => + if (l.references.subsetOf(relevantOutPutSet) --- End diff -- You can avoid computing each `subsetOf` twice here.
[GitHub] spark pull request #20670: [SPARK-23405] Add constranits
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20670#discussion_r170528989 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala --- @@ -29,12 +29,26 @@ trait QueryPlanConstraints { self: LogicalPlan => */ lazy val constraints: ExpressionSet = { if (conf.constraintPropagationEnabled) { + var relevantOutPutSet: AttributeSet = outputSet + constraints.foreach { +case eq @ EqualTo(l: Attribute, r: Attribute) => --- End diff -- `eq` isn't used
[GitHub] spark pull request #20662: [SPARK-23475][UI][BACKPORT-2.3] Show also skipped...
Github user mgaido91 closed the pull request at: https://github.com/apache/spark/pull/20662
[GitHub] spark pull request #20670: [SPARK-23405] Add constranits
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20670#discussion_r170529062 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala --- @@ -29,12 +29,26 @@ trait QueryPlanConstraints { self: LogicalPlan => */ lazy val constraints: ExpressionSet = { if (conf.constraintPropagationEnabled) { + var relevantOutPutSet: AttributeSet = outputSet + constraints.foreach { +case eq @ EqualTo(l: Attribute, r: Attribute) => + if (l.references.subsetOf(relevantOutPutSet) +&& !r.references.subsetOf(relevantOutPutSet)) { +relevantOutPutSet = relevantOutPutSet.++(r.references) --- End diff -- Use ` ++ ` syntax, rather than writing it as a method invocation.
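Putting the review comments together, a sketch of the reworked loop, with Spark's `Attribute`/`AttributeSet` simplified to strings and string sets (in the actual Scala code the last fix would be spelled as infix `relevantOutPutSet ++ r.references`, and the unused `eq @` binding dropped):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ConstraintDemo {
    // Stand-in for catalyst's EqualTo(l: Attribute, r: Attribute).
    record EqualTo(String left, String right) {}

    // Grows the relevant attribute set from equality constraints,
    // testing membership once per side instead of twice.
    static Set<String> expand(Set<String> outputSet, List<EqualTo> constraints) {
        Set<String> relevant = new HashSet<>(outputSet);
        for (EqualTo c : constraints) {
            boolean leftIn = relevant.contains(c.left());   // computed once
            boolean rightIn = relevant.contains(c.right()); // computed once
            if (leftIn && !rightIn) relevant.add(c.right());
            else if (rightIn && !leftIn) relevant.add(c.left());
        }
        return relevant;
    }

    public static void main(String[] args) {
        Set<String> out = expand(Set.of("a"), List.of(new EqualTo("a", "b")));
        System.out.println(out.contains("b")); // true: "b" pulled in via a = b
    }
}
```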
[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18555 Can one of the admins verify this patch?
[GitHub] spark pull request #20663: [SPARK-23501][UI] Refactor AllStagesPage in order...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20663#discussion_r170532005 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/AllStagesPage.scala --- @@ -143,76 +72,105 @@ private[ui] class AllStagesPage(parent: StagesTab) extends WebUIPage("") { Seq.empty[Node] } } -if (shouldShowActiveStages) { - content ++= - - - -Active Stages ({activeStages.size}) - - ++ - - {activeStagesTable.toNodeSeq} - -} -if (shouldShowPendingStages) { - content ++= - - - -Pending Stages ({pendingStages.size}) - - ++ - - {pendingStagesTable.toNodeSeq} - + +tables.flatten.foreach(content ++= _) + +UIUtils.headerSparkPage("Stages for All Jobs", content, parent) + } + + def summaryAndTableForStatus( + status: StageStatus, + request: HttpServletRequest): (Option[Elem], Option[NodeSeq]) = { +val stages = if (status == StageStatus.FAILED) { + allStages.filter(_.status == status).reverse +} else { + allStages.filter(_.status == status) } -if (shouldShowCompletedStages) { - content ++= - - - -Completed Stages ({completedStageNumStr}) - - ++ - - {completedStagesTable.toNodeSeq} - + +if (stages.isEmpty) { + (None, None) +} else { + val killEnabled = status == StageStatus.ACTIVE && parent.killEnabled + val isFailedStage = status == StageStatus.FAILED + + val stagesTable = +new StageTableBase(parent.store, request, stages, tableHeaderID(status), stageTag(status), + parent.basePath, subPath, parent.isFairScheduler, killEnabled, isFailedStage) + val stagesSize = stages.size + (Some(summary(status, stagesSize)), Some(table(status, stagesTable, stagesSize))) } -if (shouldShowSkippedStages) { - content ++= - - - -Skipped Stages ({skippedStages.size}) - - ++ - - {skippedStagesTable.toNodeSeq} - + } + + private def tableHeaderID(status: StageStatus): String = status match { +case StageStatus.ACTIVE => "active" +case StageStatus.COMPLETE => "completed" +case StageStatus.FAILED => "failed" +case StageStatus.PENDING => "pending" +case StageStatus.SKIPPED => "skipped" + } + + private def stageTag(status: StageStatus): String = status match { +case StageStatus.ACTIVE => "activeStage" +case StageStatus.COMPLETE => "completedStage" +case StageStatus.FAILED => "failedStage" +case StageStatus.PENDING => "pendingStage" +case StageStatus.SKIPPED => "skippedStage" + } + + private def headerDescription(status: StageStatus): String = status match { +case StageStatus.ACTIVE => "Active" +case StageStatus.COMPLETE => "Completed" +case StageStatus.FAILED => "Failed" +case StageStatus.PENDING => "Pending" +case StageStatus.SKIPPED => "Skipped" + } + + private def classSuffix(status: StageStatus): String = status match { +case StageStatus.ACTIVE => "ActiveStages" +case StageStatus.COMPLETE => "CompletedStages" +case StageStatus.FAILED => "FailedStages" +case StageStatus.PENDING => "PendingStages" +case StageStatus.SKIPPED => "SkippedStages" + } + + private def summaryContent(status: StageStatus, size: Int): String = { +if (status == StageStatus.COMPLETE +&& appSummary.numCompletedStages != size) { + s"${appSummary.numCompletedStages}, only showing $size" +} else { + s"$size" } -if (shouldShowFailedStages) { - content ++= - - - -Failed Stages ({numFailedStages}) - - ++ - - {failedStagesTable.toNodeSeq} - + } + + private def summary(status: StageStatus, size: Int): Elem = { +val summary = + + + {headerDescription(status)} Stages: + +{summaryContent(status, size)} + + +if (status == StageStatus.COMPLETE) { + summary % Attribute(None, "id", Text("completed-summary"), Null) --- End diff -- yes, I realized it while doing the refactor. It was a copy-and-paste mistake.
[GitHub] spark pull request #20663: [SPARK-23501][UI] Refactor AllStagesPage in order...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20663#discussion_r170532241 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/AllStagesPage.scala --- @@ -143,76 +72,105 @@ private[ui] class AllStagesPage(parent: StagesTab) extends WebUIPage("") { Seq.empty[Node] } } -if (shouldShowActiveStages) { - content ++= - - - -Active Stages ({activeStages.size}) - - ++ - - {activeStagesTable.toNodeSeq} - -} -if (shouldShowPendingStages) { - content ++= - - - -Pending Stages ({pendingStages.size}) - - ++ - - {pendingStagesTable.toNodeSeq} - + +tables.flatten.foreach(content ++= _) --- End diff -- that won't really work, since `tables.flatten` is a `Seq[NodeSeq]`. But I'll try and do something similar.
[GitHub] spark issue #20667: [SPARK-23508][CORE] Fix BlockmanagerId in case blockMana...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/20667 Update @jiangxb1987
[GitHub] spark pull request #20673: [SPARK-23515] Use input/output streams for large ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20673#discussion_r170537137 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -100,7 +102,18 @@ private[spark] object JsonProtocol { executorMetricsUpdateToJson(metricsUpdate) case blockUpdate: SparkListenerBlockUpdated => blockUpdateToJson(blockUpdate) - case _ => parse(mapper.writeValueAsString(event)) + case _ => +// Use piped streams to avoid extra memory consumption +val outputStream = new PipedOutputStream() +val inputStream = new PipedInputStream(outputStream) +try { + mapper.writeValue(outputStream, event) --- End diff -- Wait wait .. does this lazily work for sure? Can we add a test (or a manual test in the PR description) that reads some more data (maybe more than the buffer size in that pipe)?
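The buffer-size concern can be probed directly: a pipe has a small fixed internal buffer (1024 bytes below), so a `write` larger than that blocks until a reader drains the pipe. That is why the writer must run on its own thread; a single thread that writes everything before reading, as in the quoted diff, deadlocks once the buffer fills. A self-contained sketch of that behavior, not the PR's code:

```java
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class PipeDemo {
    // Pushes 10 KiB through a 1 KiB pipe: the writer thread blocks each
    // time the buffer is full and resumes as the reader drains it.
    static int pump() throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out, 1024); // pipe buffer size

        byte[] payload = new byte[10 * 1024]; // well beyond the pipe buffer

        Thread writer = new Thread(() -> {
            try (out) {
                out.write(payload); // blocks whenever the buffer is full
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        int total = 0;
        byte[] buf = new byte[256];
        int n;
        while ((n = in.read(buf)) != -1) { // EOF once the writer closes
            total += n;
        }
        writer.join();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pump()); // prints 10240: every byte arrived
    }
}
```

Moving the `writer.start()` call before the reads (rather than writing and reading on one thread) is the essential design choice here.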
[GitHub] spark pull request #20676: [SPARK-23516][CORE] It is unnecessary to transfer...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/20676 [SPARK-23516][CORE] It is unnecessary to transfer unroll memory to storage memory ## What changes were proposed in this pull request? In fact, unroll memory is also storage memory, so I think it is unnecessary to actually release unroll memory and then acquire storage memory again. ## How was this patch tested? Existing unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/10110346/spark notreleasemem Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20676.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20676 commit 0574be52a46ff29253c432f1c71ded041d991351 Author: liuxian Date: 2018-02-26T09:38:30Z fix
[GitHub] spark pull request #19033: [SPARK-21811][SQL]Inconsistency when finding the ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19033#discussion_r170547020 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala --- @@ -150,9 +150,27 @@ object TypeCoercion { } private def findWiderCommonType(types: Seq[DataType]): Option[DataType] = { +var awaitingString = false --- End diff -- Maybe we should make a pass over all dataTypes here, and see whether there exists a `StringType` and all other dataTypes can be promoted to `StringType`; in that case we can return `StringType` by default instead of `None`, WDYT?
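A toy model of the proposed single pass, with a stand-in enum instead of catalyst's `DataType` (which types promote to `StringType` is an assumption made for the sketch, not the real coercion rules):

```java
import java.util.List;
import java.util.Optional;

public class WidenDemo {
    enum DataType { STRING, INT, BINARY }

    // Assumption for this sketch: INT promotes to STRING, BINARY does not.
    static boolean promotableToString(DataType t) {
        return t == DataType.STRING || t == DataType.INT;
    }

    // One pass over the input types: if StringType occurs and every other
    // type can be promoted to it, default to StringType instead of giving up.
    static Optional<DataType> findWiderCommonType(List<DataType> types) {
        if (types.contains(DataType.STRING)
                && types.stream().allMatch(WidenDemo::promotableToString)) {
            return Optional.of(DataType.STRING);
        }
        if (types.stream().distinct().count() == 1) {
            return Optional.of(types.get(0));
        }
        return Optional.empty(); // the real rule tries other promotions first
    }

    public static void main(String[] args) {
        // STRING wins when everything promotes to it...
        System.out.println(findWiderCommonType(List.of(DataType.INT, DataType.STRING)));
        // ...but not when some type (BINARY here) cannot be promoted.
        System.out.println(findWiderCommonType(List.of(DataType.BINARY, DataType.STRING)));
    }
}
```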
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20382 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87667/ Test PASSed.
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20382 **[Test build #87667 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87667/testReport)** for PR 20382 at commit [`fd890ad`](https://github.com/apache/spark/commit/fd890ad837bb7068c70a27921d67af1c3fe65350). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20382: [SPARK-23097][SQL][SS] Migrate text socket source to V2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20382 Merged build finished. Test PASSed.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20675 **[Test build #87666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87666/testReport)** for PR 20675 at commit [`21f574e`](https://github.com/apache/spark/commit/21f574e2a3ad3c8e68b92776d2a141d7fcb90502). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20675 Merged build finished. Test PASSed.
[GitHub] spark issue #20675: [SPARK-23033][SS][Follow Up] Task level retry for contin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20675 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87666/ Test PASSed.
[GitHub] spark issue #20670: [SPARK-23405] Add constranits
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20670

Good catch! This is a real problem, but the fix looks hacky. By definition, I think `plan.constraints` should only include constraints that refer to `plan.output`, as that's the promise a plan can make to its parent. However, join is special: `Join.condition` can refer to both join sides, and we add the constraints to `Join.condition`, which means we are in a sense making a promise to Join's children, not its parent. My proposal:
```
lazy val constraints: ExpressionSet = {
  if (conf.constraintPropagationEnabled) {
    allConstraints.filter { c =>
      c.references.nonEmpty && c.references.subsetOf(outputSet) && c.deterministic
    }
  } else {
    ExpressionSet(Set.empty)
  }
}

lazy val allConstraints = ExpressionSet(validConstraints
  .union(inferAdditionalConstraints(validConstraints))
  .union(constructIsNotNullConstraints(validConstraints)))
```
Then we can call `plan.allConstraints` when inferring constraints for join.
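The split proposed above can be sketched with a simplified, Spark-free model. The `Plan` class below and the sets-of-attributes standing in for constraints are illustrative, not the actual Catalyst types:

```python
# Hypothetical, simplified model of the proposal: all_constraints keeps every
# inferred predicate, while constraints exposes only those whose references
# fall within the plan's own output (the promise made to the parent).
class Plan:
    def __init__(self, output_set, valid_constraints):
        self.output_set = set(output_set)
        # Each constraint is modeled as a frozenset of referenced attributes.
        self.all_constraints = set(valid_constraints)

    @property
    def constraints(self):
        # Only non-empty constraints fully covered by this plan's output.
        return {c for c in self.all_constraints
                if c and c <= self.output_set}

# A join-like node outputs only 'a' but still knows a predicate over the
# child attribute 'b'; that extra fact stays visible via all_constraints.
plan = Plan({"a"}, [frozenset({"a"}), frozenset({"b"})])
print(len(plan.constraints))       # 1: only the constraint over 'a'
print(len(plan.all_constraints))   # 2: includes the child-side constraint
```

The point of the split is that `all_constraints` remains available for join inference while `constraints` stays a safe promise to the parent.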
[GitHub] spark pull request #20624: [SPARK-23445] ColumnStat refactoring
Github user juliuszsompolski commented on a diff in the pull request: https://github.com/apache/spark/pull/20624#discussion_r170567874

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/StarSchemaDetection.scala ---
@@ -187,11 +187,11 @@ object StarSchemaDetection extends PredicateHelper {
       stats.rowCount match {
         case Some(rowCount) if rowCount >= 0 =>
           if (stats.attributeStats.nonEmpty && stats.attributeStats.contains(col)) {
-            val colStats = stats.attributeStats.get(col)
-            if (colStats.get.nullCount > 0) {
+            val colStats = stats.attributeStats.get(col).get
+            if (!colStats.hasCountStats || colStats.nullCount.get > 0) {
--- End diff --

`hasCountStats == distinctCount.isDefined && nullCount.isDefined`, so if evaluation reaches the second operand of the `||`, then `hasCountStats == true`, which implies `nullCount.isDefined`.
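The short-circuit argument above can be demonstrated with a small stand-alone model (a hypothetical class, not the actual `ColumnStat`):

```python
# Simplified model of the check discussed above: has_count_stats mirrors
# `hasCountStats == distinctCount.isDefined && nullCount.isDefined`, so the
# `or` only evaluates null_count when both stats are set.
class ColumnStat:
    def __init__(self, distinct_count=None, null_count=None):
        self.distinct_count = distinct_count
        self.null_count = null_count

    @property
    def has_count_stats(self):
        return self.distinct_count is not None and self.null_count is not None

def may_have_nulls(col_stats):
    # Short-circuit: the right operand runs only when has_count_stats is
    # True, which guarantees null_count is not None (no unsafe access).
    return not col_stats.has_count_stats or col_stats.null_count > 0

print(may_have_nulls(ColumnStat()))       # True: no stats, assume nulls
print(may_have_nulls(ColumnStat(10, 0)))  # False: stats say zero nulls
print(may_have_nulls(ColumnStat(10, 3)))  # True: stats report nulls
```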
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20611

**[Test build #87668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87668/testReport)** for PR 20611 at commit [`5e9b6a5`](https://github.com/apache/spark/commit/5e9b6a5cb9568b49cef9508e0b2dc216e0d6c1ef).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20611 Merged build finished. Test PASSed.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20611 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87668/ Test PASSed.
[GitHub] spark issue #20624: [SPARK-23445] ColumnStat refactoring
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20624 **[Test build #87671 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87671/testReport)** for PR 20624 at commit [`a006bab`](https://github.com/apache/spark/commit/a006bab64420b24d9aff242902d0145892879691).
[GitHub] spark issue #20676: [SPARK-23516][CORE] It is unnecessary to transfer unroll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20676 Merged build finished. Test PASSed.
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20663 **[Test build #87670 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87670/testReport)** for PR 20663 at commit [`9a8e032`](https://github.com/apache/spark/commit/9a8e032f6d46cb181e8b775812acdda8805888a8).
[GitHub] spark issue #20676: [SPARK-23516][CORE] It is unnecessary to transfer unroll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20676 **[Test build #87669 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87669/testReport)** for PR 20676 at commit [`0574be5`](https://github.com/apache/spark/commit/0574be52a46ff29253c432f1c71ded041d991351).
[GitHub] spark issue #20676: [SPARK-23516][CORE] It is unnecessary to transfer unroll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20676 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1056/ Test PASSed.
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20663 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1057/ Test PASSed.
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20663 Merged build finished. Test PASSed.
[GitHub] spark issue #20624: [SPARK-23445] ColumnStat refactoring
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20624 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1058/ Test PASSed.
[GitHub] spark issue #20624: [SPARK-23445] ColumnStat refactoring
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20624 Merged build finished. Test PASSed.
[GitHub] spark issue #20449: [SPARK-23040][CORE]: Returns interruptible iterator for ...
Github user advancedxy commented on the issue: https://github.com/apache/spark/pull/20449 @cloud-fan I have updated the comments and fixed the style issues (they were previously auto-formatted by IntelliJ).
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user sujith71955 commented on the issue: https://github.com/apache/spark/pull/20611 @gatorsmile The build passed. Please review and let me know if you have any suggestions. Thanks
[GitHub] spark issue #20670: [SPARK-23405] Add constranits
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20670 I agree with what @cloud-fan proposed: have constraints for both a plan and its children. However, that requires a relatively wider change as well as a fine set of test cases, so please don't hesitate to ask for help if you run into any issues while working on this.
[GitHub] spark issue #20670: [SPARK-23405] Add constranits
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20670

Also, a better title for this PR would be:
```
Generate additional constraints for Join's children
```
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20611 QQ: how does Hive behave on the same or a similar SQL command?
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20663 @gengliangwang Mind taking a look?
[GitHub] spark pull request #20677: Event time can't be greater then processing time....
GitHub user deil87 opened a pull request: https://github.com/apache/spark/pull/20677

Event time can't be greater than processing time. 12:21, owl. Mistake on image.

## What changes were proposed in this pull request?

There is an error in the image: the 12:21 point for the owl can't have a 12:19 processing time. Consequently, the new watermark should be computed from other data (12:15, cat). The donkey will still be rejected, I suppose, but I'm not sure whether those watermarks are inclusive or exclusive. If the watermark is set to 12:05, then I guess the intermediate state for the window 11:55 (inclusive) to 12:05 (exclusive) could be cleared, and there is no place for the donkey anymore. You need to change the image, and depending on how you change it you may need to adjust the text.

## How was this patch tested?

No tests for docs.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/deil87/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20677.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20677

commit 1bf1017db1e484d0ad85c14b166961ba35b08cc8
Author: Spiridonov Andrey
Date: 2018-02-26T14:30:45Z

    Event time can't be greater than processing time. 12:21, owl. Mistake on image.
[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...
Github user ala commented on a diff in the pull request: https://github.com/apache/spark/pull/20664#discussion_r170611946

--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }.collect()
   }
 
+  test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {
+    val numInputPartitions = 100
+    val numCoalescedPartitions = 50
+    val locations = Array("locA", "locB")
+
+    val inputRDD = sc.makeRDD(Range(0, numInputPartitions).toArray[Int], numInputPartitions)
+    assert(inputRDD.getNumPartitions == numInputPartitions)
+
+    val locationPrefRDD = new LocationPrefRDD(inputRDD, { (p: Partition) =>
+      if (p.index < numCoalescedPartitions) {
+        Seq(locations(0))
+      } else {
+        Seq(locations(1))
+      }
+    })
+    val coalescedRDD = new CoalescedRDD(locationPrefRDD, numCoalescedPartitions)
+
+    val numPartsPerLocation = coalescedRDD
+      .getPartitions
+      .map(coalescedRDD.getPreferredLocations(_).head)
+      .groupBy(identity)
+      .mapValues(_.size)
+
+    // Without the fix these would be:
+    // numPartsPerLocation(locations(0)) == numCoalescedPartitions - 1
+    // numPartsPerLocation(locations(1)) == 1
+    assert(numPartsPerLocation(locations(0)) > 0.4 * numCoalescedPartitions)
--- End diff --

Added a comment about flakiness & fixed the seed.
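The balance check the test above performs can be modeled without Spark. The location names below are illustrative, and the 0.4 threshold mirrors the assertion in the diff:

```python
from collections import Counter

# Spark-free sketch of what the test measures: given the preferred location
# of each coalesced partition, count partitions per location and require
# that no location holds less than 40% of the partitions.
def parts_per_location(preferred_locations):
    return Counter(preferred_locations)

NUM_COALESCED = 50

# A balanced assignment across two hypothetical hosts passes the check.
balanced = ["locA", "locB"] * (NUM_COALESCED // 2)
counts = parts_per_location(balanced)
assert counts["locA"] > 0.4 * NUM_COALESCED
assert counts["locB"] > 0.4 * NUM_COALESCED

# The skewed outcome the fix prevents (49 vs. 1) fails the same check.
skewed = ["locA"] * (NUM_COALESCED - 1) + ["locB"]
counts = parts_per_location(skewed)
assert not (counts["locB"] > 0.4 * NUM_COALESCED)
```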
[GitHub] spark issue #20677: Event time can't be greater then processing time. 12:21,...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20677 Can one of the admins verify this patch?
[GitHub] spark pull request #20664: [SPARK-23496][CORE] Locality of coalesced partiti...
Github user ala commented on a diff in the pull request: https://github.com/apache/spark/pull/20664#discussion_r170612006

--- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
@@ -1129,6 +1129,36 @@ class RDDSuite extends SparkFunSuite with SharedSparkContext {
     }.collect()
   }
 
+  test("SPARK-23496: order of input partitions can result in severe skew in coalesce") {
+    val numInputPartitions = 100
+    val numCoalescedPartitions = 50
+    val locations = Array("locA", "locB")
+
+    val inputRDD = sc.makeRDD(Range(0, numInputPartitions).toArray[Int], numInputPartitions)
+    assert(inputRDD.getNumPartitions == numInputPartitions)
+
+    val locationPrefRDD = new LocationPrefRDD(inputRDD, { (p: Partition) =>
+      if (p.index < numCoalescedPartitions) {
+        Seq(locations(0))
+      } else {
+        Seq(locations(1))
+      }
+    })
+    val coalescedRDD = new CoalescedRDD(locationPrefRDD, numCoalescedPartitions)
+
+    val numPartsPerLocation = coalescedRDD
+      .getPartitions
+      .map(coalescedRDD.getPreferredLocations(_).head)
+      .groupBy(identity)
+      .mapValues(_.size)
+
+    // Without the fix these would be:
--- End diff --

Done.
[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20664 **[Test build #87672 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87672/testReport)** for PR 20664 at commit [`0512736`](https://github.com/apache/spark/commit/051273651cd65b9eca568b37c79b50342a7f69c2).
[GitHub] spark issue #20677: Event time can't be greater then processing time. 12:21,...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20677 Can one of the admins verify this patch?
[GitHub] spark issue #20676: [SPARK-23516][CORE] It is unnecessary to transfer unroll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20676

**[Test build #87669 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87669/testReport)** for PR 20676 at commit [`0574be5`](https://github.com/apache/spark/commit/0574be52a46ff29253c432f1c71ded041d991351).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20664 Merged build finished. Test PASSed.
[GitHub] spark issue #20676: [SPARK-23516][CORE] It is unnecessary to transfer unroll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20676 Merged build finished. Test FAILed.
[GitHub] spark issue #20676: [SPARK-23516][CORE] It is unnecessary to transfer unroll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20676 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87669/ Test FAILed.
[GitHub] spark issue #20664: [SPARK-23496][CORE] Locality of coalesced partitions can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20664 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1059/ Test PASSed.
[GitHub] spark issue #20611: [SPARK-23425][SQL]When wild card is been used in load co...
Github user sujith71955 commented on the issue: https://github.com/apache/spark/pull/20611 @jiangxb1987 Yes, Hive supports this way of loading data. Suppose we have many files that start with the same naming convention; a user will prefer to use such queries, where they can take advantage of wild cards, e.g. `load data inpath 'hdfs://hacluster/user/ext*' into table t1`
[GitHub] spark pull request #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallb...
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/20678

[SPARK-23380][PYTHON] Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame

## What changes were proposed in this pull request?

This PR adds a configuration to control the fallback of Arrow optimization for `toPandas` and `createDataFrame` with Pandas DataFrame.

## How was this patch tested?

Manually tested and unit tests added. You can test this by:

**`createDataFrame`**

```python
spark.conf.set("spark.sql.execution.arrow.enabled", False)
pdf = spark.createDataFrame([[{'a': 1}]]).toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", True)
spark.createDataFrame(pdf, "a: map")
```

```python
spark.conf.set("spark.sql.execution.arrow.enabled", False)
pdf = spark.createDataFrame([[{'a': 1}]]).toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", False)
spark.createDataFrame(pdf, "a: map")
```

**`toPandas`**

```python
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", True)
spark.createDataFrame([[{'a': 1}]]).toPandas()
```

```python
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", False)
spark.createDataFrame([[{'a': 1}]]).toPandas()
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-23380-conf

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20678.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20678

commit ff9d38b691bd54c073db3b55983564cfbb0d903e
Author: hyukjinkwon
Date: 2018-02-26T15:02:55Z

    Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fallback to n...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/20567
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 I just opened another PR for adding a configuration - https://github.com/apache/spark/pull/20678. Let me close this one.
[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19222#discussion_r170622554

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/UnsafeMemoryAllocator.java ---
@@ -19,15 +19,24 @@
 
 import org.apache.spark.unsafe.Platform;
 
+import java.lang.reflect.InvocationTargetException;
+import java.lang.reflect.Method;
+import java.nio.ByteBuffer;
+
+import sun.nio.ch.DirectBuffer;
+
 /**
  * A simple {@link MemoryAllocator} that uses {@code Unsafe} to allocate off-heap memory.
  */
 public class UnsafeMemoryAllocator implements MemoryAllocator {
 
   @Override
-  public MemoryBlock allocate(long size) throws OutOfMemoryError {
+  public OffHeapMemoryBlock allocate(long size) throws OutOfMemoryError {
+    // No usage of DirectByteBuffer.allocateDirect is current design
--- End diff --

Let me put this comment here:
```
No usage of DirectByteBuffer.allocateDirect is current design
Platform.allocateMemory is used here.
http://downloads.typesafe.com/website/presentations/ScalaDaysSF2015/T4_Xin_Performance_Optimization.pdf#page=26
```
[GitHub] spark issue #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallback in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20678 Merged build finished. Test PASSed.
[GitHub] spark issue #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallback in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20678 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1060/ Test PASSed.
[GitHub] spark issue #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallback in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20678 **[Test build #87673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87673/testReport)** for PR 20678 at commit [`ff9d38b`](https://github.com/apache/spark/commit/ff9d38b691bd54c073db3b55983564cfbb0d903e).
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20663

**[Test build #87670 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87670/testReport)** for PR 20663 at commit [`9a8e032`](https://github.com/apache/spark/commit/9a8e032f6d46cb181e8b775812acdda8805888a8).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallb...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20678#discussion_r170623237

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1986,55 +1986,89 @@ def toPandas(self):
                 timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            _should_fallback = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps, to_arrow_schema
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
+
                 require_minimum_pyarrow_version()
-                import pyarrow
                 to_arrow_schema(self.schema)
-                tables = self._collectAsArrow()
-                if tables:
-                    table = pyarrow.concat_tables(tables)
-                    pdf = table.to_pandas()
-                    pdf = _check_dataframe_convert_date(pdf, self.schema)
-                    return _check_dataframe_localize_timestamps(pdf, timezone)
-                else:
-                    return pd.DataFrame.from_records([], columns=self.columns)
             except Exception as e:
-                msg = (
-                    "Note: toPandas attempted Arrow optimization because "
-                    "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
-                    "to disable this.")
-                raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
-        else:
-            pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
-            dtype = {}
+                if self.sql_ctx.getConf("spark.sql.execution.arrow.fallback.enabled", "false") \
+                        .lower() == "true":
+                    msg = (
+                        "toPandas attempted Arrow optimization because "
+                        "'spark.sql.execution.arrow.enabled' is set to true; however, "
+                        "failed by the reason below:\n"
+                        "  %s\n"
+                        "Attempts non-optimization as "
+                        "'spark.sql.execution.arrow.fallback.enabled' is set to "
+                        "true." % _exception_message(e))
+                    warnings.warn(msg)
+                    _should_fallback = True
+                else:
+                    msg = (
+                        "toPandas attempted Arrow optimization because "
+                        "'spark.sql.execution.arrow.enabled' is set to true; however, "
+                        "failed by the reason below:\n"
+                        "  %s\n"
+                        "For fallback to non-optimization automatically, please set true to "
+                        "'spark.sql.execution.arrow.fallback.enabled'." % _exception_message(e))
+                    raise RuntimeError(msg)
+
+            if not _should_fallback:
+                try:
+                    from pyspark.sql.types import _check_dataframe_convert_date, \
+                        _check_dataframe_localize_timestamps
+                    import pyarrow
+
+                    tables = self._collectAsArrow()
+                    if tables:
+                        table = pyarrow.concat_tables(tables)
+                        pdf = table.to_pandas()
+                        pdf = _check_dataframe_convert_date(pdf, self.schema)
+                        return _check_dataframe_localize_timestamps(pdf, timezone)
+                    else:
+                        return pd.DataFrame.from_records([], columns=self.columns)
+                except Exception as e:
+                    # We might have to allow fallback here as well but multiple Spark jobs can
+                    # be executed. So, simply fail in this case for now.
+                    msg = (
+                        "toPandas attempted Arrow optimization because "
+                        "'spark.sql.execution.arrow.enabled' is set to true; however, "
+                        "failed unexpectedly:\n"
+                        "  %s" % _exception_message(e))
+                    raise RuntimeError(msg)
+
+        # Below is toPandas without Arrow optimization.
--- End diff --

Likewise, the change from here on is due to the removed `else:` block.
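The control flow in the diff above amounts to a try/except with a config-controlled fallback. A minimal Spark-free sketch, where `arrow_path` and `plain_path` are illustrative stand-ins for pyspark's internal Arrow and row-based collection paths, not real APIs:

```python
import warnings

def to_pandas(df, arrow_enabled, fallback_enabled, arrow_path, plain_path):
    # Try the optimized path first; on failure either warn and fall back
    # or raise, depending on the fallback config flag.
    if arrow_enabled:
        try:
            return arrow_path(df)
        except Exception as e:
            if not fallback_enabled:
                raise RuntimeError(
                    "Arrow optimization failed: %s. Enable the fallback "
                    "config to fall back automatically." % e)
            warnings.warn("Arrow failed (%s); falling back." % e)
    # Non-optimized path: reached when Arrow is off or fallback triggered.
    return plain_path(df)

def broken(df):
    raise ValueError("unsupported type")

print(to_pandas("df", True, True, broken, lambda d: "plain"))    # plain
print(to_pandas("df", True, True, lambda d: "arrow", None))      # arrow
```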
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20663 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87670/ Test FAILed.
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20663 Merged build finished. Test FAILed.
[GitHub] spark issue #20640: [SPARK-19755][Mesos] Blacklist is always active for Meso...
Github user IgorBerman commented on the issue: https://github.com/apache/spark/pull/20640

@skonto, @susanxhuynh, @squito So let's agree that:
1. I'll revert the log when there is some failure, and reword it to something without "blacklisting".
2. The blacklisting itself will be moved to BlacklistTracker (as it is now).

Bottom line, the only thing missing is adding a log in case of failure (but without counting the number of failures, etc.). WDYT?
[GitHub] spark issue #20677: Event time can't be greater then processing time. 12:21,...
Github user deil87 commented on the issue: https://github.com/apache/spark/pull/20677 This patch needs to be changed before it can be merged. The mistake is in an image, and I don't have the sources for the image to change it nicely. Admins should verify and suggest next actions.
[GitHub] spark issue #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallback in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20678 Merged build finished. Test PASSed.
[GitHub] spark issue #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallback in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20678 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1061/ Test PASSed.
[GitHub] spark issue #20678: [SPARK-23380][PYTHON] Adds a conf for Arrow fallback in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20678 **[Test build #87674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87674/testReport)** for PR 20678 at commit [`7f87d25`](https://github.com/apache/spark/commit/7f87d2537488ca03c926f4d9c6318451c688ebe5).
[GitHub] spark pull request #20679: [SPARK-23514] Use SessionState.newHadoopConf() to...
GitHub user juliuszsompolski opened a pull request: https://github.com/apache/spark/pull/20679 [SPARK-23514] Use SessionState.newHadoopConf() to propagate hadoop configs set in SQLConf. ## What changes were proposed in this pull request? A few places in `spark-sql` were using `sc.hadoopConfiguration` directly. They should be using `sessionState.newHadoopConf()` to blend in configs that were set through `SQLConf`. Also, for better UX, for these configs blended in from `SQLConf`, we should consider removing the `spark.hadoop` prefix, so that the settings are recognized whether or not they were specified with that prefix by the user. ## How was this patch tested? Tested that AlterTableRecoverPartitions now correctly recognizes settings that are passed in to the FileSystem through SQLConf. You can merge this pull request into a Git repository by running: $ git pull https://github.com/juliuszsompolski/apache-spark SPARK-23514 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20679.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20679 commit 2c070fcd053acb47d8a8c3214d67e106b5683376 Author: Juliusz Sompolski Date: 2018-02-26T15:13:23Z spark-23514
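The pattern the PR description proposes can be sketched roughly as follows. This is an illustrative sketch, not the actual Spark source; in real Spark `sessionState` is `private[sql]`, so user code cannot call it directly.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

// Hypothetical helper: a SQL command that needs a Hadoop Configuration should
// derive it from the session state rather than from the shared SparkContext,
// so that per-session SQLConf overrides are blended in.
def hadoopConfFor(spark: SparkSession): Configuration = {
  // Before: spark.sparkContext.hadoopConfiguration -- ignores SQLConf settings.
  // After: newHadoopConf() returns a fresh copy with SQLConf entries applied.
  spark.sessionState.newHadoopConf()
}
```

With this in place, a setting made via `spark.conf.set("spark.hadoop.fs.s3a.access.key", ...)` would be visible to commands like `ALTER TABLE ... RECOVER PARTITIONS` that hit the FileSystem.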
[GitHub] spark issue #20679: [SPARK-23514] Use SessionState.newHadoopConf() to propag...
Github user juliuszsompolski commented on the issue: https://github.com/apache/spark/pull/20679 cc @gatorsmile @rxin We had a chat about whether to implement something that would catch direct misuses of sc.hadoopConfiguration in the sql module, but it seems that it's not very common, so maybe just fixing it where it happened is enough. @liancheng suggested stripping the "spark.hadoop" prefix for more compatibility with users who specify or don't specify that prefix.
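The prefix-stripping idea mentioned above amounts to something like the following standalone sketch (not Spark's actual implementation; the Hadoop conf is modeled as a plain mutable map for illustration):

```scala
import scala.collection.mutable

// Illustrative only: copy SQLConf-level settings into a Hadoop-style
// key/value map, accepting keys both with and without the "spark.hadoop."
// prefix so users can write either form.
def blendIntoHadoopConf(
    sqlConfSettings: Map[String, String],
    hadoopConf: mutable.Map[String, String]): Unit = {
  sqlConfSettings.foreach { case (key, value) =>
    // stripPrefix is a no-op when the prefix is absent, so both
    // "spark.hadoop.fs.defaultFS" and "fs.defaultFS" land on the same key.
    hadoopConf(key.stripPrefix("spark.hadoop.")) = value
  }
}
```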
[GitHub] spark issue #20624: [SPARK-23445] ColumnStat refactoring
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20624 **[Test build #87671 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87671/testReport)** for PR 20624 at commit [`a006bab`](https://github.com/apache/spark/commit/a006bab64420b24d9aff242902d0145892879691). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20679: [SPARK-23514] Use SessionState.newHadoopConf() to propag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20679 Merged build finished. Test PASSed.
[GitHub] spark issue #20679: [SPARK-23514] Use SessionState.newHadoopConf() to propag...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20679 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1062/ Test PASSed.
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20663 Jenkins, retest this please
[GitHub] spark issue #20679: [SPARK-23514] Use SessionState.newHadoopConf() to propag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20679 **[Test build #87675 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87675/testReport)** for PR 20679 at commit [`2c070fc`](https://github.com/apache/spark/commit/2c070fcd053acb47d8a8c3214d67e106b5683376).
[GitHub] spark issue #20663: [SPARK-23501][UI] Refactor AllStagesPage in order to avo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20663 **[Test build #87676 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87676/testReport)** for PR 20663 at commit [`9a8e032`](https://github.com/apache/spark/commit/9a8e032f6d46cb181e8b775812acdda8805888a8).
[GitHub] spark issue #20624: [SPARK-23445] ColumnStat refactoring
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20624 Merged build finished. Test PASSed.