[GitHub] [spark] HyukjinKwon commented on a change in pull request #25333: [SPARK-28597][SS] Add config to retry spark streaming's meta log when it met error
HyukjinKwon commented on a change in pull request #25333: [SPARK-28597][SS] Add config to retry spark streaming's meta log when it met error URL: https://github.com/apache/spark/pull/25333#discussion_r333822084 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecutionSuite.scala ## @@ -68,4 +72,43 @@ class MicroBatchExecutionSuite extends StreamTest with BeforeAndAfter { CheckNewAnswer((25, 1), (30, 1)) // This should not throw the error reported in SPARK-24156 ) } + + test("Add config to retry spark streaming's meta log when it met") { +val s = MemoryStream[Int] +// Specified checkpointLocation manually to init metadata file +val tmp = + new File(System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString).getCanonicalPath Review comment: You can just use `withTempDir` and `withTempPath`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #25333: [SPARK-28597][SS] Add config to retry spark streaming's meta log when it met error
HyukjinKwon commented on issue #25333: [SPARK-28597][SS] Add config to retry spark streaming's meta log when it met error URL: https://github.com/apache/spark/pull/25333#issuecomment-540898145 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
viirya commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#discussion_r333778200 ## File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ## @@ -20,8 +20,9 @@ package org.apache.spark import java.io._ import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit} import java.util.concurrent.locks.ReentrantReadWriteLock -import java.util.zip.{GZIPInputStream, GZIPOutputStream} +import com.github.luben.zstd.ZstdInputStream +import com.github.luben.zstd.ZstdOutputStream Review comment: I recall we put third party imports under java and scala. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
viirya commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#discussion_r333821336 ## File path: core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala ## @@ -333,4 +333,63 @@ class MapOutputTrackerSuite extends SparkFunSuite { rpcEnv.shutdown() } + ignore("Benchmarking `MapOutputTracker.serializeMapStatuses`") { +val newConf = new SparkConf + +// needs TorrentBroadcast so need a SparkContext +withSpark(new SparkContext("local", "MapOutputTrackerSuite", newConf)) { sc => + val tracker = sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster] + val rpcEnv = sc.env.rpcEnv + val masterEndpoint = new MapOutputTrackerMasterEndpoint(rpcEnv, tracker, newConf) + rpcEnv.stop(tracker.trackerEndpoint) + rpcEnv.setupEndpoint(MapOutputTracker.ENDPOINT_NAME, masterEndpoint) Review comment: Do we need to set up those if we just benchmark serializeMapStatuses? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liyinan926 commented on issue #26075: [WIP][K8S] Spark operator
liyinan926 commented on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540893571 Agreed with what @skonto said. This doesn't feel like something that is necessarily part of core Spark. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on issue #25929: [SPARK-29116][PYTHON][ML] Refactor py classes related to DecisionTree
zhengruifeng commented on issue #25929: [SPARK-29116][PYTHON][ML] Refactor py classes related to DecisionTree URL: https://github.com/apache/spark/pull/25929#issuecomment-540890576 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on issue #19424: [SPARK-22197][SQL] push down operators to data source before planning
cloud-fan commented on issue #19424: [SPARK-22197][SQL] push down operators to data source before planning URL: https://github.com/apache/spark/pull/19424#issuecomment-540890347 Usually a data source scans its data incrementally. So when the query has a limit, Spark stops consuming the iterator from data source, and data source won't scan all the data. But limit pushdown does have use cases. For now the only way is to write a catalyst rule, to catch the limit + scan query plan, and convert it to a custom query plan with limit pushed down. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shivusondur commented on a change in pull request #25561: [SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference.
shivusondur commented on a change in pull request #25561: [SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference. URL: https://github.com/apache/spark/pull/25561#discussion_r333816964 ## File path: docs/sql-ref-syntax-aux-show-tables.md ## @@ -18,5 +18,90 @@ license: | See the License for the specific language governing permissions and limitations under the License. --- +### Description -**This page is under construction** +The `SHOW TABLES` statement returns all the tables for an optionally specified database. +Additionally, the output of this statement may be filtered by an optional matching +pattern. If no database is specified then the tables are returned from the +current database. + +### Syntax +{% highlight sql %} +SHOW TABLES [{FROM|IN} database_name] [LIKE 'regex_pattern'] +{% endhighlight %} + +### Parameters + + {FROM|IN} database_name + + Specifies the database name from which tables are listed. + + LIKE regex_pattern + + Specifies the regular expression pattern that is used to filter out unwanted tables. + + Except for * and | character, the pattern works like a regex. Review comment: Done. The same correction I also interested. thank you This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815564 ## File path: graph/api/src/test/scala/org/apache/spark/graph/api/PropertyGraphSuite.scala ## @@ -0,0 +1,309 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.scalatest.Matchers + +import org.apache.spark.graph.api.CypherSession.{ID_COLUMN, LABEL_COLUMN_PREFIX, SOURCE_ID_COLUMN, TARGET_ID_COLUMN} +import org.apache.spark.sql.{DataFrame, QueryTest} +import org.apache.spark.sql.test.SharedSparkSession + +abstract class PropertyGraphSuite extends QueryTest with SharedSparkSession with Matchers { + + // Override in spark-cypher + type IdType = Long + def convertId(inputId: Long): IdType + + def cypherSession: CypherSession + + lazy val nodes: DataFrame = spark +.createDataFrame( + Seq( +(0L, true, true, false, false, Some(42), Some("Alice"), None, None), +(1L, true, true, false, false, Some(23), Some("Bob"), None, None), +(2L, true, false, true, false, Some(22), Some("Carol"), Some("CS"), None), +(3L, true, true, false, false, Some(19), Some("Eve"), None, None), +(4L, false, false, false, true, None, None, None, Some("UC Berkeley")), +(5L, false, false, false, true, None, None, None, Some("Stanford" +.toDF( + ID_COLUMN, + label("Person"), + label("Student"), + label("Teacher"), + label("University"), + "age", + "name", + "subject", + "title") + + lazy val relationships: DataFrame = spark Review comment: ```scala - lazy val relationships: DataFrame = spark + lazy val relationships: Dataset[Row] = spark ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815513 ## File path: graph/api/src/test/scala/org/apache/spark/graph/api/PropertyGraphSuite.scala ## @@ -0,0 +1,309 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.scalatest.Matchers + +import org.apache.spark.graph.api.CypherSession.{ID_COLUMN, LABEL_COLUMN_PREFIX, SOURCE_ID_COLUMN, TARGET_ID_COLUMN} +import org.apache.spark.sql.{DataFrame, QueryTest} +import org.apache.spark.sql.test.SharedSparkSession + +abstract class PropertyGraphSuite extends QueryTest with SharedSparkSession with Matchers { + + // Override in spark-cypher + type IdType = Long + def convertId(inputId: Long): IdType + + def cypherSession: CypherSession + + lazy val nodes: DataFrame = spark Review comment: ```scala - lazy val nodes: DataFrame = spark + lazy val nodes: Dataset[Row] = spark ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815461 ## File path: graph/api/src/test/scala/org/apache/spark/graph/api/PropertyGraphSuite.scala ## @@ -0,0 +1,309 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.scalatest.Matchers + +import org.apache.spark.graph.api.CypherSession.{ID_COLUMN, LABEL_COLUMN_PREFIX, SOURCE_ID_COLUMN, TARGET_ID_COLUMN} +import org.apache.spark.sql.{DataFrame, QueryTest} Review comment: ```scala -import org.apache.spark.sql.{DataFrame, QueryTest} +import org.apache.spark.sql.{Dataset, QueryTest, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815301 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrameBuilder.scala ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.DataFrame + +/** + * Interface used to build a [[RelationshipFrame]]. + * + * @param df DataFrame containing a single relationship in each row + * @since 3.0.0 + */ +final class RelationshipFrameBuilder(val df: DataFrame) { Review comment: ```scala -import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.{Dataset, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815222 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrameBuilder.scala ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.DataFrame Review comment: ```scala -import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.{Dataset, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815189 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrame.scala ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame + +/** + * Describes how to map a DataFrame to relationships. + * + * Each row in the DataFrame represents a relationship with the given relationship type. + * + * @param df DataFrame containing a single relationship in each row + * @param idColumn column that contains the relationship identifier + * @param sourceIdColumn column that contains the source node identifier of the relationship + * @param targetIdColumn column that contains the target node identifier of the relationship + * @param relationshipType relationship type that is assigned to all relationships + * @param properties mapping from property keys to corresponding columns + * @since 3.0.0 + */ +case class RelationshipFrame private[graph] ( +df: DataFrame, Review comment: ```scala -df: DataFrame, +df: Dataset[Row], ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815075 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/PropertyGraph.scala ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame + +/** + * A Property Graph as defined by the openCypher Property Graph Data Model. + * + * A graph is always tied to and managed by a [[CypherSession]]. + * The lifetime of a graph is bound by the session lifetime. + * + * @see http://www.opencypher.org/;>openCypher project + * @see https://dl.acm.org/citation.cfm?id=3183713.3190657;>Property Graph Model + * @since 3.0.0 + */ +abstract class PropertyGraph { + + /** + * The schema (graph type) describes the structure of this graph. + * + * @since 3.0.0 + */ + def schema: PropertyGraphType + + /** + * The session in which this graph is managed. + * + * @since 3.0.0 + */ + def cypherSession: CypherSession + + /** + * Executes a Cypher query in the session that manages this graph, using this graph as + * the input graph. + * + * @param query Cypher query to execute + * @since 3.0.0 + */ + def cypher(query: String): CypherResult = cypher(query, Map.empty[String, Any]) + + /** + * Executes a Cypher query in the session that manages this graph, using this graph as + * the input graph. + * + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher(query: String, parameters: Map[String, Any]): CypherResult = +cypherSession.cypher(this, query, parameters) + + /** + * Executes a Cypher query in the [[CypherSession]] that manages this graph, using this graph as + * the input graph. + * + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher(query: String, parameters: java.util.Map[String, Object]): CypherResult = +cypherSession.cypher(this, query, parameters) + + /** + * Returns the [[NodeFrame]] for a given node label set. + * + * @param labelSet Label set used for NodeFrame lookup + * @return NodeFrame for the given label set + * @since 3.0.0 + */ + def nodeFrame(labelSet: Array[String]): NodeFrame + + /** + * Returns the [[RelationshipFrame]] for a given relationship type. + * + * @param relationshipType Relationship type used for RelationshipFrame lookup + * @return RelationshipFrame for the given relationship type + * @since 3.0.0 + */ + def relationshipFrame(relationshipType: String): RelationshipFrame + + /** + * Returns a DataFrame that contains a row for each node in this graph. + * + * The DataFrame adheres to the following column naming conventions: + * + * {{{ + * Id column:`$ID` + * Label columns:`:{LABEL_NAME}` + * Property columns: `{Property_Key}` + * }}} + * + * @see `org.apache.spark.graph.api.CypherSession.createGraph(nodes, relationships)` + * @since 3.0.0 + */ + def nodes: DataFrame + + /** + * Returns a DataFrame that contains a row for each relationship in this + * graph. + * + * The DataFrame adheres to column naming conventions: + * + * {{{ + * Id column:`$ID` + * SourceId column: `$SOURCE_ID` + * TargetId column: `$TARGET_ID` + * RelType columns: `:{REL_TYPE}` + * Property columns: `{Property_Key}` + * }}} + * + * @see `org.apache.spark.graph.api.CypherSession.createGraph(nodes, relationships)` + * @since 3.0.0 + */ + def relationships: DataFrame Review comment: ```scala - def relationships: DataFrame + def relationships: Dataset[Row] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815049 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/PropertyGraph.scala ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame + +/** + * A Property Graph as defined by the openCypher Property Graph Data Model. + * + * A graph is always tied to and managed by a [[CypherSession]]. + * The lifetime of a graph is bound by the session lifetime. + * + * @see http://www.opencypher.org/;>openCypher project + * @see https://dl.acm.org/citation.cfm?id=3183713.3190657;>Property Graph Model + * @since 3.0.0 + */ +abstract class PropertyGraph { + + /** + * The schema (graph type) describes the structure of this graph. + * + * @since 3.0.0 + */ + def schema: PropertyGraphType + + /** + * The session in which this graph is managed. + * + * @since 3.0.0 + */ + def cypherSession: CypherSession + + /** + * Executes a Cypher query in the session that manages this graph, using this graph as + * the input graph. + * + * @param query Cypher query to execute + * @since 3.0.0 + */ + def cypher(query: String): CypherResult = cypher(query, Map.empty[String, Any]) + + /** + * Executes a Cypher query in the session that manages this graph, using this graph as + * the input graph. + * + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher(query: String, parameters: Map[String, Any]): CypherResult = +cypherSession.cypher(this, query, parameters) + + /** + * Executes a Cypher query in the [[CypherSession]] that manages this graph, using this graph as + * the input graph. + * + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher(query: String, parameters: java.util.Map[String, Object]): CypherResult = +cypherSession.cypher(this, query, parameters) + + /** + * Returns the [[NodeFrame]] for a given node label set. + * + * @param labelSet Label set used for NodeFrame lookup + * @return NodeFrame for the given label set + * @since 3.0.0 + */ + def nodeFrame(labelSet: Array[String]): NodeFrame + + /** + * Returns the [[RelationshipFrame]] for a given relationship type. + * + * @param relationshipType Relationship type used for RelationshipFrame lookup + * @return RelationshipFrame for the given relationship type + * @since 3.0.0 + */ + def relationshipFrame(relationshipType: String): RelationshipFrame + + /** + * Returns a DataFrame that contains a row for each node in this graph. + * + * The DataFrame adheres to the following column naming conventions: + * + * {{{ + * Id column:`$ID` + * Label columns:`:{LABEL_NAME}` + * Property columns: `{Property_Key}` + * }}} + * + * @see `org.apache.spark.graph.api.CypherSession.createGraph(nodes, relationships)` + * @since 3.0.0 + */ + def nodes: DataFrame Review comment: ```scala - def nodes: DataFrame + def nodes: Dataset[Row] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333815169 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrame.scala ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame Review comment: ```scala -import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.{Dataset, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814921 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/NodeFrameBuilder.scala ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.DataFrame + +/** + * Interface used to build a [[NodeFrame]]. + * + * @param df DataFrame containing a single node in each row + * @since 3.0.0 + */ +final class NodeFrameBuilder(var df: DataFrame) { Review comment: ```scala -final class NodeFrameBuilder(var df: DataFrame) { +final class NodeFrameBuilder(var df: Dataset[Row]) { ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814998 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/PropertyGraph.scala ## @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame Review comment: ```scala -import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.{Dataset, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814906 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/NodeFrameBuilder.scala ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.DataFrame Review comment: ```scala -import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.{Dataset, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814839 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/NodeFrame.scala ## @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame Review comment: ```scala -import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.{Dataset, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814876 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/NodeFrame.scala ## @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame + +/** + * Describes how to map a DataFrame to nodes. + * + * Each row in the DataFrame represents a node which has exactly the labels defined by the given + * label set. + * + * @param df DataFrame containing a single node in each row + * @param idColumn column that contains the node identifier + * @param labelSet labels that are assigned to all nodes + * @param properties mapping from property keys to corresponding columns + * @since 3.0.0 + */ +case class NodeFrame private[graph] ( +df: DataFrame, Review comment: ```scala -df: DataFrame, +df: Dataset[Row], ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814769 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/GraphElementFrame.scala ## @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame Review comment: ```scala -import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.{Dataset, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814743 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala ## @@ -0,0 +1,258 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession} +import org.apache.spark.sql.types.{BooleanType, StructType} + +/** + * Contains constants used for convention based column naming. + */ +object CypherSession { + + /** + * Naming convention for identifier columns, both node and relationship identifiers. + */ + val ID_COLUMN = "$ID" + + /** + * Naming convention for relationship source identifier. + */ + val SOURCE_ID_COLUMN = "$SOURCE_ID" + + /** + * Naming convention for relationship target identifier. + */ + val TARGET_ID_COLUMN = "$TARGET_ID" + + /** + * Naming convention both for node label and relationship type prefixes. + */ + val LABEL_COLUMN_PREFIX = ":" +} + +/** + * A CypherSession allows for creating, storing and loading [[PropertyGraph]] instances as well as + * executing Cypher queries on them. + * + * Wraps a [[org.apache.spark.sql.SparkSession]]. + * + * @since 3.0.0 + */ +trait CypherSession extends Logging { + + def sparkSession: SparkSession + + /** + * Executes a Cypher query on the given input graph. + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @since 3.0.0 + */ + def cypher(graph: PropertyGraph, query: String): CypherResult + + /** + * Executes a Cypher query on the given input graph. + * + * Note that queries can take optional parameters: + * + * {{{ + * Parameters: + * + * { + *"name" : "Alice" + * } + * + * Query: + * + * MATCH (n:Person) + * WHERE n.name = $name + * RETURN n + * }}} + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher(graph: PropertyGraph, query: String, parameters: Map[String, Any]): CypherResult + + /** + * Executes a Cypher query on the given input graph. + * + * Note that queries can take optional parameters: + * + * {{{ + * Parameters: + * + * { + *"name" : "Alice" + * } + * + * Query: + * + * MATCH (n:Person) + * WHERE n.name = $name + * RETURN n + * }}} + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher( + graph: PropertyGraph, + query: String, + parameters: java.util.Map[String, Object]): CypherResult = { +cypher(graph, query, parameters.asScala.toMap) + } + + /** + * Creates a [[PropertyGraph]] from a sequence of [[NodeFrame]]s and [[RelationshipFrame]]s. + * At least one [[NodeFrame]] has to be provided. + * + * For each label set and relationship type there can be at most one [[NodeFrame]] and at most one + * [[RelationshipFrame]], respectively. + * + * @param nodes NodeFrames that define the nodes in the graph + * @param relationships RelationshipFrames that define the relationships in the graph + * @since 3.0.0 + */ + def createGraph(nodes: Array[NodeFrame], relationships: Array[RelationshipFrame]): PropertyGraph + + /** + * Creates a [[PropertyGraph]] from nodes and relationships. + * + * The given DataFrames need to adhere to the following column naming conventions: + * + * {{{ + * Id column:`$ID`(nodes and relationships) + * SourceId column: `$SOURCE_ID` (relationships) + * TargetId column: `$TARGET_ID` (relationships) + * + * Label columns:`:{LABEL_NAME}` (nodes) + * RelType columns:
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814802 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/GraphElementFrame.scala ## @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame + +/** + * A [[PropertyGraph]] is created from GraphElementFrames. + * + * A graph element is either a node or a relationship. + * A GraphElementFrame wraps a DataFrame and describes how it maps to graph elements. + * + * @since 3.0.0 + */ +abstract class GraphElementFrame { + + /** + * Initial DataFrame that can still contain unmapped, arbitrarily ordered columns. + * + * @since 3.0.0 + */ + def df: DataFrame Review comment: ```scala - def df: DataFrame + def df: Dataset[Row] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814672 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala ## @@ -0,0 +1,186 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession} + +object CypherSession { + val ID_COLUMN = "$ID" + val SOURCE_ID_COLUMN = "$SOURCE_ID" + val TARGET_ID_COLUMN = "$TARGET_ID" + val LABEL_COLUMN_PREFIX = ":" +} + +/** + * The entry point for using property graphs in Spark. + * + * Provides factory methods for creating [[PropertyGraph]] instances. + * + * Wraps a [[org.apache.spark.sql.SparkSession]]. + * + * @since 3.0.0 + */ +trait CypherSession { + + def sparkSession: SparkSession + + /** + * Executes a Cypher query on the given input graph. + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @since 3.0.0 + */ + def cypher(graph: PropertyGraph, query: String): CypherResult + + /** + * Executes a Cypher query on the given input graph. + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher(graph: PropertyGraph, query: String, parameters: Map[String, Any]): CypherResult + + /** + * Executes a Cypher query on the given input graph. + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher(graph: PropertyGraph, + query: String, + parameters: java.util.Map[String, Object]): CypherResult = { +cypher(graph, query, parameters.asScala.toMap) + } + + /** + * Creates a [[PropertyGraph]] from a sequence of [[NodeFrame]]s and [[RelationshipFrame]]s. + * At least one [[NodeFrame]] has to be provided. + * + * For each label set and relationship type there can be at most one [[NodeFrame]] and at most one + * [[RelationshipFrame]], respectively. + * + * @param nodes NodeFrames that define the nodes in the graph + * @param relationships RelationshipFrames that define the relationships in the graph + * @since 3.0.0 + */ + def createGraph(nodes: Seq[NodeFrame], relationships: Seq[RelationshipFrame]): PropertyGraph + + /** + * Creates a [[PropertyGraph]] from a sequence of [[NodeFrame]]s and [[RelationshipFrame]]s. + * At least one [[NodeFrame]] has to be provided. + * + * For each label set and relationship type there can be at most one [[NodeFrame]] and at most one + * [[RelationshipFrame]], respectively. + * + * @param nodes NodeFrames that define the nodes in the graph + * @param relationships RelationshipFrames that define the relationships in the graph + * @since 3.0.0 + */ + def createGraph( + nodes: java.util.List[NodeFrame], + relationships: java.util.List[RelationshipFrame]): PropertyGraph = { +createGraph(nodes.asScala, relationships.asScala) + } + + /** + * Creates a [[PropertyGraph]] from nodes and relationships. + * + * The given DataFrames need to adhere to the following column naming conventions: + * + * {{{ + * Id column:`$ID`(nodes and relationships) + * SourceId column: `$SOURCE_ID` (relationships) + * TargetId column: `$TARGET_ID` (relationships) + * + * Label columns:`:{LABEL_NAME}` (nodes) + * RelType columns: `:{REL_TYPE}`(relationships) + * + * Property columns: `{Property_Key}` (nodes and relationships) + * }}} + * + * @see [[CypherSession]] + * @param nodes node DataFrame + * @param relationships relationship DataFrame + * @since 3.0.0 + */ + def createGraph(nodes: DataFrame, relationships: DataFrame): PropertyGraph = { Review comment: ```scala -
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814596 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala ## @@ -0,0 +1,258 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession} Review comment: Actually, the following for the final update. ```scala -import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession} +import org.apache.spark.sql.{Dataset, Row, SparkSession} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814713 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala ## @@ -0,0 +1,258 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession} +import org.apache.spark.sql.types.{BooleanType, StructType} + +/** + * Contains constants used for convention based column naming. + */ +object CypherSession { + + /** + * Naming convention for identifier columns, both node and relationship identifiers. + */ + val ID_COLUMN = "$ID" + + /** + * Naming convention for relationship source identifier. + */ + val SOURCE_ID_COLUMN = "$SOURCE_ID" + + /** + * Naming convention for relationship target identifier. + */ + val TARGET_ID_COLUMN = "$TARGET_ID" + + /** + * Naming convention both for node label and relationship type prefixes. + */ + val LABEL_COLUMN_PREFIX = ":" +} + +/** + * A CypherSession allows for creating, storing and loading [[PropertyGraph]] instances as well as + * executing Cypher queries on them. + * + * Wraps a [[org.apache.spark.sql.SparkSession]]. + * + * @since 3.0.0 + */ +trait CypherSession extends Logging { + + def sparkSession: SparkSession + + /** + * Executes a Cypher query on the given input graph. + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @since 3.0.0 + */ + def cypher(graph: PropertyGraph, query: String): CypherResult + + /** + * Executes a Cypher query on the given input graph. + * + * Note that queries can take optional parameters: + * + * {{{ + * Parameters: + * + * { + *"name" : "Alice" + * } + * + * Query: + * + * MATCH (n:Person) + * WHERE n.name = $name + * RETURN n + * }}} + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher(graph: PropertyGraph, query: String, parameters: Map[String, Any]): CypherResult + + /** + * Executes a Cypher query on the given input graph. + * + * Note that queries can take optional parameters: + * + * {{{ + * Parameters: + * + * { + *"name" : "Alice" + * } + * + * Query: + * + * MATCH (n:Person) + * WHERE n.name = $name + * RETURN n + * }}} + * + * @param graph [[PropertyGraph]] on which the query is executed + * @param query Cypher query to execute + * @param parameters parameters used by the Cypher query + * @since 3.0.0 + */ + def cypher( + graph: PropertyGraph, + query: String, + parameters: java.util.Map[String, Object]): CypherResult = { +cypher(graph, query, parameters.asScala.toMap) + } + + /** + * Creates a [[PropertyGraph]] from a sequence of [[NodeFrame]]s and [[RelationshipFrame]]s. + * At least one [[NodeFrame]] has to be provided. + * + * For each label set and relationship type there can be at most one [[NodeFrame]] and at most one + * [[RelationshipFrame]], respectively. + * + * @param nodes NodeFrames that define the nodes in the graph + * @param relationships RelationshipFrames that define the relationships in the graph + * @since 3.0.0 + */ + def createGraph(nodes: Array[NodeFrame], relationships: Array[RelationshipFrame]): PropertyGraph + + /** + * Creates a [[PropertyGraph]] from nodes and relationships. + * + * The given DataFrames need to adhere to the following column naming conventions: + * + * {{{ + * Id column:`$ID`(nodes and relationships) + * SourceId column: `$SOURCE_ID` (relationships) + * TargetId column: `$TARGET_ID` (relationships) + * + * Label columns:`:{LABEL_NAME}` (nodes) + * RelType columns:
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814508 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/CypherResult.scala ## @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame + +/** + * Result of a Cypher query. + * + * Wraps a DataFrame that contains the result rows. + * + * @since 3.0.0 + */ +trait CypherResult { + + /** + * Contains the result rows. + * + * The column names are aligned with the return item names specified within the Cypher query, + * (e.g. `RETURN foo, bar AS baz` results in the columns `foo` and `baz`). + * + * @note Dot characters (i.e. `.`) within return item names are replaced by an underscore (`_`), + * (e.g. `MATCH (n:Person) RETURN n` results in the columns `n`, `n:Person` and `n_name`). + * @since 3.0.0 + */ + def df: DataFrame Review comment: ```scala - def df: DataFrame + def df: Dataset[Row] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333814464 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/CypherResult.scala ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame Review comment: ```scala -import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.{Dataset, Row} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333813221 ## File path: graph/api/src/test/scala/org/apache/spark/graph/api/PropertyGraphSuite.scala ## @@ -0,0 +1,309 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.scalatest.Matchers + +import org.apache.spark.graph.api.CypherSession.{ID_COLUMN, LABEL_COLUMN_PREFIX, SOURCE_ID_COLUMN, TARGET_ID_COLUMN} +import org.apache.spark.sql.{DataFrame, QueryTest} +import org.apache.spark.sql.test.SharedSparkSession + +abstract class PropertyGraphSuite extends QueryTest with SharedSparkSession with Matchers { + + // Override in spark-cypher Review comment: There is not much information about this. Could you explain about this a little bit more? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333812890 ## File path: graph/api/src/test/java/org/apache/spark/graph/api/JavaPropertyGraphSuite.java ## @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api; + +import com.google.common.collect.Lists; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SparkSession; +import org.apache.spark.sql.test.TestSparkSession; +import org.apache.spark.sql.types.DataType; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.io.Serializable; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import static org.apache.spark.sql.types.DataTypes.*; + +public abstract class JavaPropertyGraphSuite implements Serializable { + private transient TestSparkSession spark; + private transient CypherSession cypherSession; + + abstract CypherSession getCypherSession(SparkSession sparkSession); + + @Before + public void setUp() { +spark = new TestSparkSession(); +cypherSession = getCypherSession(spark); + } + + @After + public void tearDown() { +spark.stop(); +spark = null; + } + + @Test + public void testCreateFromNodeFrame() { +StructType personSchema = createSchema( +Lists.newArrayList("id", "name"), Review comment: We use 2-space indentation in Java code. Please update all the other instances. You can reference the existing code, `Java8APISuite.java`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page
liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page URL: https://github.com/apache/spark/pull/25994#issuecomment-540883751 Ok, I see. We will wait quietly. Thank you all! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333811998 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrame.scala ## @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import org.apache.spark.sql.DataFrame + +/** + * Describes how to map a DataFrame to relationships. + * + * Each row in the DataFrame represents a relationship with the given relationship type. + * + * @param df DataFrame containing a single relationship in each row + * @param idColumn column that contains the relationship identifier + * @param sourceIdColumn column that contains the source node identifier of the relationship + * @param targetIdColumn column that contains the target node identifier of the relationship + * @param relationshipType relationship type that is assigned to all relationships + * @param properties mapping from property keys to corresponding columns + * @since 3.0.0 + */ +case class RelationshipFrame private[graph] ( +df: DataFrame, +idColumn: String, +sourceIdColumn: String, +targetIdColumn: String, +relationshipType: String, +properties: Map[String, String]) +extends GraphElementFrame { Review comment: Indentation? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon edited a comment on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas
HyukjinKwon edited a comment on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas URL: https://github.com/apache/spark/pull/24405#issuecomment-540881610 @giamo mind updating PR? Sorry for my late response. Looks like we're getting closer to merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page
wangyum commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page URL: https://github.com/apache/spark/pull/25994#issuecomment-540881448 Sorry @liucht-inspur This has nothing to do with authorization: http://apache-spark-developers-list.1001551.n3.nabble.com/build-system-IMPORTANT-northern-california-fire-danger-potential-power-outage-s-td28059.html This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas
AmplabJenkins removed a comment on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas URL: https://github.com/apache/spark/pull/24405#issuecomment-531894129 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas
HyukjinKwon commented on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas URL: https://github.com/apache/spark/pull/24405#issuecomment-540881610 @giamo mind updating PR? looks like we're getting closer to merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API
dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API URL: https://github.com/apache/spark/pull/24851#discussion_r333811076 ## File path: graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala ## @@ -0,0 +1,258 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graph.api + +import scala.collection.JavaConverters._ + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession} Review comment: ```scala -import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession} +import org.apache.spark.sql.{DataFrame, SparkSession} ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas
HyukjinKwon commented on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas URL: https://github.com/apache/spark/pull/24405#issuecomment-540881369 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#discussion_r333810252 ## File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ## @@ -807,13 +808,17 @@ private[spark] object MapOutputTracker extends Logging { private val BROADCAST = 1 // Serialize an array of map output locations into an efficient byte format so that we can send - // it to reduce tasks. We do this by compressing the serialized bytes using GZIP. They will + // it to reduce tasks. We do this by compressing the serialized bytes using Zstd. They will // generally be pretty compressible because many map outputs will be on the same hostname. def serializeMapStatuses(statuses: Array[MapStatus], broadcastManager: BroadcastManager, isLocal: Boolean, minBroadcastSize: Int): (Array[Byte], Broadcast[Array[Byte]]) = { -val out = new ByteArrayOutputStream -out.write(DIRECT) -val objOut = new ObjectOutputStream(new GZIPOutputStream(out)) +import scala.language.reflectiveCalls +val out = new ByteArrayOutputStream(4096) { Review comment: Just curious, is there any specific reason to choose 4096 here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#discussion_r333809995 ## File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ## @@ -20,8 +20,9 @@ package org.apache.spark import java.io._ import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit} import java.util.concurrent.locks.ReentrantReadWriteLock -import java.util.zip.{GZIPInputStream, GZIPOutputStream} +import com.github.luben.zstd.ZstdInputStream +import com.github.luben.zstd.ZstdOutputStream Review comment: IIRC, these imports should fall into third party imports, thus under the scala imports according to the coding style. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#discussion_r333809807 ## File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ## @@ -822,18 +827,43 @@ private[spark] object MapOutputTracker extends Logging { } { objOut.close() } -val arr = out.toByteArray + +val arr: Array[Byte] = { + val compressedOut = new ByteArrayOutputStream(4096) + val zos = new ZstdOutputStream(compressedOut) + Utils.tryWithSafeFinally { +compressedOut.write(DIRECT) +zos.write(out.getBuf, 0, out.size()) Review comment: I think we can replace this call with `out.writeTo(zos)`? So that we can avoid exposing the `buf` directly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
dongjoon-hyun commented on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#issuecomment-540880356 Thank you for pinging me, @dbtsai . I'll take a look tomorrow. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liucht-inspur edited a comment on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page
liucht-inspur edited a comment on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page URL: https://github.com/apache/spark/pull/25994#issuecomment-540864157 cc @wangyum We are waiting for spark's Jenkins to test,can you help to authorize please? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a change in pull request #26055: [SPARK-29368][SQL][TEST] Port interval.sql
wangyum commented on a change in pull request #26055: [SPARK-29368][SQL][TEST] Port interval.sql URL: https://github.com/apache/spark/pull/26055#discussion_r333806629 ## File path: sql/core/src/test/resources/sql-tests/inputs/postgreSQL/interval.sql ## @@ -0,0 +1,330 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- INTERVAL +-- https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/interval.sql + +-- SET DATESTYLE = 'ISO'; +-- SET IntervalStyle to postgres; + +-- check acceptance of "time zone style" +-- [SPARK-29369] Accept strings without `interval` prefix in casting to intervals +-- [SPARK-29370] Interval strings without explicit unit markings +-- SELECT INTERVAL '01:00' AS `One hour`; +-- SELECT INTERVAL '+02:00' AS `Two hours`; +-- SELECT INTERVAL '-08:00' AS `Eight hours`; +-- SELECT INTERVAL '-1 +02:03' AS `22 hours ago...`; +-- SELECT INTERVAL '-1 days +02:03' AS `22 hours ago...`; +-- [SPARK-29371] Support interval field values with fractional parts +-- SELECT INTERVAL '1.5 weeks' AS `Ten days twelve hours`; +-- SELECT INTERVAL '1.5 months' AS `One month 15 days`; +-- SELECT INTERVAL '10 years -11 month -12 days +13:14' AS `9 years...`; + +-- [SPARK-29382] Support the `INTERVAL` type by Parquet datasource Review comment: We do not support `CREATE TABLE t(f1 int)` in `sql/core` module: ``` -- !query 0 CREATE TABLE t(f1 int) -- !query 0 schema struct<> -- !query 0 output org.apache.spark.sql.AnalysisException Hive support is required to CREATE Hive TABLE (AS SELECT); ``` So I added `USING parquet` everywhere. How about change it to? ``` -- [SPARK-29382] Support writing `INTERVAL` type to datasource table ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter
HyukjinKwon closed pull request #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter URL: https://github.com/apache/spark/pull/19955 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon edited a comment on issue #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory
HyukjinKwon edited a comment on issue #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory URL: https://github.com/apache/spark/pull/25398#issuecomment-540870091 If you only target to fix Hive ser/de to respect compression, why don't you set Hive compression properly? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory
HyukjinKwon commented on issue #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory URL: https://github.com/apache/spark/pull/25398#issuecomment-540870091 If you only target to fix Hive ser/de to respect configuration, why don't you set Hive compression properly? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory
HyukjinKwon commented on a change in pull request #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory URL: https://github.com/apache/spark/pull/25398#discussion_r333802719 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ## @@ -1396,6 +1396,16 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { compressed = false, properties = rowStorage.properties ++ fileStorage.properties) -(ctx.LOCAL != null, storage, Some(DDLUtils.HIVE_PROVIDER)) +val fileFormat = extractFileFormat(fileStorage.serde) +(ctx.LOCAL != null, storage, Some(fileFormat)) + } + + private def extractFileFormat(serde: Option[String]): String = { Review comment: @Udbhav30, I more meant this seems not the right place to replace. You should add a configuration, and such replacement should be done in analysis/optimizing, not in the parser. If that's the case, then why don't you use `USING file_format` explicitly? Can you please describe what this PR target to fix clearly? Spark lately added a new PR template (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE). It might be better to follow this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory
HyukjinKwon commented on a change in pull request #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory URL: https://github.com/apache/spark/pull/25398#discussion_r333802719 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ## @@ -1396,6 +1396,16 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf) { compressed = false, properties = rowStorage.properties ++ fileStorage.properties) -(ctx.LOCAL != null, storage, Some(DDLUtils.HIVE_PROVIDER)) +val fileFormat = extractFileFormat(fileStorage.serde) +(ctx.LOCAL != null, storage, Some(fileFormat)) + } + + private def extractFileFormat(serde: Option[String]): String = { Review comment: @Udbhav30, I more meant this seems not the right place to replace. You should add a configuration, and such replacement should be done in analysis/optimizing, not in the parser. If that's the case, then why don't you use `USING file_format` explicitly? Can you please describe what this PR target to fix clearly? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #25961: [SPARK-29284][SQL] Adaptive query execution works correct…
HyukjinKwon closed pull request #25961: [SPARK-29284][SQL] Adaptive query execution works correct… URL: https://github.com/apache/spark/pull/25961 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #25961: [SPARK-29284][SQL] Adaptive query execution works correct…
HyukjinKwon commented on issue #25961: [SPARK-29284][SQL] Adaptive query execution works correct… URL: https://github.com/apache/spark/pull/25961#issuecomment-540865669 @hn5092, please identify which PR fixed this, and see if it's feasible to backport. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page
liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page URL: https://github.com/apache/spark/pull/25994#issuecomment-540864157 cc @wangyum This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #25914: [SPARK-29227][SS]Track rule info in optimization phase
HyukjinKwon commented on issue #25914: [SPARK-29227][SS]Track rule info in optimization phase URL: https://github.com/apache/spark/pull/25914#issuecomment-540863339 ok to test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25914: [SPARK-29227][SS]Track rule info in optimization phase
AmplabJenkins removed a comment on issue #25914: [SPARK-29227][SS]Track rule info in optimization phase URL: https://github.com/apache/spark/pull/25914#issuecomment-534461235 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
skonto edited a comment on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk @jkremser @liyinan926 There are many operator projects out there but it does not make sense to make them part of the core project. Same for the Flink project (https://github.com/lyft/flinkk8soperator vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. I also find go libs superior for the purpose of writing operators and there have been issues with the fabric8io client eg. https://github.com/fabric8io/kubernetes-client/issues/1384. In addition, the java operator uses standalone mode and I think k8s native integration is more efficient. Anyway I believe this should be another external project as many others. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
skonto edited a comment on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk @jkremser @liyinan926 There are many operator projects out there but it does not make sense to make them part of the core project. Same for the Flink project (https://github.com/lyft/flinkk8soperator vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. I also find go libs superior for the purpose of writing operators and there have been issues with the fabric8io client eg. https://github.com/fabric8io/kubernetes-client/issues/1384. In addition, the java operator uses standalone mode and I think k8s native integration is more efficient .Anyway I believe this should be another external project as many others. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
skonto edited a comment on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk @jkremser @liyinan926 There are many operator projects out there but it does not make sense to make them part of the core project. Same for the Flink project (https://github.com/lyft/flinkk8soperator vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. I also find go libs superior for the purpose of writing operators and there have been issues with the fabric8io client eg. https://github.com/fabric8io/kubernetes-client/issues/1384.Anyway I believe this should be another external project as many others. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on issue #25721: [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9
AngersZh commented on issue #25721: [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9 URL: https://github.com/apache/spark/pull/25721#issuecomment-540861739 @gatorsmile Thanks for your attention. Since there have been a lot of changes about sql/hive-thriftserver recently, so I stop this and work for other issue.I will follow @juliuszsompolski 's comment and fit it to current. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
skonto edited a comment on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk @jkremser @liyinan926 There are many operator projects out there but it does not make sense to make them part of the core project. Same for the Flink project (https://github.com/lyft/flinkk8soperator vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. I also find go libs superior for the purpose of writing operators and there have been issues with the fabric8io client eg. https://github.com/fabric8io/kubernetes-client/issues/1384. Anyway I believe this should be another external project as many others. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
skonto edited a comment on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk @jkremser @liyinan926 There are many operator projects out there but it does not make sense to make them part of the core project. Same for the Flink project (https://github.com/lyft/flinkk8soperator vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. I also find go libs superior for the purpose of writing operators and there have been issues with the fabric8io client, I can provide more details. Anyway I believe this should be another external project as many others. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
skonto edited a comment on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk @jkremser @liyinan926 There are many operator projects out there but it does not make sense to make them part of the core project. Same for the Flink project (https://github.com/lyft/flinkk8soperator vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. I also find go libs superior for the purpose of writing operators and there have been issues with the fabric8io client, I can provide more details. So I believe this should be another external project. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
skonto edited a comment on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk @jkremser @liyinan926 There are many operator projects out there but does not make sense to be part of the core project. Same for the Flink project. Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. I also find go libs superior for the purpose of writing operators and there have been issues with the fabric8io client, I can provide more details. So I believe this should be another external project. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
skonto edited a comment on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk @jkremser There are many operator projects out there but does not make sense to be part of the core project. Same for the Flink project. Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. I also find go libs superior for the purpose of writing operators and there have been issues with the fabric8io client, I can provide more details. So I believe this should be another external project. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] skonto commented on issue #26075: [WIP][K8S] Spark operator
skonto commented on issue #26075: [WIP][K8S] Spark operator URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668 @dongjoon-hyun @holdenk There are many operator projects out there but does not make sense to be part of the core project. Same for the Flink project. Btw in other PRs people complained about the code base being to large so this is another argument to avoid adding a lot of "external" stuff. Moreover the resource manager stuff logically belong outside of this project but for historical reasons they were kept here since the effort for the common resource manager API never happened. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] BryanCutler commented on issue #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
BryanCutler commented on issue #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide URL: https://github.com/apache/spark/pull/26045#issuecomment-540857594 Thanks @HyukjinKwon and others for reviewing! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #26055: [SPARK-29368][SQL][TEST] Port interval.sql
MaxGekk commented on a change in pull request #26055: [SPARK-29368][SQL][TEST] Port interval.sql URL: https://github.com/apache/spark/pull/26055#discussion_r333788660 ## File path: sql/core/src/test/resources/sql-tests/inputs/postgreSQL/interval.sql ## @@ -0,0 +1,330 @@ +-- +-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group +-- +-- +-- INTERVAL +-- https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/interval.sql + +-- SET DATESTYLE = 'ISO'; +-- SET IntervalStyle to postgres; + +-- check acceptance of "time zone style" +-- [SPARK-29369] Accept strings without `interval` prefix in casting to intervals +-- [SPARK-29370] Interval strings without explicit unit markings +-- SELECT INTERVAL '01:00' AS `One hour`; +-- SELECT INTERVAL '+02:00' AS `Two hours`; +-- SELECT INTERVAL '-08:00' AS `Eight hours`; +-- SELECT INTERVAL '-1 +02:03' AS `22 hours ago...`; +-- SELECT INTERVAL '-1 days +02:03' AS `22 hours ago...`; +-- [SPARK-29371] Support interval field values with fractional parts +-- SELECT INTERVAL '1.5 weeks' AS `Ten days twelve hours`; +-- SELECT INTERVAL '1.5 months' AS `One month 15 days`; +-- SELECT INTERVAL '10 years -11 month -12 days +13:14' AS `9 years...`; + +-- [SPARK-29382] Support the `INTERVAL` type by Parquet datasource Review comment: @dongjoon-hyun please, clarify this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
MaxGekk commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#discussion_r333788013 ## File path: core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala ## @@ -333,4 +333,63 @@ class MapOutputTrackerSuite extends SparkFunSuite { rpcEnv.shutdown() } + ignore("Benchmarking `MapOutputTracker.serializeMapStatuses`") { Review comment: Are there any reasons for it is not implemented as regular benchmark using the `Benchmark` class? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
HyukjinKwon closed pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide URL: https://github.com/apache/spark/pull/26045 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on issue #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide
HyukjinKwon commented on issue #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide URL: https://github.com/apache/spark/pull/26045#issuecomment-540849488 Merged to master. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on issue #25721: [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9
gatorsmile commented on issue #25721: [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9 URL: https://github.com/apache/spark/pull/25721#issuecomment-540845787 @AngersZh @wangyum Could you address the comment and move it forward? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on a change in pull request #24922: [SPARK-28120][SS] Rocksdb state storage implementation
gatorsmile commented on a change in pull request #24922: [SPARK-28120][SS] Rocksdb state storage implementation URL: https://github.com/apache/spark/pull/24922#discussion_r333772305 ## File path: sql/core/pom.xml ## @@ -147,6 +147,12 @@ mockito-core test + + + org.rocksdb + rocksdbjni Review comment: RocksDB might not be the best backend. Instead of adding the extra dependency, I think we should just do it as a separate third-party package. The community can always build their own backend based on their needs. Doing it is simple. Can you submit it to https://spark-packages.org/? cc @marmbrus @tdas @zsxwing This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dbtsai edited a comment on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
dbtsai edited a comment on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#issuecomment-540829740 ping @dongjoon-hyun @holdenk @viirya @tgravescs This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dbtsai commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
dbtsai commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#discussion_r333769537 ## File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ## @@ -357,8 +358,8 @@ private[spark] abstract class MapOutputTracker(conf: SparkConf) extends Logging */ private[spark] class MapOutputTrackerMaster( conf: SparkConf, -broadcastManager: BroadcastManager, -isLocal: Boolean) +private[spark] val broadcastManager: BroadcastManager, +private[spark] val isLocal: Boolean) Review comment: Expose for benchmark code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dbtsai commented on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
dbtsai commented on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085#issuecomment-540829740 ping @dongjoon-hyun @holdenk @viirya This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dbtsai opened a new pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance
dbtsai opened a new pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance URL: https://github.com/apache/spark/pull/26085 ### What changes were proposed in this pull request? Instead of using GZIP for compressing the serialized `MapStatuses`, ZStd provides better compression rate and faster compression time. The original approach is serializing and writing data directly into `GZIPOutputStream` as one step; however, the compression time is faster if a bigger chuck of the data is processed by the codec at once. As a result, in this PR, the serialized data is written into an uncompressed byte array, and then the data will be compressed. For smaller `MapStatues`, we find that it gives 2x performance gain. Here is the benchmark result. 20k map outputs, and each has 500 blocks 1. ZStd two steps in this PR: 0.402 ops/ms, 89066 bytes 2. ZStd one step as the original approach: 0.370 ops/ms, 89069 bytes 3. GZip: 0.092 ops/ms, 217345 bytes 20k map outputs, and each has 5 blocks 1. ZStd two steps in this PR: 0.9 ops/ms, 75449 bytes 2. ZStd one step as the original approach: 0.38 ops/ms, 75452 bytes 3. GZip: 0.21 ops/ms, 160094 bytes ### Why are the changes needed? Decrease the time for serializing the `MapStatuses` in large scale job. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)
JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) URL: https://github.com/apache/spark/pull/26076#issuecomment-540824919 > That sounds like a good idea to me. Can we do that first? I'll prototype this. If I get it working then I'll open a second PR and will ping / link it here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on a change in pull request #26058: [SPARK-10614][core] Add monotonic time to Clock interface.
vanzin commented on a change in pull request #26058: [SPARK-10614][core] Add monotonic time to Clock interface. URL: https://github.com/apache/spark/pull/26058#discussion_r333755039 ## File path: core/src/main/scala/org/apache/spark/util/Clock.scala ## @@ -21,7 +21,14 @@ package org.apache.spark.util * An interface to represent clocks, so that they can be mocked out in unit tests. */ private[spark] trait Clock { + /** @return Current system time, in ms. */ def getTimeMillis(): Long + /** @return Current value of monotonic time source, in ns. */ + def nanoTime(): Long + /** Review comment: Old code didn't have them, but sure. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on issue #26006: [SPARK-29279][SQL] Merge SHOW NAMESPACES and SHOW DATABASES code path
rdblue commented on issue #26006: [SPARK-29279][SQL] Merge SHOW NAMESPACES and SHOW DATABASES code path URL: https://github.com/apache/spark/pull/26006#issuecomment-540814215 +1 from me as well This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on a change in pull request #25955: [SPARK-29277][SQL] Add early DSv2 filter and projection pushdown
rdblue commented on a change in pull request #25955: [SPARK-29277][SQL] Add early DSv2 filter and projection pushdown URL: https://github.com/apache/spark/pull/25955#discussion_r333747933 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ## @@ -55,7 +56,52 @@ case class DataSourceV2Relation( } override def computeStats(): Statistics = { -val scan = newScanBuilder().build() +if (Utils.isTesting) { + // when testing, throw an exception if this computeStats method is called because stats should + // not be accessed before pushing the projection and filters to create a scan. otherwise, the + // stats are not accurate because they are based on a full table scan of all columns. + throw new UnsupportedOperationException( +s"BUG: computeStats called before pushdown on DSv2 relation: $name") +} else { + // when not testing, return stats because bad stats are better than failing a query + newScanBuilder() match { +case r: SupportsReportStatistics => + val statistics = r.estimateStatistics() + DataSourceV2Relation.transformV2Stats(statistics, None, conf.defaultSizeInBytes) +case _ => + Statistics(sizeInBytes = conf.defaultSizeInBytes) + } +} + } + + override def newInstance(): DataSourceV2Relation = { +copy(output = output.map(_.newInstance())) + } +} + +/** + * A logical plan for a DSv2 table with a scan already created. + * + * This is used in the optimizer to push filters and projection down before conversion to physical + * plan. This ensures that the stats that are used by the optimizer account for the filters and + * projection that will be pushed down. + * + * @param table a DSv2 [[Table]] + * @param scan a DSv2 [[Scan]] + * @param output the output attributes of this relation + */ +case class DataSourceV2ScanRelation( Review comment: This should be a separate node to avoid correctness problems. Otherwise, it is easy to accidentally write rules that match both `DataSourceV2Relation` and `DataSourceV2ScanRelation` but does not handle the case where operators have already been pushed. When a filter is pushed down, it is also removed from the filters on top of the scan. If push-down happens a second time because rules match the same node, then it is easy for a mistake to ignore the original pushed filter and create a second independent scan. That's a correctness bug that is easy to introduce by accident. Using a separate relation type requires rules to choose whether to support a relation after push-down, or just a relation before push-down. The trade-off is that some places need to match both, but those cases are few and worth the trade. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] planga82 opened a new pull request #26084: [SPARK-29433][WebUI] Fix tooltip stages table
planga82 opened a new pull request #26084: [SPARK-29433][WebUI] Fix tooltip stages table URL: https://github.com/apache/spark/pull/26084 ### What changes were proposed in this pull request? In the Web UI, Stages table, the tool tip of Input and output column are not corrrect. Actual tooltip messages: Bytes and records read from Hadoop or from Spark storage. Bytes and records written to Hadoop. In this column we are only showing bytes, not records ![image](https://user-images.githubusercontent.com/12819544/66608286-85a0e480-ebb6-11e9-812a-9760bea53664.png) ![image](https://user-images.githubusercontent.com/12819544/66608323-96515a80-ebb6-11e9-9e5f-e3f2cc99a3b3.png) ![image](https://user-images.githubusercontent.com/12819544/66608450-de707d00-ebb6-11e9-84e3-0917b5cfe6f6.png) ![image](https://user-images.githubusercontent.com/12819544/66608468-eaf4d580-ebb6-11e9-8c5b-2a9a290bea9c.png) ### Why are the changes needed? Simple correction of a tooltip ### Does this PR introduce any user-facing change? Yes, tooltip correction ### How was this patch tested? Manual testing This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column
srowen commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column URL: https://github.com/apache/spark/pull/26064#discussion_r333737235 ## File path: mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala ## @@ -69,66 +83,117 @@ final class Binarizer @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("1.4.0") def setOutputCol(value: String): this.type = set(outputCol, value) + /** @group setParam */ + @Since("3.0.0") + def setInputCols(value: Array[String]): this.type = set(inputCols, value) + + /** @group setParam */ + @Since("3.0.0") + def setOutputCols(value: Array[String]): this.type = set(outputCols, value) + @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema, logging = true) -val schema = dataset.schema -val inputType = schema($(inputCol)).dataType -val td = $(threshold) -val metadata = outputSchema($(outputCol)).metadata - -val binarizerUDF = inputType match { - case DoubleType => -udf { in: Double => if (in > td) 1.0 else 0.0 } - - case _: VectorUDT if td >= 0 => -udf { vector: Vector => - val indices = ArrayBuilder.make[Int] - val values = ArrayBuilder.make[Double] - vector.foreachActive { (index, value) => -if (value > td) { - indices += index - values += 1.0 + +val (inputColNames, outputColNames, tds) = + if (isSet(inputCols)) { +if (isSet(thresholds)) { + ($(inputCols).toSeq, $(outputCols).toSeq, $(thresholds).toSeq) +} else { + ($(inputCols).toSeq, $(outputCols).toSeq, Seq.fill($(inputCols).length)($(threshold))) +} + } else { +(Seq($(inputCol)), Seq($(outputCol)), Seq($(threshold))) + } + +val ouputCols = inputColNames.zip(tds).map { case (inputColName, td) => + val binarizerUDF = dataset.schema(inputColName).dataType match { +case DoubleType => + udf { in: Double => if (in > td) 1.0 else 0.0 } + +case _: VectorUDT if td >= 0 => + udf { vector: Vector => +val indices = ArrayBuilder.make[Int] +val values = ArrayBuilder.make[Double] +vector.foreachActive { (index, value) => + if (value > td) { +indices += index +values += 1.0 + } } +Vectors.sparse(vector.size, indices.result(), values.result()).compressed } - Vectors.sparse(vector.size, indices.result(), values.result()).compressed -} - case _: VectorUDT if td < 0 => -this.logWarning(s"Binarization operations on sparse dataset with negative threshold " + - s"$td will build a dense output, so take care when applying to sparse input.") -udf { vector: Vector => - val values = Array.fill(vector.size)(1.0) - vector.foreachActive { (index, value) => -if (value <= td) { - values(index) = 0.0 +case _: VectorUDT if td < 0 => + this.logWarning(s"Binarization operations on sparse dataset with negative threshold " + +s"$td will build a dense output, so take care when applying to sparse input.") + udf { vector: Vector => +val values = Array.fill(vector.size)(1.0) Review comment: Nit, while we're here. Is it maybe faster to start with an array of 0, and flip the ones that exceed the threshold? just because the array is already initialized to 0. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column
srowen commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column URL: https://github.com/apache/spark/pull/26064#discussion_r333737897 ## File path: mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala ## @@ -69,66 +83,117 @@ final class Binarizer @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("1.4.0") def setOutputCol(value: String): this.type = set(outputCol, value) + /** @group setParam */ + @Since("3.0.0") + def setInputCols(value: Array[String]): this.type = set(inputCols, value) + + /** @group setParam */ + @Since("3.0.0") + def setOutputCols(value: Array[String]): this.type = set(outputCols, value) + @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { val outputSchema = transformSchema(dataset.schema, logging = true) -val schema = dataset.schema -val inputType = schema($(inputCol)).dataType -val td = $(threshold) -val metadata = outputSchema($(outputCol)).metadata - -val binarizerUDF = inputType match { - case DoubleType => -udf { in: Double => if (in > td) 1.0 else 0.0 } - - case _: VectorUDT if td >= 0 => -udf { vector: Vector => - val indices = ArrayBuilder.make[Int] - val values = ArrayBuilder.make[Double] - vector.foreachActive { (index, value) => -if (value > td) { - indices += index - values += 1.0 + +val (inputColNames, outputColNames, tds) = Review comment: Do you also need to check if outputCols and inputCols are set together? or is that already done elsewhere now? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on issue #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column
huaxingao commented on issue #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column URL: https://github.com/apache/spark/pull/26064#issuecomment-540787575 I have a general question: What is the criteria for adding multi-column support to a ML algorithm? Right now ```Bucketizer```, ```StringIndexer```, ```QuantileDiscretizer```, and this ```Binarizer``` support multi-column. Do we need to add multi-column support to other algorithms? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column
huaxingao commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column URL: https://github.com/apache/spark/pull/26064#discussion_r333727044 ## File path: mllib/src/test/scala/org/apache/spark/ml/feature/BinarizerSuite.scala ## @@ -122,5 +122,118 @@ class BinarizerSuite extends MLTest with DefaultReadWriteTest { .setOutputCol("myOutputCol") .setThreshold(0.1) testDefaultReadWrite(t) + +val t2 = new Binarizer() + .setInputCols(Array("input1", "input2", "input3")) + .setOutputCols(Array("result1", "result2", "result3")) + .setThresholds(Array(30.0, 30.0, 30.0)) +testDefaultReadWrite(t2) + } + + test("Multiple Columns: Test thresholds") { +val thresholds = Array(10.0, -0.5, 0.0) + +val data1 = Seq(5.0, 11.0) +val expected1 = Seq(0.0, 1.0) +val data2 = Seq(Vectors.sparse(3, Array(1), Array(0.5)), + Vectors.dense(Array(0.0, 0.5, 0.0))) +val expected2 = Seq(Vectors.dense(Array(1.0, 1.0, 1.0)), + Vectors.dense(Array(1.0, 1.0, 1.0))) +val data3 = Seq(0.0, 1.0) +val expected3 = Seq(0.0, 1.0) + +val df = Seq(0, 1).map { idx => + (data1(idx), data2(idx), data3(idx), expected1(idx), expected2(idx), expected3(idx)) +}.toDF("input1", "input2", "input3", "expected1", "expected2", "expected3") + +val binarizer = new Binarizer() + .setInputCols(Array("input1", "input2", "input3")) + .setOutputCols(Array("result1", "result2", "result3")) + .setThresholds(thresholds) + +binarizer.transform(df) + .select("result1", "expected1", "result2", "expected2", "result3", "expected3") + .collect().foreach { + case Row(r1: Double, e1: Double, r2: Vector, e2: Vector, r3: Double, e3: Double) => +assert(r1 === e1, + s"The result value is not correct after bucketing. Expected $e1 but found $r1") +assert(r2 === e2, + s"The result value is not correct after bucketing. Expected $e2 but found $r2") +assert(r3 === e3, + s"The result value is not correct after bucketing. Expected $e3 but found $r3") +} + } + + test("Multiple Columns: Comparing setting threshold with setting thresholds " + +"explicitly with identical values") { +val data1 = Array.range(1, 21, 1).map(_.toDouble) +val data2 = Array.range(1, 40, 2).map(_.toDouble) +val data3 = Array.range(1, 60, 3).map(_.toDouble) +val df = (0 until 20).map { idx => + (data1(idx), data2(idx), data3(idx)) +}.toDF("input1", "input2", "input3") + +val binarizerSingleThreshold = new Binarizer() + .setInputCols(Array("input1", "input2", "input3")) + .setOutputCols(Array("result1", "result2", "result3")) + .setThreshold(30.0) + +val df2 = binarizerSingleThreshold.transform(df) + +val binarizerMultiThreshold = new Binarizer() + .setInputCols(Array("input1", "input2", "input3")) + .setOutputCols(Array("expected1", "expected2", "expected3")) + .setThresholds(Array(30.0, 30.0, 30.0)) + +binarizerMultiThreshold.transform(df2) + .select("result1", "expected1", "result2", "expected2", "result3", "expected3") + .collect().foreach { + case Row(r1: Double, e1: Double, r2: Double, e2: Double, r3: Double, e3: Double) => +assert(r1 === e1, + s"The result value is not correct after bucketing. Expected $e1 but found $r1") +assert(r2 === e2, + s"The result value is not correct after bucketing. Expected $e2 but found $r2") +assert(r3 === e3, + s"The result value is not correct after bucketing. Expected $e3 but found $r3") +} + } + + test("Multiple Columns: Mismatched sizes of inputCols/outputCols") { +val binarizer = new Binarizer() + .setInputCols(Array("input")) + .setOutputCols(Array("result1", "result2")) + .setThreshold(1.0) +val df = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)) + .map(Tuple1.apply).toDF("input") +intercept[IllegalArgumentException] { + binarizer.transform(df).count() +} + } + + test("Multiple Columns: Mismatched sizes of inputCols/thresholds") { +val binarizer = new Binarizer() + .setInputCols(Array("input1", "input2")) + .setOutputCols(Array("result1", "result2")) + .setThresholds(Array(1.0, 2.0, 3.0)) +val data1 = Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0) +val data2 = Array(1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 3.0) +val df = data1.zip(data2).toSeq.toDF("input1", "input2") +intercept[IllegalArgumentException] { + binarizer.transform(df).count() +} + } + + test("Multiple Columns: Set both of threshold/thresholds") { +val binarizer = new Binarizer() + .setInputCols(Array("input1", "input2")) + .setOutputCols(Array("result1", "result2")) + .setThresholds(Array(1.0, 2.0)) + .setThreshold(1.0) +val data1 = Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0,
[GitHub] [spark] huaxingao commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column
huaxingao commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column URL: https://github.com/apache/spark/pull/26064#discussion_r333725521 ## File path: python/pyspark/ml/feature.py ## @@ -65,7 +65,8 @@ @inherit_doc -class Binarizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable): +class Binarizer(JavaTransformer, HasThreshold, HasThresholds, HasInputCol, HasOutputCol, +HasInputCols, HasOutputCols, JavaMLReadable, JavaMLWritable): Review comment: Since multi-column is supported now, maybe change the below docstring to reflect the multi-column support? And also change the Scala doc? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener
squito commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener URL: https://github.com/apache/spark/pull/25943#discussion_r333716820 ## File path: core/src/main/scala/org/apache/spark/status/storeTypes.scala ## @@ -76,6 +109,29 @@ private[spark] class JobDataWrapper( @JsonIgnore @KVIndex("completionTime") private def completionTime: Long = info.completionTime.map(_.getTime).getOrElse(-1L) + + def toLiveJob: LiveJob = { Review comment: you can get multiple taskend events for the same `taskInfo.index` however, so you actually need those indices This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on issue #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener
squito commented on issue #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener URL: https://github.com/apache/spark/pull/25943#issuecomment-540769871 > Hmm...This PR doesn't store live entities, instead, it recovers (or say, restore) live entities from a KVStore I see, this is probably the key part I was missing. So you're saying the KVStore already has enough info, this PR just repopulates the in-memory version of all the LiveEntities? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tgravescs commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling
tgravescs commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling URL: https://github.com/apache/spark/pull/26078#discussion_r333715025 ## File path: core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala ## @@ -25,10 +25,13 @@ import org.apache.spark.resource.{ResourceAllocator, ResourceInformation} * information. * @param name Resource name * @param addresses Resource addresses provided by the executor + * @param numParts Number of ways each resource is subdivided when scheduling tasks */ -private[spark] class ExecutorResourceInfo(name: String, addresses: Seq[String]) +private[spark] class ExecutorResourceInfo(name: String, addresses: Seq[String], + numParts: Int) extends ResourceInformation(name, addresses.toArray) with ResourceAllocator { override protected def resourceName = this.name override protected def resourceAddresses = this.addresses + override protected def resourcesPerAddress = numParts Review comment: sure slotsPerAddress sounds good. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #26013: [SPARK-29347][SQL] Add JSON serialization for external Rows
viirya commented on a change in pull request #26013: [SPARK-29347][SQL] Add JSON serialization for external Rows URL: https://github.com/apache/spark/pull/26013#discussion_r333711709 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala ## @@ -501,4 +513,88 @@ trait Row extends Serializable { private def getAnyValAs[T <: AnyVal](i: Int): T = if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is null") else getAs[T](i) + + /** The compact JSON representation of this row. */ + def json: String = compact(jsonValue) + + /** The pretty (i.e. indented) JSON representation of this row. */ + def prettyJson: String = pretty(render(jsonValue)) + + /** + * JSON representation of the row. + * + * Note that this only supports the data types that are also supported by + * [[org.apache.spark.sql.catalyst.encoders.RowEncoder]]. + * + * @return the JSON representation of the row. + */ + private[sql] def jsonValue: JValue = { +require(schema != null, "JSON serialization requires a non-null schema.") + +lazy val timeZone = TimeZone.getTimeZone(SQLConf.get.sessionLocalTimeZone) +lazy val dateFormatter = DateFormatter.apply(timeZone.toZoneId) +lazy val timestampFormatter = TimestampFormatter.apply(timeZone.toZoneId) + +// Convert an iterator of values to a json array +def iteratorToJsonArray(iterator: Iterator[_], elementType: DataType): JArray = { + JArray(iterator.map(toJson(_, elementType)).toList) +} + +// Convert a value to json. +def toJson(value: Any, dataType: DataType): JValue = (value, dataType) match { + case (null, _) => JNull + case (b: Boolean, _) => JBool(b) + case (b: Byte, _) => JLong(b) + case (s: Short, _) => JLong(s) + case (i: Int, _) => JLong(i) + case (l: Long, _) => JLong(l) + case (f: Float, _) => JDouble(f) + case (d: Double, _) => JDouble(d) + case (d: BigDecimal, _) => JDecimal(d) + case (d: java.math.BigDecimal, _) => JDecimal(d) + case (d: Decimal, _) => JDecimal(d.toBigDecimal) + case (s: String, _) => JString(s) + case (b: Array[Byte], BinaryType) => +JString(Base64.getEncoder.encodeToString(b)) + case (d: LocalDate, _) => +JString(dateFormatter.format(DateTimeUtils.localDateToDays(d))) + case (d: Date, _) => +JString(dateFormatter.format(DateTimeUtils.fromJavaDate(d))) + case (i: Instant, _) => +JString(timestampFormatter.format(DateTimeUtils.instantToMicros(i))) + case (t: Timestamp, _) => +JString(timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t))) + case (a: Array[_], ArrayType(elementType, _)) => +iteratorToJsonArray(a.iterator, elementType) + case (s: Seq[_], ArrayType(elementType, _)) => +iteratorToJsonArray(s.iterator, elementType) + case (m: Map[String @unchecked, _], MapType(StringType, valueType, _)) => Review comment: Can other primitive types like Int be good for this format too? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] abellina commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling
abellina commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling URL: https://github.com/apache/spark/pull/26078#discussion_r333704705 ## File path: core/src/main/scala/org/apache/spark/resource/ResourceAllocator.scala ## @@ -30,27 +30,35 @@ trait ResourceAllocator { protected def resourceName: String protected def resourceAddresses: Seq[String] + protected def resourcesPerAddress: Int /** - * Map from an address to its availability, the value `true` means the address is available, - * while value `false` means the address is assigned. + * Map from an address to its availability, a value > 0 means the address is available, + * while value of 0 means the address is fully assigned. + * + * For task resources ([[org.apache.spark.scheduler.ExecutorResourceInfo]]), this value + * can be fractional. Review comment: good catch. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] abellina commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling
abellina commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling URL: https://github.com/apache/spark/pull/26078#discussion_r333704288 ## File path: core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala ## @@ -25,10 +25,13 @@ import org.apache.spark.resource.{ResourceAllocator, ResourceInformation} * information. * @param name Resource name * @param addresses Resource addresses provided by the executor + * @param numParts Number of ways each resource is subdivided when scheduling tasks */ -private[spark] class ExecutorResourceInfo(name: String, addresses: Seq[String]) +private[spark] class ExecutorResourceInfo(name: String, addresses: Seq[String], + numParts: Int) extends ResourceInformation(name, addresses.toArray) with ResourceAllocator { override protected def resourceName = this.name override protected def resourceAddresses = this.addresses + override protected def resourcesPerAddress = numParts Review comment: One issue here is that this would be part of the `ResourceAllocator` interface, of which there are two `WorkerResourceInfo`, which sets the `resourcesPerAddress` to 1, and `ExecutorResourceInfo`, which is the variable one. Could we all it something like `slotsPerAddress`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service
tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service URL: https://github.com/apache/spark/pull/26000#discussion_r333690639 ## File path: docs/running-on-yarn.md ## @@ -492,6 +492,13 @@ To use a custom metrics.properties for the application master and executors, upd If it is not set then the YARN application ID is used. + + spark.yarn.shuffle.service.name + spark_shuffle + +Name of the external shuffle service. Review comment: many newbie's aren't familiar with what external shuffle service is or even yarn so its best to be clear. How about: The name of the external shuffle service. The external shuffle service itself is configured and started by YARN (see Configuring the External Shuffle Service for details). The name specified here must match the name YARN used. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service
tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service URL: https://github.com/apache/spark/pull/26000#discussion_r333691945 ## File path: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ## @@ -136,7 +136,11 @@ private DB db; public YarnShuffleService() { -super("spark_shuffle"); +this("spark_shuffle"); + } + + protected YarnShuffleService(String serviceName) { Review comment: So the name by itself isn't going to be enough. If you really want it configurable we are going to have to have the port configurable. For instance the config name for the port spark.shuffle.service.port needs to be able to be something like spark.shuffle.service.{serviceName}.port. Otherwise all the spark shuffle servers will try to get the same port and fail. The only other option will be to use 0 for ephemeral but This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service
tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service URL: https://github.com/apache/spark/pull/26000#discussion_r333656098 ## File path: common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java ## @@ -136,7 +136,11 @@ private DB db; public YarnShuffleService() { -super("spark_shuffle"); +this("spark_shuffle"); + } + + protected YarnShuffleService(String serviceName) { +super(serviceName); logger.info("Initializing YARN shuffle service for Spark"); Review comment: lets change the log statement to have the servicename in it This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] abellina commented on issue #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling
abellina commented on issue #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling URL: https://github.com/apache/spark/pull/26078#issuecomment-540728997 Thanks @tgravescs will address. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #26059: [SPARK-29398][core] Support dedicated thread pools for RPC endpoints.
squito commented on a change in pull request #26059: [SPARK-29398][core] Support dedicated thread pools for RPC endpoints. URL: https://github.com/apache/spark/pull/26059#discussion_r333683292 ## File path: core/src/main/scala/org/apache/spark/rpc/RpcEndpoint.scala ## @@ -146,3 +146,13 @@ private[spark] trait RpcEndpoint { * [[ThreadSafeRpcEndpoint]] for different messages. */ private[spark] trait ThreadSafeRpcEndpoint extends RpcEndpoint + +/** + * An endpoint that uses a dedicated thread pool for delivering messages. + */ +private[spark] trait IsolatedRpcEndpoint extends RpcEndpoint { + + /** How many threads to use for delivering messages. By default, use a single thread. */ + def threadCount(): Int = 1 Review comment: I'm trying to wrap my head around what happens if you create an `IsolatedRpcEndpoint` with threadCount() > 1, given the code in `Inbox` which checks for inheritance from `ThreadSafeRpcEndpoint`: https://github.com/apache/spark/blob/2b3c3793c97dc0e48f1e098c6f1038f51607b716/core/src/main/scala/org/apache/spark/rpc/netty/Inbox.scala#L123 I guess if you expect one endpoint to be served by multiple threads, it makes sense you'd want `Inbox.enableConcurrent = false` and you'd have to make your endpoint safe for that -- but worth a comment here at least. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ifilonenko commented on issue #25785: [SPARK-27812][CORE] Explicit System.exit after job's main
ifilonenko commented on issue #25785: [SPARK-27812][CORE] Explicit System.exit after job's main URL: https://github.com/apache/spark/pull/25785#issuecomment-540721815 hey @igorcalabria, do you have a link to your PR in `kubernetes-client`? This problem affects our spark-on-k8s jobs and I would like to have a timeline on the patch or offer help where I can. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #25561: [SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference.
srowen commented on a change in pull request #25561: [SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference. URL: https://github.com/apache/spark/pull/25561#discussion_r333672276 ## File path: docs/sql-ref-syntax-aux-show-tables.md ## @@ -18,5 +18,90 @@ license: | See the License for the specific language governing permissions and limitations under the License. --- +### Description -**This page is under construction** +The `SHOW TABLES` statement returns all the tables for an optionally specified database. +Additionally, the output of this statement may be filtered by an optional matching +pattern. If no database is specified then the tables are returned from the +current database. + +### Syntax +{% highlight sql %} +SHOW TABLES [{FROM|IN} database_name] [LIKE 'regex_pattern'] +{% endhighlight %} + +### Parameters + + {FROM|IN} database_name + + Specifies the database name from which tables are listed. + + LIKE regex_pattern + + Specifies the regular expression pattern that is used to filter out unwanted tables. + + Except for * and | character, the pattern works like a regex. Review comment: OK, I might put back the back-ticks or quotes around special chars here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org