date:20191010

[GitHub] [spark] HyukjinKwon commented on a change in pull request #25333: [SPARK-28597][SS] Add config to retry spark streaming's meta log when it met error

2019-10-10 Thread GitBox

HyukjinKwon commented on a change in pull request #25333: [SPARK-28597][SS] Add 
config to retry spark streaming's meta log when it met error
URL: https://github.com/apache/spark/pull/25333#discussion_r333822084
 
 

 ##
 File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecutionSuite.scala
 ##
 @@ -68,4 +72,43 @@ class MicroBatchExecutionSuite extends StreamTest with 
BeforeAndAfter {
   CheckNewAnswer((25, 1), (30, 1))   // This should not throw the error 
reported in SPARK-24156
 )
   }
+
+  test("Add config to retry spark streaming's meta log when it met") {
+val s = MemoryStream[Int]
+// Specified checkpointLocation manually to init metadata file
+val tmp =
+  new File(System.getProperty("java.io.tmpdir"), 
UUID.randomUUID().toString).getCanonicalPath
 
 Review comment:
   You can just use `withTempDir` and `withTempPath`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on issue #25333: [SPARK-28597][SS] Add config to retry spark streaming's meta log when it met error

2019-10-10 Thread GitBox

HyukjinKwon commented on issue #25333: [SPARK-28597][SS] Add config to retry 
spark streaming's meta log when it met error
URL: https://github.com/apache/spark/pull/25333#issuecomment-540898145
 
 
   retest this please


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

viirya commented on a change in pull request #26085: [SPARK-29434] [Core] 
Improve the MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#discussion_r333778200
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
 ##
 @@ -20,8 +20,9 @@ package org.apache.spark
 import java.io._
 import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue, 
ThreadPoolExecutor, TimeUnit}
 import java.util.concurrent.locks.ReentrantReadWriteLock
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
 
+import com.github.luben.zstd.ZstdInputStream
+import com.github.luben.zstd.ZstdOutputStream
 
 Review comment:
   I recall we put third party imports under java and scala.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

viirya commented on a change in pull request #26085: [SPARK-29434] [Core] 
Improve the MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#discussion_r333821336
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala
 ##
 @@ -333,4 +333,63 @@ class MapOutputTrackerSuite extends SparkFunSuite {
 rpcEnv.shutdown()
   }
 
+  ignore("Benchmarking `MapOutputTracker.serializeMapStatuses`") {
+val newConf = new SparkConf
+
+// needs TorrentBroadcast so need a SparkContext
+withSpark(new SparkContext("local", "MapOutputTrackerSuite", newConf)) { 
sc =>
+  val tracker = 
sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
+  val rpcEnv = sc.env.rpcEnv
+  val masterEndpoint = new MapOutputTrackerMasterEndpoint(rpcEnv, tracker, 
newConf)
+  rpcEnv.stop(tracker.trackerEndpoint)
+  rpcEnv.setupEndpoint(MapOutputTracker.ENDPOINT_NAME, masterEndpoint)
 
 Review comment:
   Do we need to set up those if we just benchmark serializeMapStatuses?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] liyinan926 commented on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

liyinan926 commented on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540893571
 
 
   Agreed with what @skonto said. This doesn't feel like something that is 
necessarily part of core Spark. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] zhengruifeng commented on issue #25929: [SPARK-29116][PYTHON][ML] Refactor py classes related to DecisionTree

2019-10-10 Thread GitBox

zhengruifeng commented on issue #25929: [SPARK-29116][PYTHON][ML] Refactor py 
classes related to DecisionTree
URL: https://github.com/apache/spark/pull/25929#issuecomment-540890576
 
 
   retest this please


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on issue #19424: [SPARK-22197][SQL] push down operators to data source before planning

2019-10-10 Thread GitBox

cloud-fan commented on issue #19424: [SPARK-22197][SQL] push down operators to 
data source before planning
URL: https://github.com/apache/spark/pull/19424#issuecomment-540890347
 
 
   Usually a data source scans its data incrementally. So when the query has a 
limit, Spark stops consuming the iterator from data source, and data source 
won't scan all the data.
   
   But limit pushdown does have use cases. For now the only way is to write a 
catalyst rule, to catch the limit + scan query plan, and convert it to a custom 
query plan with limit pushed down.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] shivusondur commented on a change in pull request #25561: [SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference.

2019-10-10 Thread GitBox

shivusondur commented on a change in pull request #25561: 
[SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference.
URL: https://github.com/apache/spark/pull/25561#discussion_r333816964
 
 

 ##
 File path: docs/sql-ref-syntax-aux-show-tables.md
 ##
 @@ -18,5 +18,90 @@ license: |
   See the License for the specific language governing permissions and
   limitations under the License.
 ---
+### Description
 
-**This page is under construction**
+The `SHOW TABLES` statement returns all the tables for an optionally specified 
database.
+Additionally, the output of this statement may be filtered by an optional 
matching
+pattern. If no database is specified then the tables are returned from the 
+current database.
+
+### Syntax
+{% highlight sql %}
+SHOW TABLES [{FROM|IN} database_name] [LIKE 'regex_pattern']
+{% endhighlight %}
+
+### Parameters
+
+  {FROM|IN} database_name
+  
+ Specifies the database name from which tables are listed.
+  
+  LIKE regex_pattern
+  
+ Specifies the regular expression pattern that is used to filter out 
unwanted tables. 
+  
+   Except for * and | character, the pattern works like a 
regex.
 
 Review comment:
   Done. The same correction I also interested. thank you
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815564
 
 

 ##
 File path: 
graph/api/src/test/scala/org/apache/spark/graph/api/PropertyGraphSuite.scala
 ##
 @@ -0,0 +1,309 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.scalatest.Matchers
+
+import org.apache.spark.graph.api.CypherSession.{ID_COLUMN, 
LABEL_COLUMN_PREFIX, SOURCE_ID_COLUMN, TARGET_ID_COLUMN}
+import org.apache.spark.sql.{DataFrame, QueryTest}
+import org.apache.spark.sql.test.SharedSparkSession
+
+abstract class PropertyGraphSuite extends QueryTest with SharedSparkSession 
with Matchers {
+
+  // Override in spark-cypher
+  type IdType = Long
+  def convertId(inputId: Long): IdType
+
+  def cypherSession: CypherSession
+
+  lazy val nodes: DataFrame = spark
+.createDataFrame(
+  Seq(
+(0L, true, true, false, false, Some(42), Some("Alice"), None, None),
+(1L, true, true, false, false, Some(23), Some("Bob"), None, None),
+(2L, true, false, true, false, Some(22), Some("Carol"), Some("CS"), 
None),
+(3L, true, true, false, false, Some(19), Some("Eve"), None, None),
+(4L, false, false, false, true, None, None, None, Some("UC Berkeley")),
+(5L, false, false, false, true, None, None, None, Some("Stanford"
+.toDF(
+  ID_COLUMN,
+  label("Person"),
+  label("Student"),
+  label("Teacher"),
+  label("University"),
+  "age",
+  "name",
+  "subject",
+  "title")
+
+  lazy val relationships: DataFrame = spark
 
 Review comment:
   ```scala
   -  lazy val relationships: DataFrame = spark
   +  lazy val relationships: Dataset[Row] = spark
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815513
 
 

 ##
 File path: 
graph/api/src/test/scala/org/apache/spark/graph/api/PropertyGraphSuite.scala
 ##
 @@ -0,0 +1,309 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.scalatest.Matchers
+
+import org.apache.spark.graph.api.CypherSession.{ID_COLUMN, 
LABEL_COLUMN_PREFIX, SOURCE_ID_COLUMN, TARGET_ID_COLUMN}
+import org.apache.spark.sql.{DataFrame, QueryTest}
+import org.apache.spark.sql.test.SharedSparkSession
+
+abstract class PropertyGraphSuite extends QueryTest with SharedSparkSession 
with Matchers {
+
+  // Override in spark-cypher
+  type IdType = Long
+  def convertId(inputId: Long): IdType
+
+  def cypherSession: CypherSession
+
+  lazy val nodes: DataFrame = spark
 
 Review comment:
   ```scala
   -  lazy val nodes: DataFrame = spark
   +  lazy val nodes: Dataset[Row] = spark
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815461
 
 

 ##
 File path: 
graph/api/src/test/scala/org/apache/spark/graph/api/PropertyGraphSuite.scala
 ##
 @@ -0,0 +1,309 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.scalatest.Matchers
+
+import org.apache.spark.graph.api.CypherSession.{ID_COLUMN, 
LABEL_COLUMN_PREFIX, SOURCE_ID_COLUMN, TARGET_ID_COLUMN}
+import org.apache.spark.sql.{DataFrame, QueryTest}
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.{DataFrame, QueryTest}
   +import org.apache.spark.sql.{Dataset, QueryTest, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815301
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrameBuilder.scala
 ##
 @@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * Interface used to build a [[RelationshipFrame]].
+ *
+ * @param df DataFrame containing a single relationship in each row
+ * @since 3.0.0
+ */
+final class RelationshipFrameBuilder(val df: DataFrame) {
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.DataFrame
   +import org.apache.spark.sql.{Dataset, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815222
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrameBuilder.scala
 ##
 @@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.DataFrame
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.DataFrame
   +import org.apache.spark.sql.{Dataset, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815189
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrame.scala
 ##
 @@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * Describes how to map a DataFrame to relationships.
+ *
+ * Each row in the DataFrame represents a relationship with the given 
relationship type.
+ *
+ * @param df   DataFrame containing a single relationship in each 
row
+ * @param idColumn column that contains the relationship identifier
+ * @param sourceIdColumn   column that contains the source node identifier of 
the relationship
+ * @param targetIdColumn   column that contains the target node identifier of 
the relationship
+ * @param relationshipType relationship type that is assigned to all 
relationships
+ * @param properties   mapping from property keys to corresponding columns
+ * @since 3.0.0
+ */
+case class RelationshipFrame private[graph] (
+df: DataFrame,
 
 Review comment:
   ```scala
   -df: DataFrame,
   +df: Dataset[Row],
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815075
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/PropertyGraph.scala
 ##
 @@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * A Property Graph as defined by the openCypher Property Graph Data Model.
+ *
+ * A graph is always tied to and managed by a [[CypherSession]].
+ * The lifetime of a graph is bound by the session lifetime.
+ *
+ * @see http://www.opencypher.org/;>openCypher project
+ * @see https://dl.acm.org/citation.cfm?id=3183713.3190657;>Property 
Graph Model
+ * @since 3.0.0
+ */
+abstract class PropertyGraph {
+
+  /**
+   * The schema (graph type) describes the structure of this graph.
+   *
+   * @since 3.0.0
+   */
+  def schema: PropertyGraphType
+
+  /**
+   * The session in which this graph is managed.
+   *
+   * @since 3.0.0
+   */
+  def cypherSession: CypherSession
+
+  /**
+   * Executes a Cypher query in the session that manages this graph, using 
this graph as
+   * the input graph.
+   *
+   * @param query Cypher query to execute
+   * @since 3.0.0
+   */
+  def cypher(query: String): CypherResult = cypher(query, Map.empty[String, 
Any])
+
+  /**
+   * Executes a Cypher query in the session that manages this graph, using 
this graph as
+   * the input graph.
+   *
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(query: String, parameters: Map[String, Any]): CypherResult =
+cypherSession.cypher(this, query, parameters)
+
+  /**
+   * Executes a Cypher query in the [[CypherSession]] that manages this graph, 
using this graph as
+   * the input graph.
+   *
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(query: String, parameters: java.util.Map[String, Object]): 
CypherResult =
+cypherSession.cypher(this, query, parameters)
+
+  /**
+   * Returns the [[NodeFrame]] for a given node label set.
+   *
+   * @param labelSet Label set used for NodeFrame lookup
+   * @return NodeFrame for the given label set
+   * @since 3.0.0
+   */
+  def nodeFrame(labelSet: Array[String]): NodeFrame
+
+  /**
+   * Returns the [[RelationshipFrame]] for a given relationship type.
+   *
+   * @param relationshipType Relationship type used for RelationshipFrame 
lookup
+   * @return RelationshipFrame for the given relationship type
+   * @since 3.0.0
+   */
+  def relationshipFrame(relationshipType: String): RelationshipFrame
+
+  /**
+   * Returns a DataFrame that contains a row for each node in this graph.
+   *
+   * The DataFrame adheres to the following column naming conventions:
+   *
+   * {{{
+   * Id column:`$ID`
+   * Label columns:`:{LABEL_NAME}`
+   * Property columns: `{Property_Key}`
+   * }}}
+   *
+   * @see `org.apache.spark.graph.api.CypherSession.createGraph(nodes, 
relationships)`
+   * @since 3.0.0
+   */
+  def nodes: DataFrame
+
+  /**
+   * Returns a DataFrame that contains a row for each relationship in this
+   * graph.
+   *
+   * The DataFrame adheres to column naming conventions:
+   *
+   * {{{
+   * Id column:`$ID`
+   * SourceId column:  `$SOURCE_ID`
+   * TargetId column:  `$TARGET_ID`
+   * RelType columns:  `:{REL_TYPE}`
+   * Property columns: `{Property_Key}`
+   * }}}
+   *
+   * @see `org.apache.spark.graph.api.CypherSession.createGraph(nodes, 
relationships)`
+   * @since 3.0.0
+   */
+  def relationships: DataFrame
 
 Review comment:
   ```scala
   -  def relationships: DataFrame
   +  def relationships: Dataset[Row]
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815049
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/PropertyGraph.scala
 ##
 @@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * A Property Graph as defined by the openCypher Property Graph Data Model.
+ *
+ * A graph is always tied to and managed by a [[CypherSession]].
+ * The lifetime of a graph is bound by the session lifetime.
+ *
+ * @see http://www.opencypher.org/;>openCypher project
+ * @see https://dl.acm.org/citation.cfm?id=3183713.3190657;>Property 
Graph Model
+ * @since 3.0.0
+ */
+abstract class PropertyGraph {
+
+  /**
+   * The schema (graph type) describes the structure of this graph.
+   *
+   * @since 3.0.0
+   */
+  def schema: PropertyGraphType
+
+  /**
+   * The session in which this graph is managed.
+   *
+   * @since 3.0.0
+   */
+  def cypherSession: CypherSession
+
+  /**
+   * Executes a Cypher query in the session that manages this graph, using 
this graph as
+   * the input graph.
+   *
+   * @param query Cypher query to execute
+   * @since 3.0.0
+   */
+  def cypher(query: String): CypherResult = cypher(query, Map.empty[String, 
Any])
+
+  /**
+   * Executes a Cypher query in the session that manages this graph, using 
this graph as
+   * the input graph.
+   *
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(query: String, parameters: Map[String, Any]): CypherResult =
+cypherSession.cypher(this, query, parameters)
+
+  /**
+   * Executes a Cypher query in the [[CypherSession]] that manages this graph, 
using this graph as
+   * the input graph.
+   *
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(query: String, parameters: java.util.Map[String, Object]): 
CypherResult =
+cypherSession.cypher(this, query, parameters)
+
+  /**
+   * Returns the [[NodeFrame]] for a given node label set.
+   *
+   * @param labelSet Label set used for NodeFrame lookup
+   * @return NodeFrame for the given label set
+   * @since 3.0.0
+   */
+  def nodeFrame(labelSet: Array[String]): NodeFrame
+
+  /**
+   * Returns the [[RelationshipFrame]] for a given relationship type.
+   *
+   * @param relationshipType Relationship type used for RelationshipFrame 
lookup
+   * @return RelationshipFrame for the given relationship type
+   * @since 3.0.0
+   */
+  def relationshipFrame(relationshipType: String): RelationshipFrame
+
+  /**
+   * Returns a DataFrame that contains a row for each node in this graph.
+   *
+   * The DataFrame adheres to the following column naming conventions:
+   *
+   * {{{
+   * Id column:`$ID`
+   * Label columns:`:{LABEL_NAME}`
+   * Property columns: `{Property_Key}`
+   * }}}
+   *
+   * @see `org.apache.spark.graph.api.CypherSession.createGraph(nodes, 
relationships)`
+   * @since 3.0.0
+   */
+  def nodes: DataFrame
 
 Review comment:
   ```scala
   -  def nodes: DataFrame
   +  def nodes: Dataset[Row]
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333815169
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrame.scala
 ##
 @@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.DataFrame
   +import org.apache.spark.sql.{Dataset, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814921
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/NodeFrameBuilder.scala
 ##
 @@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * Interface used to build a [[NodeFrame]].
+ *
+ * @param df DataFrame containing a single node in each row
+ * @since 3.0.0
+ */
+final class NodeFrameBuilder(var df: DataFrame) {
 
 Review comment:
   ```scala
   -final class NodeFrameBuilder(var df: DataFrame) {
   +final class NodeFrameBuilder(var df: Dataset[Row]) {
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814998
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/PropertyGraph.scala
 ##
 @@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.DataFrame
   +import org.apache.spark.sql.{Dataset, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814906
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/NodeFrameBuilder.scala
 ##
 @@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.DataFrame
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.DataFrame
   +import org.apache.spark.sql.{Dataset, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814839
 
 

 ##
 File path: graph/api/src/main/scala/org/apache/spark/graph/api/NodeFrame.scala
 ##
 @@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.DataFrame
   +import org.apache.spark.sql.{Dataset, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814876
 
 

 ##
 File path: graph/api/src/main/scala/org/apache/spark/graph/api/NodeFrame.scala
 ##
 @@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * Describes how to map a DataFrame to nodes.
+ *
+ * Each row in the DataFrame represents a node which has exactly the labels 
defined by the given
+ * label set.
+ *
+ * @param df DataFrame containing a single node in each row
+ * @param idColumn   column that contains the node identifier
+ * @param labelSet   labels that are assigned to all nodes
+ * @param properties mapping from property keys to corresponding columns
+ * @since 3.0.0
+ */
+case class NodeFrame private[graph] (
+df: DataFrame,
 
 Review comment:
   ```scala
   -df: DataFrame,
   +df: Dataset[Row],
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814769
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/GraphElementFrame.scala
 ##
 @@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.DataFrame
   +import org.apache.spark.sql.{Dataset, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814743
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala
 ##
 @@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.types.{BooleanType, StructType}
+
+/**
+ * Contains constants used for convention based column naming.
+ */
+object CypherSession {
+
+  /**
+   * Naming convention for identifier columns, both node and relationship 
identifiers.
+   */
+  val ID_COLUMN = "$ID"
+
+  /**
+   * Naming convention for relationship source identifier.
+   */
+  val SOURCE_ID_COLUMN = "$SOURCE_ID"
+
+  /**
+   * Naming convention for relationship target identifier.
+   */
+  val TARGET_ID_COLUMN = "$TARGET_ID"
+
+  /**
+   * Naming convention both for node label and relationship type prefixes.
+   */
+  val LABEL_COLUMN_PREFIX = ":"
+}
+
+/**
+ * A CypherSession allows for creating, storing and loading [[PropertyGraph]] 
instances as well as
+ * executing Cypher queries on them.
+ *
+ * Wraps a [[org.apache.spark.sql.SparkSession]].
+ *
+ * @since 3.0.0
+ */
+trait CypherSession extends Logging {
+
+  def sparkSession: SparkSession
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * @param graph [[PropertyGraph]] on which the query is executed
+   * @param query Cypher query to execute
+   * @since 3.0.0
+   */
+  def cypher(graph: PropertyGraph, query: String): CypherResult
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * Note that queries can take optional parameters:
+   *
+   * {{{
+   * Parameters:
+   *
+   * {
+   *"name" : "Alice"
+   * }
+   *
+   * Query:
+   *
+   * MATCH (n:Person)
+   * WHERE n.name = $name
+   * RETURN n
+   * }}}
+   *
+   * @param graph  [[PropertyGraph]] on which the query is executed
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(graph: PropertyGraph, query: String, parameters: Map[String, 
Any]): CypherResult
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * Note that queries can take optional parameters:
+   *
+   * {{{
+   * Parameters:
+   *
+   * {
+   *"name" : "Alice"
+   * }
+   *
+   * Query:
+   *
+   * MATCH (n:Person)
+   * WHERE n.name = $name
+   * RETURN n
+   * }}}
+   *
+   * @param graph  [[PropertyGraph]] on which the query is executed
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(
+  graph: PropertyGraph,
+  query: String,
+  parameters: java.util.Map[String, Object]): CypherResult = {
+cypher(graph, query, parameters.asScala.toMap)
+  }
+
+  /**
+   * Creates a [[PropertyGraph]] from a sequence of [[NodeFrame]]s and 
[[RelationshipFrame]]s.
+   * At least one [[NodeFrame]] has to be provided.
+   *
+   * For each label set and relationship type there can be at most one 
[[NodeFrame]] and at most one
+   * [[RelationshipFrame]], respectively.
+   *
+   * @param nodes NodeFrames that define the nodes in the graph
+   * @param relationships RelationshipFrames that define the relationships in 
the graph
+   * @since 3.0.0
+   */
+  def createGraph(nodes: Array[NodeFrame], relationships: 
Array[RelationshipFrame]): PropertyGraph
+
+  /**
+   * Creates a [[PropertyGraph]] from nodes and relationships.
+   *
+   * The given DataFrames need to adhere to the following column naming 
conventions:
+   *
+   * {{{
+   * Id column:`$ID`(nodes and relationships)
+   * SourceId column:  `$SOURCE_ID` (relationships)
+   * TargetId column:  `$TARGET_ID` (relationships)
+   *
+   * Label columns:`:{LABEL_NAME}`  (nodes)
+   * RelType columns:

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814802
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/GraphElementFrame.scala
 ##
 @@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * A [[PropertyGraph]] is created from GraphElementFrames.
+ *
+ * A graph element is either a node or a relationship.
+ * A GraphElementFrame wraps a DataFrame and describes how it maps to graph 
elements.
+ *
+ * @since 3.0.0
+ */
+abstract class GraphElementFrame {
+
+  /**
+   * Initial DataFrame that can still contain unmapped, arbitrarily ordered 
columns.
+   *
+   * @since 3.0.0
+   */
+  def df: DataFrame
 
 Review comment:
   ```scala
   -  def df: DataFrame
   +  def df: Dataset[Row]
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814672
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala
 ##
 @@ -0,0 +1,186 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+
+object CypherSession {
+  val ID_COLUMN = "$ID"
+  val SOURCE_ID_COLUMN = "$SOURCE_ID"
+  val TARGET_ID_COLUMN = "$TARGET_ID"
+  val LABEL_COLUMN_PREFIX = ":"
+}
+
+/**
+ * The entry point for using property graphs in Spark.
+ *
+ * Provides factory methods for creating [[PropertyGraph]] instances.
+ *
+ * Wraps a [[org.apache.spark.sql.SparkSession]].
+ *
+ * @since 3.0.0
+ */
+trait CypherSession {
+
+  def sparkSession: SparkSession
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * @param graph [[PropertyGraph]] on which the query is executed
+   * @param query Cypher query to execute
+   * @since 3.0.0
+   */
+  def cypher(graph: PropertyGraph, query: String): CypherResult
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * @param graph  [[PropertyGraph]] on which the query is executed
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(graph: PropertyGraph, query: String, parameters: Map[String, 
Any]): CypherResult
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * @param graph  [[PropertyGraph]] on which the query is executed
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(graph: PropertyGraph,
+ query: String,
+ parameters: java.util.Map[String, Object]): CypherResult = {
+cypher(graph, query, parameters.asScala.toMap)
+  }
+
+  /**
+   * Creates a [[PropertyGraph]] from a sequence of [[NodeFrame]]s and 
[[RelationshipFrame]]s.
+   * At least one [[NodeFrame]] has to be provided.
+   *
+   * For each label set and relationship type there can be at most one 
[[NodeFrame]] and at most one
+   * [[RelationshipFrame]], respectively.
+   *
+   * @param nodes NodeFrames that define the nodes in the graph
+   * @param relationships RelationshipFrames that define the relationships in 
the graph
+   * @since 3.0.0
+   */
+  def createGraph(nodes: Seq[NodeFrame], relationships: 
Seq[RelationshipFrame]): PropertyGraph
+
+  /**
+   * Creates a [[PropertyGraph]] from a sequence of [[NodeFrame]]s and 
[[RelationshipFrame]]s.
+   * At least one [[NodeFrame]] has to be provided.
+   *
+   * For each label set and relationship type there can be at most one 
[[NodeFrame]] and at most one
+   * [[RelationshipFrame]], respectively.
+   *
+   * @param nodes NodeFrames that define the nodes in the graph
+   * @param relationships RelationshipFrames that define the relationships in 
the graph
+   * @since 3.0.0
+   */
+  def createGraph(
+  nodes: java.util.List[NodeFrame],
+  relationships: java.util.List[RelationshipFrame]): PropertyGraph = {
+createGraph(nodes.asScala, relationships.asScala)
+  }
+
+  /**
+   * Creates a [[PropertyGraph]] from nodes and relationships.
+   *
+   * The given DataFrames need to adhere to the following column naming 
conventions:
+   *
+   * {{{
+   * Id column:`$ID`(nodes and relationships)
+   * SourceId column:  `$SOURCE_ID` (relationships)
+   * TargetId column:  `$TARGET_ID` (relationships)
+   *
+   * Label columns:`:{LABEL_NAME}`  (nodes)
+   * RelType columns:  `:{REL_TYPE}`(relationships)
+   *
+   * Property columns: `{Property_Key}` (nodes and relationships)
+   * }}}
+   *
+   * @see [[CypherSession]]
+   * @param nodes node DataFrame
+   * @param relationships relationship DataFrame
+   * @since 3.0.0
+   */
+  def createGraph(nodes: DataFrame, relationships: DataFrame): PropertyGraph = 
{
 
 Review comment:
   ```scala
   -

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814596
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala
 ##
 @@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
 
 Review comment:
   Actually, the following for the final update.
   ```scala
   -import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
   +import org.apache.spark.sql.{Dataset, Row, SparkSession}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814713
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala
 ##
 @@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
+import org.apache.spark.sql.types.{BooleanType, StructType}
+
+/**
+ * Contains constants used for convention based column naming.
+ */
+object CypherSession {
+
+  /**
+   * Naming convention for identifier columns, both node and relationship 
identifiers.
+   */
+  val ID_COLUMN = "$ID"
+
+  /**
+   * Naming convention for relationship source identifier.
+   */
+  val SOURCE_ID_COLUMN = "$SOURCE_ID"
+
+  /**
+   * Naming convention for relationship target identifier.
+   */
+  val TARGET_ID_COLUMN = "$TARGET_ID"
+
+  /**
+   * Naming convention both for node label and relationship type prefixes.
+   */
+  val LABEL_COLUMN_PREFIX = ":"
+}
+
+/**
+ * A CypherSession allows for creating, storing and loading [[PropertyGraph]] 
instances as well as
+ * executing Cypher queries on them.
+ *
+ * Wraps a [[org.apache.spark.sql.SparkSession]].
+ *
+ * @since 3.0.0
+ */
+trait CypherSession extends Logging {
+
+  def sparkSession: SparkSession
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * @param graph [[PropertyGraph]] on which the query is executed
+   * @param query Cypher query to execute
+   * @since 3.0.0
+   */
+  def cypher(graph: PropertyGraph, query: String): CypherResult
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * Note that queries can take optional parameters:
+   *
+   * {{{
+   * Parameters:
+   *
+   * {
+   *"name" : "Alice"
+   * }
+   *
+   * Query:
+   *
+   * MATCH (n:Person)
+   * WHERE n.name = $name
+   * RETURN n
+   * }}}
+   *
+   * @param graph  [[PropertyGraph]] on which the query is executed
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(graph: PropertyGraph, query: String, parameters: Map[String, 
Any]): CypherResult
+
+  /**
+   * Executes a Cypher query on the given input graph.
+   *
+   * Note that queries can take optional parameters:
+   *
+   * {{{
+   * Parameters:
+   *
+   * {
+   *"name" : "Alice"
+   * }
+   *
+   * Query:
+   *
+   * MATCH (n:Person)
+   * WHERE n.name = $name
+   * RETURN n
+   * }}}
+   *
+   * @param graph  [[PropertyGraph]] on which the query is executed
+   * @param query  Cypher query to execute
+   * @param parameters parameters used by the Cypher query
+   * @since 3.0.0
+   */
+  def cypher(
+  graph: PropertyGraph,
+  query: String,
+  parameters: java.util.Map[String, Object]): CypherResult = {
+cypher(graph, query, parameters.asScala.toMap)
+  }
+
+  /**
+   * Creates a [[PropertyGraph]] from a sequence of [[NodeFrame]]s and 
[[RelationshipFrame]]s.
+   * At least one [[NodeFrame]] has to be provided.
+   *
+   * For each label set and relationship type there can be at most one 
[[NodeFrame]] and at most one
+   * [[RelationshipFrame]], respectively.
+   *
+   * @param nodes NodeFrames that define the nodes in the graph
+   * @param relationships RelationshipFrames that define the relationships in 
the graph
+   * @since 3.0.0
+   */
+  def createGraph(nodes: Array[NodeFrame], relationships: 
Array[RelationshipFrame]): PropertyGraph
+
+  /**
+   * Creates a [[PropertyGraph]] from nodes and relationships.
+   *
+   * The given DataFrames need to adhere to the following column naming 
conventions:
+   *
+   * {{{
+   * Id column:`$ID`(nodes and relationships)
+   * SourceId column:  `$SOURCE_ID` (relationships)
+   * TargetId column:  `$TARGET_ID` (relationships)
+   *
+   * Label columns:`:{LABEL_NAME}`  (nodes)
+   * RelType columns:

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814508
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/CypherResult.scala
 ##
 @@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * Result of a Cypher query.
+ *
+ * Wraps a DataFrame that contains the result rows.
+ *
+ * @since 3.0.0
+ */
+trait CypherResult {
+
+  /**
+   * Contains the result rows.
+   *
+   * The column names are aligned with the return item names specified within 
the Cypher query,
+   * (e.g. `RETURN foo, bar AS baz` results in the columns `foo` and `baz`).
+   *
+   * @note Dot characters (i.e. `.`) within return item names are replaced by 
an underscore (`_`),
+   *   (e.g. `MATCH (n:Person) RETURN n` results in the columns `n`, 
`n:Person` and `n_name`).
+   * @since 3.0.0
+   */
+  def df: DataFrame
 
 Review comment:
   ```scala
   -  def df: DataFrame
   +  def df: Dataset[Row]
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333814464
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/CypherResult.scala
 ##
 @@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.DataFrame
   +import org.apache.spark.sql.{Dataset, Row}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333813221
 
 

 ##
 File path: 
graph/api/src/test/scala/org/apache/spark/graph/api/PropertyGraphSuite.scala
 ##
 @@ -0,0 +1,309 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.scalatest.Matchers
+
+import org.apache.spark.graph.api.CypherSession.{ID_COLUMN, 
LABEL_COLUMN_PREFIX, SOURCE_ID_COLUMN, TARGET_ID_COLUMN}
+import org.apache.spark.sql.{DataFrame, QueryTest}
+import org.apache.spark.sql.test.SharedSparkSession
+
+abstract class PropertyGraphSuite extends QueryTest with SharedSparkSession 
with Matchers {
+
+  // Override in spark-cypher
 
 Review comment:
   There is not much information about this. Could you explain about this a 
little bit more?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333812890
 
 

 ##
 File path: 
graph/api/src/test/java/org/apache/spark/graph/api/JavaPropertyGraphSuite.java
 ##
 @@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api;
+
+import com.google.common.collect.Lists;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.test.TestSparkSession;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import static org.apache.spark.sql.types.DataTypes.*;
+
+public abstract class JavaPropertyGraphSuite implements Serializable {
+  private transient TestSparkSession spark;
+  private transient CypherSession cypherSession;
+
+  abstract CypherSession getCypherSession(SparkSession sparkSession);
+
+  @Before
+  public void setUp() {
+spark = new TestSparkSession();
+cypherSession = getCypherSession(spark);
+  }
+
+  @After
+  public void tearDown() {
+spark.stop();
+spark = null;
+  }
+
+  @Test
+  public void testCreateFromNodeFrame() {
+StructType personSchema = createSchema(
+Lists.newArrayList("id", "name"),
 
 Review comment:
   We use 2-space indentation in Java code. Please update all the other 
instances. You can reference the existing code, `Java8APISuite.java`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page

2019-10-10 Thread GitBox

liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for 
The Executors Tab's column names in the Spark history server Page
URL: https://github.com/apache/spark/pull/25994#issuecomment-540883751
 
 
   Ok, I see. We will wait quietly. Thank you all!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333811998
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/RelationshipFrame.scala
 ##
 @@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import org.apache.spark.sql.DataFrame
+
+/**
+ * Describes how to map a DataFrame to relationships.
+ *
+ * Each row in the DataFrame represents a relationship with the given 
relationship type.
+ *
+ * @param df   DataFrame containing a single relationship in each 
row
+ * @param idColumn column that contains the relationship identifier
+ * @param sourceIdColumn   column that contains the source node identifier of 
the relationship
+ * @param targetIdColumn   column that contains the target node identifier of 
the relationship
+ * @param relationshipType relationship type that is assigned to all 
relationships
+ * @param properties   mapping from property keys to corresponding columns
+ * @since 3.0.0
+ */
+case class RelationshipFrame private[graph] (
+df: DataFrame,
+idColumn: String,
+sourceIdColumn: String,
+targetIdColumn: String,
+relationshipType: String,
+properties: Map[String, String])
+extends GraphElementFrame {
 
 Review comment:
   Indentation?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon edited a comment on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-10-10 Thread GitBox

HyukjinKwon edited a comment on issue #24405: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#issuecomment-540881610
 
 
   @giamo mind updating PR? Sorry for my late response. Looks like we're 
getting closer to merge.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page

2019-10-10 Thread GitBox

wangyum commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The 
Executors Tab's column names in the Spark history server Page
URL: https://github.com/apache/spark/pull/25994#issuecomment-540881448
 
 
   Sorry @liucht-inspur This has nothing to do with authorization:
   
http://apache-spark-developers-list.1001551.n3.nabble.com/build-system-IMPORTANT-northern-california-fire-danger-potential-power-outage-s-td28059.html


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-10-10 Thread GitBox

AmplabJenkins removed a comment on issue #24405: [SPARK-27506][SQL] Allow 
deserialization of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#issuecomment-531894129
 
 
   Can one of the admins verify this patch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-10-10 Thread GitBox

HyukjinKwon commented on issue #24405: [SPARK-27506][SQL] Allow deserialization 
of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#issuecomment-540881610
 
 
   @giamo mind updating PR? looks like we're getting closer to merge.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #24851: [SPARK-27303][GRAPH] Add Spark Graph API

2019-10-10 Thread GitBox

dongjoon-hyun commented on a change in pull request #24851: 
[SPARK-27303][GRAPH] Add Spark Graph API
URL: https://github.com/apache/spark/pull/24851#discussion_r333811076
 
 

 ##
 File path: 
graph/api/src/main/scala/org/apache/spark/graph/api/CypherSession.scala
 ##
 @@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graph.api
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
 
 Review comment:
   ```scala
   -import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
   +import org.apache.spark.sql.{DataFrame, SparkSession}
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on issue #24405: [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas

2019-10-10 Thread GitBox

HyukjinKwon commented on issue #24405: [SPARK-27506][SQL] Allow deserialization 
of Avro data using compatible schemas
URL: https://github.com/apache/spark/pull/24405#issuecomment-540881369
 
 
   ok to test


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] 
Improve the MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#discussion_r333810252
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
 ##
 @@ -807,13 +808,17 @@ private[spark] object MapOutputTracker extends Logging {
   private val BROADCAST = 1
 
   // Serialize an array of map output locations into an efficient byte format 
so that we can send
-  // it to reduce tasks. We do this by compressing the serialized bytes using 
GZIP. They will
+  // it to reduce tasks. We do this by compressing the serialized bytes using 
Zstd. They will
   // generally be pretty compressible because many map outputs will be on the 
same hostname.
   def serializeMapStatuses(statuses: Array[MapStatus], broadcastManager: 
BroadcastManager,
   isLocal: Boolean, minBroadcastSize: Int): (Array[Byte], 
Broadcast[Array[Byte]]) = {
-val out = new ByteArrayOutputStream
-out.write(DIRECT)
-val objOut = new ObjectOutputStream(new GZIPOutputStream(out))
+import scala.language.reflectiveCalls
+val out = new ByteArrayOutputStream(4096) {
 
 Review comment:
   Just curious, is there any specific reason to choose 4096 here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] 
Improve the MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#discussion_r333809995
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
 ##
 @@ -20,8 +20,9 @@ package org.apache.spark
 import java.io._
 import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue, 
ThreadPoolExecutor, TimeUnit}
 import java.util.concurrent.locks.ReentrantReadWriteLock
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
 
+import com.github.luben.zstd.ZstdInputStream
+import com.github.luben.zstd.ZstdOutputStream
 
 Review comment:
   IIRC, these imports should fall into third party imports, thus under the 
scala imports according to the coding style.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

advancedxy commented on a change in pull request #26085: [SPARK-29434] [Core] 
Improve the MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#discussion_r333809807
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
 ##
 @@ -822,18 +827,43 @@ private[spark] object MapOutputTracker extends Logging {
 } {
   objOut.close()
 }
-val arr = out.toByteArray
+
+val arr: Array[Byte] = {
+  val compressedOut = new ByteArrayOutputStream(4096)
+  val zos = new ZstdOutputStream(compressedOut)
+  Utils.tryWithSafeFinally {
+compressedOut.write(DIRECT)
+zos.write(out.getBuf, 0, out.size())
 
 Review comment:
   I think we can replace this call with `out.writeTo(zos)`?  So that we can 
avoid exposing the `buf` directly.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

dongjoon-hyun commented on issue #26085: [SPARK-29434] [Core] Improve the 
MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#issuecomment-540880356
 
 
   Thank you for pinging me, @dbtsai . I'll take a look tomorrow.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] liucht-inspur edited a comment on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page

2019-10-10 Thread GitBox

liucht-inspur edited a comment on issue #25994: [SPARK-29323][WEBUI] Add 
tooltip for The Executors Tab's column names in the Spark history server Page
URL: https://github.com/apache/spark/pull/25994#issuecomment-540864157
 
 
   cc @wangyum We are waiting for spark's Jenkins to test,can you help to 
authorize please?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] wangyum commented on a change in pull request #26055: [SPARK-29368][SQL][TEST] Port interval.sql

2019-10-10 Thread GitBox

wangyum commented on a change in pull request #26055: [SPARK-29368][SQL][TEST] 
Port interval.sql
URL: https://github.com/apache/spark/pull/26055#discussion_r333806629
 
 

 ##
 File path: sql/core/src/test/resources/sql-tests/inputs/postgreSQL/interval.sql
 ##
 @@ -0,0 +1,330 @@
+--
+-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+--
+--
+-- INTERVAL
+-- 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/interval.sql
+
+-- SET DATESTYLE = 'ISO';
+-- SET IntervalStyle to postgres;
+
+-- check acceptance of "time zone style"
+-- [SPARK-29369] Accept strings without `interval` prefix in casting to 
intervals
+-- [SPARK-29370] Interval strings without explicit unit markings
+-- SELECT INTERVAL '01:00' AS `One hour`;
+-- SELECT INTERVAL '+02:00' AS `Two hours`;
+-- SELECT INTERVAL '-08:00' AS `Eight hours`;
+-- SELECT INTERVAL '-1 +02:03' AS `22 hours ago...`;
+-- SELECT INTERVAL '-1 days +02:03' AS `22 hours ago...`;
+-- [SPARK-29371] Support interval field values with fractional parts
+-- SELECT INTERVAL '1.5 weeks' AS `Ten days twelve hours`;
+-- SELECT INTERVAL '1.5 months' AS `One month 15 days`;
+-- SELECT INTERVAL '10 years -11 month -12 days +13:14' AS `9 years...`;
+
+-- [SPARK-29382] Support the `INTERVAL` type by Parquet datasource
 
 Review comment:
   We do not support `CREATE TABLE t(f1 int)` in `sql/core` module:
   ```
   -- !query 0
   CREATE TABLE t(f1 int)
   -- !query 0 schema
   struct<>
   -- !query 0 output
   org.apache.spark.sql.AnalysisException
   Hive support is required to CREATE Hive TABLE (AS SELECT);
   ```
   So I added `USING parquet` everywhere. How about change it to?
   ```
   -- [SPARK-29382] Support writing `INTERVAL` type to datasource table
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #19955: [SPARK-21867][CORE] Support async spilling in UnsafeShuffleWriter

2019-10-10 Thread GitBox

HyukjinKwon closed pull request #19955: [SPARK-21867][CORE] Support async 
spilling in UnsafeShuffleWriter
URL: https://github.com/apache/spark/pull/19955
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon edited a comment on issue #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory

2019-10-10 Thread GitBox

HyukjinKwon edited a comment on issue #25398: [SPARK-28659][SQL] Use data 
source if convertible in insert overwrite directory
URL: https://github.com/apache/spark/pull/25398#issuecomment-540870091
 
 
   If you only target to fix Hive ser/de to respect compression, why don't you 
set Hive compression properly?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on issue #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory

2019-10-10 Thread GitBox

HyukjinKwon commented on issue #25398: [SPARK-28659][SQL] Use data source if 
convertible in insert overwrite directory
URL: https://github.com/apache/spark/pull/25398#issuecomment-540870091
 
 
   If you only target to fix Hive ser/de to respect configuration, why don't 
you set Hive compression properly?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory

2019-10-10 Thread GitBox

HyukjinKwon commented on a change in pull request #25398: [SPARK-28659][SQL] 
Use data source if convertible in insert overwrite directory
URL: https://github.com/apache/spark/pull/25398#discussion_r333802719
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
 ##
 @@ -1396,6 +1396,16 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder(conf) {
   compressed = false,
   properties = rowStorage.properties ++ fileStorage.properties)
 
-(ctx.LOCAL != null, storage, Some(DDLUtils.HIVE_PROVIDER))
+val fileFormat = extractFileFormat(fileStorage.serde)
+(ctx.LOCAL != null, storage, Some(fileFormat))
+  }
+
+  private def extractFileFormat(serde: Option[String]): String = {
 
 Review comment:
   @Udbhav30, I more meant this seems not the right place to replace. You 
should add a configuration, and such replacement should be done in 
analysis/optimizing, not in the parser.
   
   If that's the case, then why don't you use `USING file_format` explicitly? 
Can you please describe what this PR target to fix clearly? Spark lately added 
a new PR template 
(https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE). It 
might be better to follow this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #25398: [SPARK-28659][SQL] Use data source if convertible in insert overwrite directory

2019-10-10 Thread GitBox

HyukjinKwon commented on a change in pull request #25398: [SPARK-28659][SQL] 
Use data source if convertible in insert overwrite directory
URL: https://github.com/apache/spark/pull/25398#discussion_r333802719
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
 ##
 @@ -1396,6 +1396,16 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder(conf) {
   compressed = false,
   properties = rowStorage.properties ++ fileStorage.properties)
 
-(ctx.LOCAL != null, storage, Some(DDLUtils.HIVE_PROVIDER))
+val fileFormat = extractFileFormat(fileStorage.serde)
+(ctx.LOCAL != null, storage, Some(fileFormat))
+  }
+
+  private def extractFileFormat(serde: Option[String]): String = {
 
 Review comment:
   @Udbhav30, I more meant this seems not the right place to replace. You 
should add a configuration, and such replacement should be done in 
analysis/optimizing, not in the parser.
   
   If that's the case, then why don't you use `USING file_format` explicitly? 
Can you please describe what this PR target to fix clearly?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #25961: [SPARK-29284][SQL] Adaptive query execution works correct…

2019-10-10 Thread GitBox

HyukjinKwon closed pull request #25961: [SPARK-29284][SQL] Adaptive query 
execution works correct…
URL: https://github.com/apache/spark/pull/25961
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on issue #25961: [SPARK-29284][SQL] Adaptive query execution works correct…

2019-10-10 Thread GitBox

HyukjinKwon commented on issue #25961: [SPARK-29284][SQL] Adaptive query 
execution works correct…
URL: https://github.com/apache/spark/pull/25961#issuecomment-540865669
 
 
   @hn5092, please identify which PR fixed this, and see if it's feasible to 
backport.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for The Executors Tab's column names in the Spark history server Page

2019-10-10 Thread GitBox

liucht-inspur commented on issue #25994: [SPARK-29323][WEBUI] Add tooltip for 
The Executors Tab's column names in the Spark history server Page
URL: https://github.com/apache/spark/pull/25994#issuecomment-540864157
 
 
   cc @wangyum 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on issue #25914: [SPARK-29227][SS]Track rule info in optimization phase

2019-10-10 Thread GitBox

HyukjinKwon commented on issue #25914: [SPARK-29227][SS]Track rule info in 
optimization phase
URL: https://github.com/apache/spark/pull/25914#issuecomment-540863339
 
 
   ok to test


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #25914: [SPARK-29227][SS]Track rule info in optimization phase

2019-10-10 Thread GitBox

AmplabJenkins removed a comment on issue #25914: [SPARK-29227][SS]Track rule 
info in optimization phase
URL: https://github.com/apache/spark/pull/25914#issuecomment-534461235
 
 
   Can one of the admins verify this patch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk @jkremser @liyinan926  There are many operator 
projects out there but it does not make sense to make them part of the core 
project. Same for the Flink project (https://github.com/lyft/flinkk8soperator 
vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other 
PRs people complained about the code base being to large so this is another 
argument to avoid adding a lot of "external" stuff. Moreover the resource 
manager stuff logically belong outside of this project but for historical 
reasons they were kept here since the effort for the common  resource manager 
API never happened. I also find go libs superior for the purpose of writing 
operators and there have been issues with the fabric8io client eg. 
https://github.com/fabric8io/kubernetes-client/issues/1384. In addition, the 
java operator uses standalone mode and I think k8s native integration is more 
efficient. Anyway I believe this should be another external project as many 
others. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk @jkremser @liyinan926  There are many operator 
projects out there but it does not make sense to make them part of the core 
project. Same for the Flink project (https://github.com/lyft/flinkk8soperator 
vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other 
PRs people complained about the code base being to large so this is another 
argument to avoid adding a lot of "external" stuff. Moreover the resource 
manager stuff logically belong outside of this project but for historical 
reasons they were kept here since the effort for the common  resource manager 
API never happened. I also find go libs superior for the purpose of writing 
operators and there have been issues with the fabric8io client eg. 
https://github.com/fabric8io/kubernetes-client/issues/1384. In addition, the 
java operator uses standalone mode and I think k8s native integration is more 
efficient .Anyway I believe this should be another external project as many 
others. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk @jkremser @liyinan926  There are many operator 
projects out there but it does not make sense to make them part of the core 
project. Same for the Flink project (https://github.com/lyft/flinkk8soperator 
vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other 
PRs people complained about the code base being to large so this is another 
argument to avoid adding a lot of "external" stuff. Moreover the resource 
manager stuff logically belong outside of this project but for historical 
reasons they were kept here since the effort for the common  resource manager 
API never happened. I also find go libs superior for the purpose of writing 
operators and there have been issues with the fabric8io client eg. 
https://github.com/fabric8io/kubernetes-client/issues/1384.Anyway I believe 
this should be another external project as many others. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AngersZhuuuu commented on issue #25721: [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9

2019-10-10 Thread GitBox

AngersZh commented on issue #25721: [WIP][SPARK-29018][SQL] Implement Spark 
Thrift Server with it's own code base on PROTOCOL_VERSION_V9
URL: https://github.com/apache/spark/pull/25721#issuecomment-540861739
 
 
   @gatorsmile  Thanks for your attention. 
   Since there have been a lot of changes about sql/hive-thriftserver recently, 
 so I stop this and work for other issue.I will follow @juliuszsompolski 's 
comment and fit it to current. 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk @jkremser @liyinan926  There are many operator 
projects out there but it does not make sense to make them part of the core 
project. Same for the Flink project (https://github.com/lyft/flinkk8soperator 
vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other 
PRs people complained about the code base being to large so this is another 
argument to avoid adding a lot of "external" stuff. Moreover the resource 
manager stuff logically belong outside of this project but for historical 
reasons they were kept here since the effort for the common  resource manager 
API never happened. I also find go libs superior for the purpose of writing 
operators and there have been issues with the fabric8io client eg. 
https://github.com/fabric8io/kubernetes-client/issues/1384. Anyway I believe 
this should be another external project as many others. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk @jkremser @liyinan926  There are many operator 
projects out there but it does not make sense to make them part of the core 
project. Same for the Flink project (https://github.com/lyft/flinkk8soperator 
vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other 
PRs people complained about the code base being to large so this is another 
argument to avoid adding a lot of "external" stuff. Moreover the resource 
manager stuff logically belong outside of this project but for historical 
reasons they were kept here since the effort for the common  resource manager 
API never happened. I also find go libs superior for the purpose of writing 
operators and there have been issues with the fabric8io client, I can provide 
more details. Anyway I believe this should be another external project as many 
others. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk @jkremser @liyinan926  There are many operator 
projects out there but it does not make sense to make them part of the core 
project. Same for the Flink project (https://github.com/lyft/flinkk8soperator 
vs https://github.com/GoogleCloudPlatform/flink-on-k8s-operator). Btw in other 
PRs people complained about the code base being to large so this is another 
argument to avoid adding a lot of "external" stuff. Moreover the resource 
manager stuff logically belong outside of this project but for historical 
reasons they were kept here since the effort for the common  resource manager 
API never happened. 
   I also find go libs superior for the purpose of writing operators and there 
have been issues with the fabric8io client, I can provide more details. So I 
believe this should be another external project. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk @jkremser @liyinan926  There are many operator 
projects out there but does not make sense to be part of the core project. Same 
for the Flink project. Btw in other PRs people complained about the code base 
being to large so this is another argument to avoid adding a lot of "external" 
stuff. Moreover the resource manager stuff logically belong outside of this 
project but for historical reasons they were kept here since the effort for the 
common  resource manager API never happened. 
   I also find go libs superior for the purpose of writing operators and there 
have been issues with the fabric8io client, I can provide more details. So I 
believe this should be another external project. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto edited a comment on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto edited a comment on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk @jkremser  There are many operator projects out 
there but does not make sense to be part of the core project. Same for the 
Flink project. Btw in other PRs people complained about the code base being to 
large so this is another argument to avoid adding a lot of "external" stuff. 
Moreover the resource manager stuff logically belong outside of this project 
but for historical reasons they were kept here since the effort for the common  
resource manager API never happened. 
   I also find go libs superior for the purpose of writing operators and there 
have been issues with the fabric8io client, I can provide more details. So I 
believe this should be another external project. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] skonto commented on issue #26075: [WIP][K8S] Spark operator

2019-10-10 Thread GitBox

skonto commented on issue #26075: [WIP][K8S] Spark operator
URL: https://github.com/apache/spark/pull/26075#issuecomment-540858668
 
 
   @dongjoon-hyun  @holdenk There are many operator projects out there but does 
not make sense to be part of the core project. Same for the Flink project. Btw 
in other PRs people complained about the code base being to large so this is 
another argument to avoid adding a lot of "external" stuff. Moreover the 
resource manager stuff logically belong outside of this project but for 
historical reasons they were kept here since the effort for the common  
resource manager API never happened. 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] BryanCutler commented on issue #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide

2019-10-10 Thread GitBox

BryanCutler commented on issue #26045: [SPARK-29367][DOC] Add compatibility 
note for Arrow 0.15.0 to SQL guide
URL: https://github.com/apache/spark/pull/26045#issuecomment-540857594
 
 
   Thanks @HyukjinKwon and others for reviewing!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a change in pull request #26055: [SPARK-29368][SQL][TEST] Port interval.sql

2019-10-10 Thread GitBox

MaxGekk commented on a change in pull request #26055: [SPARK-29368][SQL][TEST] 
Port interval.sql
URL: https://github.com/apache/spark/pull/26055#discussion_r333788660
 
 

 ##
 File path: sql/core/src/test/resources/sql-tests/inputs/postgreSQL/interval.sql
 ##
 @@ -0,0 +1,330 @@
+--
+-- Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+--
+--
+-- INTERVAL
+-- 
https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/interval.sql
+
+-- SET DATESTYLE = 'ISO';
+-- SET IntervalStyle to postgres;
+
+-- check acceptance of "time zone style"
+-- [SPARK-29369] Accept strings without `interval` prefix in casting to 
intervals
+-- [SPARK-29370] Interval strings without explicit unit markings
+-- SELECT INTERVAL '01:00' AS `One hour`;
+-- SELECT INTERVAL '+02:00' AS `Two hours`;
+-- SELECT INTERVAL '-08:00' AS `Eight hours`;
+-- SELECT INTERVAL '-1 +02:03' AS `22 hours ago...`;
+-- SELECT INTERVAL '-1 days +02:03' AS `22 hours ago...`;
+-- [SPARK-29371] Support interval field values with fractional parts
+-- SELECT INTERVAL '1.5 weeks' AS `Ten days twelve hours`;
+-- SELECT INTERVAL '1.5 months' AS `One month 15 days`;
+-- SELECT INTERVAL '10 years -11 month -12 days +13:14' AS `9 years...`;
+
+-- [SPARK-29382] Support the `INTERVAL` type by Parquet datasource
 
 Review comment:
   @dongjoon-hyun please, clarify this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

MaxGekk commented on a change in pull request #26085: [SPARK-29434] [Core] 
Improve the MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#discussion_r333788013
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala
 ##
 @@ -333,4 +333,63 @@ class MapOutputTrackerSuite extends SparkFunSuite {
 rpcEnv.shutdown()
   }
 
+  ignore("Benchmarking `MapOutputTracker.serializeMapStatuses`") {
 
 Review comment:
   Are there any reasons for it is not implemented as regular benchmark using 
the `Benchmark` class?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide

2019-10-10 Thread GitBox

HyukjinKwon closed pull request #26045: [SPARK-29367][DOC] Add compatibility 
note for Arrow 0.15.0 to SQL guide
URL: https://github.com/apache/spark/pull/26045
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on issue #26045: [SPARK-29367][DOC] Add compatibility note for Arrow 0.15.0 to SQL guide

2019-10-10 Thread GitBox

HyukjinKwon commented on issue #26045: [SPARK-29367][DOC] Add compatibility 
note for Arrow 0.15.0 to SQL guide
URL: https://github.com/apache/spark/pull/26045#issuecomment-540849488
 
 
   Merged to master.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] gatorsmile commented on issue #25721: [WIP][SPARK-29018][SQL] Implement Spark Thrift Server with it's own code base on PROTOCOL_VERSION_V9

2019-10-10 Thread GitBox

gatorsmile commented on issue #25721: [WIP][SPARK-29018][SQL] Implement Spark 
Thrift Server with it's own code base on PROTOCOL_VERSION_V9
URL: https://github.com/apache/spark/pull/25721#issuecomment-540845787
 
 
   @AngersZh @wangyum Could you address the comment and move it forward?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] gatorsmile commented on a change in pull request #24922: [SPARK-28120][SS] Rocksdb state storage implementation

2019-10-10 Thread GitBox

gatorsmile commented on a change in pull request #24922: [SPARK-28120][SS]  
Rocksdb state storage implementation
URL: https://github.com/apache/spark/pull/24922#discussion_r333772305
 
 

 ##
 File path: sql/core/pom.xml
 ##
 @@ -147,6 +147,12 @@
   mockito-core
   test
 
+
+
+  org.rocksdb
+  rocksdbjni
 
 Review comment:
   RocksDB might not be the best backend. Instead of adding the extra 
dependency, I think we should just do it as a separate third-party package. The 
community can always build their own backend based on their needs. Doing it is 
simple. 
   
   Can you submit it to https://spark-packages.org/?
   
   cc @marmbrus @tdas @zsxwing 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dbtsai edited a comment on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

dbtsai edited a comment on issue #26085: [SPARK-29434] [Core] Improve the 
MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#issuecomment-540829740
 
 
   ping @dongjoon-hyun @holdenk @viirya @tgravescs 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dbtsai commented on a change in pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

dbtsai commented on a change in pull request #26085: [SPARK-29434] [Core] 
Improve the MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085#discussion_r333769537
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
 ##
 @@ -357,8 +358,8 @@ private[spark] abstract class MapOutputTracker(conf: 
SparkConf) extends Logging
  */
 private[spark] class MapOutputTrackerMaster(
 conf: SparkConf,
-broadcastManager: BroadcastManager,
-isLocal: Boolean)
+private[spark] val broadcastManager: BroadcastManager,
+private[spark] val isLocal: Boolean)
 
 Review comment:
   Expose for benchmark code.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dbtsai commented on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

dbtsai commented on issue #26085: [SPARK-29434] [Core] Improve the MapStatuses 
Serialization Performance
URL: https://github.com/apache/spark/pull/26085#issuecomment-540829740
 
 
   ping @dongjoon-hyun @holdenk @viirya


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] dbtsai opened a new pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

2019-10-10 Thread GitBox

dbtsai opened a new pull request #26085: [SPARK-29434] [Core] Improve the 
MapStatuses Serialization Performance
URL: https://github.com/apache/spark/pull/26085
 
 
   ### What changes were proposed in this pull request?
   Instead of using GZIP for compressing the serialized `MapStatuses`, ZStd 
provides better compression rate and faster compression time.
   
   The original approach is serializing and writing data directly into 
`GZIPOutputStream` as one step; however, the compression time is faster if a 
bigger chuck of the data is processed by the codec at once. As a result, in 
this PR, the serialized data is written into an uncompressed byte array, and 
then the data will be compressed. For smaller `MapStatues`, we find that it 
gives 2x performance gain.
   
   Here is the benchmark result.
   
    20k map outputs, and each has 500 blocks
   1. ZStd two steps in this PR: 0.402 ops/ms, 89066 bytes
   2. ZStd one step as the original approach: 0.370 ops/ms, 89069 bytes
   3. GZip: 0.092 ops/ms, 217345 bytes
   
    20k map outputs, and each has 5 blocks
   1. ZStd two steps in this PR: 0.9 ops/ms, 75449 bytes
   2. ZStd one step as the original approach: 0.38 ops/ms, 75452 bytes
   3. GZip: 0.21 ops/ms, 160094 bytes
   
   ### Why are the changes needed?
   Decrease the time for serializing the `MapStatuses` in large scale job.
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   Existing tests.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq)

2019-10-10 Thread GitBox

JoshRosen commented on issue #26076: [SPARK-29419][SQL] Fix Encoder 
thread-safety bug in createDataset(Seq)
URL: https://github.com/apache/spark/pull/26076#issuecomment-540824919
 
 
   > That sounds like a good idea to me. Can we do that first?
   
   I'll prototype this. If I get it working then I'll open a second PR and will 
ping / link it here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] vanzin commented on a change in pull request #26058: [SPARK-10614][core] Add monotonic time to Clock interface.

2019-10-10 Thread GitBox

vanzin commented on a change in pull request #26058: [SPARK-10614][core] Add 
monotonic time to Clock interface.
URL: https://github.com/apache/spark/pull/26058#discussion_r333755039
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/util/Clock.scala
 ##
 @@ -21,7 +21,14 @@ package org.apache.spark.util
  * An interface to represent clocks, so that they can be mocked out in unit 
tests.
  */
 private[spark] trait Clock {
+  /** @return Current system time, in ms. */
   def getTimeMillis(): Long
+  /** @return Current value of monotonic time source, in ns. */
+  def nanoTime(): Long
+  /**
 
 Review comment:
   Old code didn't have them, but sure.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] rdblue commented on issue #26006: [SPARK-29279][SQL] Merge SHOW NAMESPACES and SHOW DATABASES code path

2019-10-10 Thread GitBox

rdblue commented on issue #26006: [SPARK-29279][SQL] Merge SHOW NAMESPACES and 
SHOW DATABASES code path
URL: https://github.com/apache/spark/pull/26006#issuecomment-540814215
 
 
   +1 from me as well


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] rdblue commented on a change in pull request #25955: [SPARK-29277][SQL] Add early DSv2 filter and projection pushdown

2019-10-10 Thread GitBox

rdblue commented on a change in pull request #25955: [SPARK-29277][SQL] Add 
early DSv2 filter and projection pushdown
URL: https://github.com/apache/spark/pull/25955#discussion_r333747933
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
 ##
 @@ -55,7 +56,52 @@ case class DataSourceV2Relation(
   }
 
   override def computeStats(): Statistics = {
-val scan = newScanBuilder().build()
+if (Utils.isTesting) {
+  // when testing, throw an exception if this computeStats method is 
called because stats should
+  // not be accessed before pushing the projection and filters to create a 
scan. otherwise, the
+  // stats are not accurate because they are based on a full table scan of 
all columns.
+  throw new UnsupportedOperationException(
+s"BUG: computeStats called before pushdown on DSv2 relation: $name")
+} else {
+  // when not testing, return stats because bad stats are better than 
failing a query
+  newScanBuilder() match {
+case r: SupportsReportStatistics =>
+  val statistics = r.estimateStatistics()
+  DataSourceV2Relation.transformV2Stats(statistics, None, 
conf.defaultSizeInBytes)
+case _ =>
+  Statistics(sizeInBytes = conf.defaultSizeInBytes)
+  }
+}
+  }
+
+  override def newInstance(): DataSourceV2Relation = {
+copy(output = output.map(_.newInstance()))
+  }
+}
+
+/**
+ * A logical plan for a DSv2 table with a scan already created.
+ *
+ * This is used in the optimizer to push filters and projection down before 
conversion to physical
+ * plan. This ensures that the stats that are used by the optimizer account 
for the filters and
+ * projection that will be pushed down.
+ *
+ * @param table a DSv2 [[Table]]
+ * @param scan a DSv2 [[Scan]]
+ * @param output the output attributes of this relation
+ */
+case class DataSourceV2ScanRelation(
 
 Review comment:
   This should be a separate node to avoid correctness problems. Otherwise, it 
is easy to accidentally write rules that match both `DataSourceV2Relation` and 
`DataSourceV2ScanRelation` but does not handle the case where operators have 
already been pushed.
   
   When a filter is pushed down, it is also removed from the filters on top of 
the scan. If push-down happens a second time because rules match the same node, 
then it is easy for a mistake to ignore the original pushed filter and create a 
second independent scan. That's a correctness bug that is easy to introduce by 
accident.
   
   Using a separate relation type requires rules to choose whether to support a 
relation after push-down, or just a relation before push-down. The trade-off is 
that some places need to match both, but those cases are few and worth the 
trade.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] planga82 opened a new pull request #26084: [SPARK-29433][WebUI] Fix tooltip stages table

2019-10-10 Thread GitBox

planga82 opened a new pull request #26084: [SPARK-29433][WebUI] Fix tooltip 
stages table
URL: https://github.com/apache/spark/pull/26084
 
 
   ### What changes were proposed in this pull request?
   In the Web UI, Stages table, the tool tip of Input and output column are not 
corrrect.
   
   Actual tooltip messages: 
   
   Bytes and records read from Hadoop or from Spark storage.
   Bytes and records written to Hadoop.
   In this column we are only showing bytes, not records
   
   
![image](https://user-images.githubusercontent.com/12819544/66608286-85a0e480-ebb6-11e9-812a-9760bea53664.png)
   
![image](https://user-images.githubusercontent.com/12819544/66608323-96515a80-ebb6-11e9-9e5f-e3f2cc99a3b3.png)
   
![image](https://user-images.githubusercontent.com/12819544/66608450-de707d00-ebb6-11e9-84e3-0917b5cfe6f6.png)
   
![image](https://user-images.githubusercontent.com/12819544/66608468-eaf4d580-ebb6-11e9-8c5b-2a9a290bea9c.png)
   
   ### Why are the changes needed?
   Simple correction of a tooltip
   
   ### Does this PR introduce any user-facing change?
   Yes, tooltip correction
   
   
   ### How was this patch tested?
   Manual testing
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column

2019-10-10 Thread GitBox

srowen commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] 
Binarizer support multi-column
URL: https://github.com/apache/spark/pull/26064#discussion_r333737235
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala
 ##
 @@ -69,66 +83,117 @@ final class Binarizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("3.0.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("3.0.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema, logging = true)
-val schema = dataset.schema
-val inputType = schema($(inputCol)).dataType
-val td = $(threshold)
-val metadata = outputSchema($(outputCol)).metadata
-
-val binarizerUDF = inputType match {
-  case DoubleType =>
-udf { in: Double => if (in > td) 1.0 else 0.0 }
-
-  case _: VectorUDT if td >= 0 =>
-udf { vector: Vector =>
-  val indices = ArrayBuilder.make[Int]
-  val values = ArrayBuilder.make[Double]
-  vector.foreachActive { (index, value) =>
-if (value > td) {
-  indices += index
-  values +=  1.0
+
+val (inputColNames, outputColNames, tds) =
+  if (isSet(inputCols)) {
+if (isSet(thresholds)) {
+  ($(inputCols).toSeq, $(outputCols).toSeq, $(thresholds).toSeq)
+} else {
+  ($(inputCols).toSeq, $(outputCols).toSeq, 
Seq.fill($(inputCols).length)($(threshold)))
+}
+  } else {
+(Seq($(inputCol)), Seq($(outputCol)), Seq($(threshold)))
+  }
+
+val ouputCols = inputColNames.zip(tds).map { case (inputColName, td) =>
+  val binarizerUDF = dataset.schema(inputColName).dataType match {
+case DoubleType =>
+  udf { in: Double => if (in > td) 1.0 else 0.0 }
+
+case _: VectorUDT if td >= 0 =>
+  udf { vector: Vector =>
+val indices = ArrayBuilder.make[Int]
+val values = ArrayBuilder.make[Double]
+vector.foreachActive { (index, value) =>
+  if (value > td) {
+indices += index
+values +=  1.0
+  }
 }
+Vectors.sparse(vector.size, indices.result(), 
values.result()).compressed
   }
-  Vectors.sparse(vector.size, indices.result(), 
values.result()).compressed
-}
 
-  case _: VectorUDT if td < 0 =>
-this.logWarning(s"Binarization operations on sparse dataset with 
negative threshold " +
-  s"$td will build a dense output, so take care when applying to 
sparse input.")
-udf { vector: Vector =>
-  val values = Array.fill(vector.size)(1.0)
-  vector.foreachActive { (index, value) =>
-if (value <= td) {
-  values(index) = 0.0
+case _: VectorUDT if td < 0 =>
+  this.logWarning(s"Binarization operations on sparse dataset with 
negative threshold " +
+s"$td will build a dense output, so take care when applying to 
sparse input.")
+  udf { vector: Vector =>
+val values = Array.fill(vector.size)(1.0)
 
 Review comment:
   Nit, while we're here. Is it maybe faster to start with an array of 0, and 
flip the ones that exceed the threshold? just because the array is already 
initialized to 0.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column

2019-10-10 Thread GitBox

srowen commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] 
Binarizer support multi-column
URL: https://github.com/apache/spark/pull/26064#discussion_r333737897
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala
 ##
 @@ -69,66 +83,117 @@ final class Binarizer @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("3.0.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("3.0.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 val outputSchema = transformSchema(dataset.schema, logging = true)
-val schema = dataset.schema
-val inputType = schema($(inputCol)).dataType
-val td = $(threshold)
-val metadata = outputSchema($(outputCol)).metadata
-
-val binarizerUDF = inputType match {
-  case DoubleType =>
-udf { in: Double => if (in > td) 1.0 else 0.0 }
-
-  case _: VectorUDT if td >= 0 =>
-udf { vector: Vector =>
-  val indices = ArrayBuilder.make[Int]
-  val values = ArrayBuilder.make[Double]
-  vector.foreachActive { (index, value) =>
-if (value > td) {
-  indices += index
-  values +=  1.0
+
+val (inputColNames, outputColNames, tds) =
 
 Review comment:
   Do you also need to check if outputCols and inputCols are set together? or 
is that already done elsewhere now?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on issue #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column

2019-10-10 Thread GitBox

huaxingao commented on issue #26064: [SPARK-23578][ML][PYSPARK] Binarizer 
support multi-column
URL: https://github.com/apache/spark/pull/26064#issuecomment-540787575
 
 
   I have a general question: What is the criteria for adding multi-column 
support to a ML algorithm? Right now ```Bucketizer```, ```StringIndexer```, 
```QuantileDiscretizer```, and this ```Binarizer``` support multi-column. Do we 
need to add multi-column support to other algorithms? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column

2019-10-10 Thread GitBox

huaxingao commented on a change in pull request #26064: 
[SPARK-23578][ML][PYSPARK] Binarizer support multi-column
URL: https://github.com/apache/spark/pull/26064#discussion_r333727044
 
 

 ##
 File path: 
mllib/src/test/scala/org/apache/spark/ml/feature/BinarizerSuite.scala
 ##
 @@ -122,5 +122,118 @@ class BinarizerSuite extends MLTest with 
DefaultReadWriteTest {
   .setOutputCol("myOutputCol")
   .setThreshold(0.1)
 testDefaultReadWrite(t)
+
+val t2 = new Binarizer()
+  .setInputCols(Array("input1", "input2", "input3"))
+  .setOutputCols(Array("result1", "result2", "result3"))
+  .setThresholds(Array(30.0, 30.0, 30.0))
+testDefaultReadWrite(t2)
+  }
+
+  test("Multiple Columns: Test thresholds") {
+val thresholds = Array(10.0, -0.5, 0.0)
+
+val data1 = Seq(5.0, 11.0)
+val expected1 = Seq(0.0, 1.0)
+val data2 = Seq(Vectors.sparse(3, Array(1), Array(0.5)),
+  Vectors.dense(Array(0.0, 0.5, 0.0)))
+val expected2 = Seq(Vectors.dense(Array(1.0, 1.0, 1.0)),
+  Vectors.dense(Array(1.0, 1.0, 1.0)))
+val data3 = Seq(0.0, 1.0)
+val expected3 = Seq(0.0, 1.0)
+
+val df = Seq(0, 1).map { idx =>
+  (data1(idx), data2(idx), data3(idx), expected1(idx), expected2(idx), 
expected3(idx))
+}.toDF("input1", "input2", "input3", "expected1", "expected2", "expected3")
+
+val binarizer = new Binarizer()
+  .setInputCols(Array("input1", "input2", "input3"))
+  .setOutputCols(Array("result1", "result2", "result3"))
+  .setThresholds(thresholds)
+
+binarizer.transform(df)
+  .select("result1", "expected1", "result2", "expected2", "result3", 
"expected3")
+  .collect().foreach {
+  case Row(r1: Double, e1: Double, r2: Vector, e2: Vector, r3: Double, e3: 
Double) =>
+assert(r1 === e1,
+  s"The result value is not correct after bucketing. Expected $e1 but 
found $r1")
+assert(r2 === e2,
+  s"The result value is not correct after bucketing. Expected $e2 but 
found $r2")
+assert(r3 === e3,
+  s"The result value is not correct after bucketing. Expected $e3 but 
found $r3")
+}
+  }
+
+  test("Multiple Columns: Comparing setting threshold with setting thresholds 
" +
+"explicitly with identical values") {
+val data1 = Array.range(1, 21, 1).map(_.toDouble)
+val data2 = Array.range(1, 40, 2).map(_.toDouble)
+val data3 = Array.range(1, 60, 3).map(_.toDouble)
+val df = (0 until 20).map { idx =>
+  (data1(idx), data2(idx), data3(idx))
+}.toDF("input1", "input2", "input3")
+
+val binarizerSingleThreshold = new Binarizer()
+  .setInputCols(Array("input1", "input2", "input3"))
+  .setOutputCols(Array("result1", "result2", "result3"))
+  .setThreshold(30.0)
+
+val df2 = binarizerSingleThreshold.transform(df)
+
+val binarizerMultiThreshold = new Binarizer()
+  .setInputCols(Array("input1", "input2", "input3"))
+  .setOutputCols(Array("expected1", "expected2", "expected3"))
+  .setThresholds(Array(30.0, 30.0, 30.0))
+
+binarizerMultiThreshold.transform(df2)
+  .select("result1", "expected1", "result2", "expected2", "result3", 
"expected3")
+  .collect().foreach {
+  case Row(r1: Double, e1: Double, r2: Double, e2: Double, r3: Double, e3: 
Double) =>
+assert(r1 === e1,
+  s"The result value is not correct after bucketing. Expected $e1 but 
found $r1")
+assert(r2 === e2,
+  s"The result value is not correct after bucketing. Expected $e2 but 
found $r2")
+assert(r3 === e3,
+  s"The result value is not correct after bucketing. Expected $e3 but 
found $r3")
+}
+  }
+
+  test("Multiple Columns: Mismatched sizes of inputCols/outputCols") {
+val binarizer = new Binarizer()
+  .setInputCols(Array("input"))
+  .setOutputCols(Array("result1", "result2"))
+  .setThreshold(1.0)
+val df = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
+  .map(Tuple1.apply).toDF("input")
+intercept[IllegalArgumentException] {
+  binarizer.transform(df).count()
+}
+  }
+
+  test("Multiple Columns: Mismatched sizes of inputCols/thresholds") {
+val binarizer = new Binarizer()
+  .setInputCols(Array("input1", "input2"))
+  .setOutputCols(Array("result1", "result2"))
+  .setThresholds(Array(1.0, 2.0, 3.0))
+val data1 = Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0, 2.0)
+val data2 = Array(1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 3.0)
+val df = data1.zip(data2).toSeq.toDF("input1", "input2")
+intercept[IllegalArgumentException] {
+  binarizer.transform(df).count()
+}
+  }
+
+  test("Multiple Columns: Set both of threshold/thresholds") {
+val binarizer = new Binarizer()
+  .setInputCols(Array("input1", "input2"))
+  .setOutputCols(Array("result1", "result2"))
+  .setThresholds(Array(1.0, 2.0))
+  .setThreshold(1.0)
+val data1 = Array(1.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0,

[GitHub] [spark] huaxingao commented on a change in pull request #26064: [SPARK-23578][ML][PYSPARK] Binarizer support multi-column

2019-10-10 Thread GitBox

huaxingao commented on a change in pull request #26064: 
[SPARK-23578][ML][PYSPARK] Binarizer support multi-column
URL: https://github.com/apache/spark/pull/26064#discussion_r333725521
 
 

 ##
 File path: python/pyspark/ml/feature.py
 ##
 @@ -65,7 +65,8 @@
 
 
 @inherit_doc
-class Binarizer(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadable, 
JavaMLWritable):
+class Binarizer(JavaTransformer, HasThreshold, HasThresholds, HasInputCol, 
HasOutputCol,
+HasInputCols, HasOutputCols, JavaMLReadable, JavaMLWritable):
 
 Review comment:
   Since multi-column is supported now, maybe change the below docstring to 
reflect the multi-column support? And also change the Scala doc? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] squito commented on a change in pull request #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener

2019-10-10 Thread GitBox

squito commented on a change in pull request #25943: 
[WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for 
(SQL)AppStatusListener
URL: https://github.com/apache/spark/pull/25943#discussion_r333716820
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/status/storeTypes.scala
 ##
 @@ -76,6 +109,29 @@ private[spark] class JobDataWrapper(
 
   @JsonIgnore @KVIndex("completionTime")
   private def completionTime: Long = 
info.completionTime.map(_.getTime).getOrElse(-1L)
+
+  def toLiveJob: LiveJob = {
 
 Review comment:
   you can get multiple taskend events for the same `taskInfo.index` however, 
so you actually need those indices


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] squito commented on issue #25943: [WIP][SPARK-29261][SQL][CORE] Support recover live entities from KVStore for (SQL)AppStatusListener

2019-10-10 Thread GitBox

squito commented on issue #25943: [WIP][SPARK-29261][SQL][CORE] Support recover 
live entities from KVStore for (SQL)AppStatusListener
URL: https://github.com/apache/spark/pull/25943#issuecomment-540769871
 
 
   > Hmm...This PR doesn't store live entities, instead, it recovers (or say, 
restore) live entities from a KVStore
   
   I see, this is probably the key part I was missing.  So you're saying the 
KVStore already has enough info, this PR just repopulates the in-memory version 
of all the LiveEntities?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] tgravescs commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling

2019-10-10 Thread GitBox

tgravescs commented on a change in pull request #26078: [SPARK-29151][CORE] 
Support fractional resources for task resource scheduling
URL: https://github.com/apache/spark/pull/26078#discussion_r333715025
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala
 ##
 @@ -25,10 +25,13 @@ import org.apache.spark.resource.{ResourceAllocator, 
ResourceInformation}
  * information.
  * @param name Resource name
  * @param addresses Resource addresses provided by the executor
+ * @param numParts Number of ways each resource is subdivided when scheduling 
tasks
  */
-private[spark] class ExecutorResourceInfo(name: String, addresses: Seq[String])
+private[spark] class ExecutorResourceInfo(name: String, addresses: Seq[String],
+  numParts: Int)
   extends ResourceInformation(name, addresses.toArray) with ResourceAllocator {
 
   override protected def resourceName = this.name
   override protected def resourceAddresses = this.addresses
+  override protected def resourcesPerAddress = numParts
 
 Review comment:
   sure slotsPerAddress sounds good.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #26013: [SPARK-29347][SQL] Add JSON serialization for external Rows

2019-10-10 Thread GitBox

viirya commented on a change in pull request #26013: [SPARK-29347][SQL] Add 
JSON serialization for external Rows
URL: https://github.com/apache/spark/pull/26013#discussion_r333711709
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
 ##
 @@ -501,4 +513,88 @@ trait Row extends Serializable {
   private def getAnyValAs[T <: AnyVal](i: Int): T =
 if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is 
null")
 else getAs[T](i)
+
+  /** The compact JSON representation of this row. */
+  def json: String = compact(jsonValue)
+
+  /** The pretty (i.e. indented) JSON representation of this row. */
+  def prettyJson: String = pretty(render(jsonValue))
+
+  /**
+   * JSON representation of the row.
+   *
+   * Note that this only supports the data types that are also supported by
+   * [[org.apache.spark.sql.catalyst.encoders.RowEncoder]].
+   *
+   * @return the JSON representation of the row.
+   */
+  private[sql] def jsonValue: JValue = {
+require(schema != null, "JSON serialization requires a non-null schema.")
+
+lazy val timeZone = TimeZone.getTimeZone(SQLConf.get.sessionLocalTimeZone)
+lazy val dateFormatter = DateFormatter.apply(timeZone.toZoneId)
+lazy val timestampFormatter = TimestampFormatter.apply(timeZone.toZoneId)
+
+// Convert an iterator of values to a json array
+def iteratorToJsonArray(iterator: Iterator[_], elementType: DataType): 
JArray = {
+  JArray(iterator.map(toJson(_, elementType)).toList)
+}
+
+// Convert a value to json.
+def toJson(value: Any, dataType: DataType): JValue = (value, dataType) 
match {
+  case (null, _) => JNull
+  case (b: Boolean, _) => JBool(b)
+  case (b: Byte, _) => JLong(b)
+  case (s: Short, _) => JLong(s)
+  case (i: Int, _) => JLong(i)
+  case (l: Long, _) => JLong(l)
+  case (f: Float, _) => JDouble(f)
+  case (d: Double, _) => JDouble(d)
+  case (d: BigDecimal, _) => JDecimal(d)
+  case (d: java.math.BigDecimal, _) => JDecimal(d)
+  case (d: Decimal, _) => JDecimal(d.toBigDecimal)
+  case (s: String, _) => JString(s)
+  case (b: Array[Byte], BinaryType) =>
+JString(Base64.getEncoder.encodeToString(b))
+  case (d: LocalDate, _) =>
+JString(dateFormatter.format(DateTimeUtils.localDateToDays(d)))
+  case (d: Date, _) =>
+JString(dateFormatter.format(DateTimeUtils.fromJavaDate(d)))
+  case (i: Instant, _) =>
+JString(timestampFormatter.format(DateTimeUtils.instantToMicros(i)))
+  case (t: Timestamp, _) =>
+JString(timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t)))
+  case (a: Array[_], ArrayType(elementType, _)) =>
+iteratorToJsonArray(a.iterator, elementType)
+  case (s: Seq[_], ArrayType(elementType, _)) =>
+iteratorToJsonArray(s.iterator, elementType)
+  case (m: Map[String @unchecked, _], MapType(StringType, valueType, _)) =>
 
 Review comment:
   Can other primitive types like Int be good for this format too?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] abellina commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling

2019-10-10 Thread GitBox

abellina commented on a change in pull request #26078: [SPARK-29151][CORE] 
Support fractional resources for task resource scheduling
URL: https://github.com/apache/spark/pull/26078#discussion_r333704705
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/resource/ResourceAllocator.scala
 ##
 @@ -30,27 +30,35 @@ trait ResourceAllocator {
 
   protected def resourceName: String
   protected def resourceAddresses: Seq[String]
+  protected def resourcesPerAddress: Int
 
   /**
-   * Map from an address to its availability, the value `true` means the 
address is available,
-   * while value `false` means the address is assigned.
+   * Map from an address to its availability, a value > 0 means the address is 
available,
+   * while value of 0 means the address is fully assigned.
+   *
+   * For task resources ([[org.apache.spark.scheduler.ExecutorResourceInfo]]), 
this value
+   * can be fractional.
 
 Review comment:
   good catch.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] abellina commented on a change in pull request #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling

2019-10-10 Thread GitBox

abellina commented on a change in pull request #26078: [SPARK-29151][CORE] 
Support fractional resources for task resource scheduling
URL: https://github.com/apache/spark/pull/26078#discussion_r333704288
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/scheduler/ExecutorResourceInfo.scala
 ##
 @@ -25,10 +25,13 @@ import org.apache.spark.resource.{ResourceAllocator, 
ResourceInformation}
  * information.
  * @param name Resource name
  * @param addresses Resource addresses provided by the executor
+ * @param numParts Number of ways each resource is subdivided when scheduling 
tasks
  */
-private[spark] class ExecutorResourceInfo(name: String, addresses: Seq[String])
+private[spark] class ExecutorResourceInfo(name: String, addresses: Seq[String],
+  numParts: Int)
   extends ResourceInformation(name, addresses.toArray) with ResourceAllocator {
 
   override protected def resourceName = this.name
   override protected def resourceAddresses = this.addresses
+  override protected def resourcesPerAddress = numParts
 
 Review comment:
   One issue here is that this would be part of the `ResourceAllocator` 
interface, of which there are two `WorkerResourceInfo`, which sets the 
`resourcesPerAddress` to 1, and `ExecutorResourceInfo`, which is the variable 
one. 
   
   Could we all it something like `slotsPerAddress`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service

2019-10-10 Thread GitBox

tgravescs commented on a change in pull request #26000: 
[SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service
URL: https://github.com/apache/spark/pull/26000#discussion_r333690639
 
 

 ##
 File path: docs/running-on-yarn.md
 ##
 @@ -492,6 +492,13 @@ To use a custom metrics.properties for the application 
master and executors, upd
   If it is not set then the YARN application ID is used.
   
 
+
+  spark.yarn.shuffle.service.name
+  spark_shuffle
+  
+Name of the external shuffle service.
 
 Review comment:
   many newbie's aren't familiar with what external shuffle service is or even 
yarn so its best to be clear. How about:
   
   The name of the external shuffle service.
   The external shuffle service itself is configured and started by YARN (see 
Configuring the External Shuffle Service for details). The name specified here 
must match the name YARN used.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service

2019-10-10 Thread GitBox

tgravescs commented on a change in pull request #26000: 
[SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service
URL: https://github.com/apache/spark/pull/26000#discussion_r333691945
 
 

 ##
 File path: 
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
 ##
 @@ -136,7 +136,11 @@
   private DB db;
 
   public YarnShuffleService() {
-super("spark_shuffle");
+this("spark_shuffle");
+  }
+
+  protected YarnShuffleService(String serviceName) {
 
 Review comment:
   So the name by itself isn't going to be enough.  If you really want it 
configurable we are going to have to have the port configurable. For instance 
the config name for the port spark.shuffle.service.port needs to be able to be 
something like spark.shuffle.service.{serviceName}.port.  Otherwise all the 
spark shuffle servers will try to get the same port and fail.  The only other 
option will be to use 0 for ephemeral but 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] tgravescs commented on a change in pull request #26000: [SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service

2019-10-10 Thread GitBox

tgravescs commented on a change in pull request #26000: 
[SPARK-29330][CORE][YARN] Allow users to chose the name of Spark Shuffle service
URL: https://github.com/apache/spark/pull/26000#discussion_r333656098
 
 

 ##
 File path: 
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
 ##
 @@ -136,7 +136,11 @@
   private DB db;
 
   public YarnShuffleService() {
-super("spark_shuffle");
+this("spark_shuffle");
+  }
+
+  protected YarnShuffleService(String serviceName) {
+super(serviceName);
 logger.info("Initializing YARN shuffle service for Spark");
 
 Review comment:
   lets change the log statement to have the servicename in it


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] abellina commented on issue #26078: [SPARK-29151][CORE] Support fractional resources for task resource scheduling

2019-10-10 Thread GitBox

abellina commented on issue #26078: [SPARK-29151][CORE] Support fractional 
resources for task resource scheduling
URL: https://github.com/apache/spark/pull/26078#issuecomment-540728997
 
 
   Thanks @tgravescs will address. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] squito commented on a change in pull request #26059: [SPARK-29398][core] Support dedicated thread pools for RPC endpoints.

2019-10-10 Thread GitBox

squito commented on a change in pull request #26059: [SPARK-29398][core] 
Support dedicated thread pools for RPC endpoints.
URL: https://github.com/apache/spark/pull/26059#discussion_r333683292
 
 

 ##
 File path: core/src/main/scala/org/apache/spark/rpc/RpcEndpoint.scala
 ##
 @@ -146,3 +146,13 @@ private[spark] trait RpcEndpoint {
  * [[ThreadSafeRpcEndpoint]] for different messages.
  */
 private[spark] trait ThreadSafeRpcEndpoint extends RpcEndpoint
+
+/**
+ * An endpoint that uses a dedicated thread pool for delivering messages.
+ */
+private[spark] trait IsolatedRpcEndpoint extends RpcEndpoint {
+
+  /** How many threads to use for delivering messages. By default, use a 
single thread. */
+  def threadCount(): Int = 1
 
 Review comment:
   I'm trying to wrap my head around what happens if you create an 
`IsolatedRpcEndpoint` with threadCount() > 1, given the code in `Inbox` which 
checks for inheritance from `ThreadSafeRpcEndpoint`: 
https://github.com/apache/spark/blob/2b3c3793c97dc0e48f1e098c6f1038f51607b716/core/src/main/scala/org/apache/spark/rpc/netty/Inbox.scala#L123
   
   I guess if you expect one endpoint to be served by multiple threads, it 
makes sense you'd want `Inbox.enableConcurrent = false` and you'd have to make 
your endpoint safe for that -- but worth a comment here at least.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] ifilonenko commented on issue #25785: [SPARK-27812][CORE] Explicit System.exit after job's main

2019-10-10 Thread GitBox

ifilonenko commented on issue #25785: [SPARK-27812][CORE] Explicit System.exit 
after job's main
URL: https://github.com/apache/spark/pull/25785#issuecomment-540721815
 
 
   hey @igorcalabria, do you have a link to your PR in `kubernetes-client`? 
This problem affects our spark-on-k8s jobs and I would like to have a timeline 
on the patch or offer help where I can. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #25561: [SPARK-28810][DOC][SQL] Document SHOW TABLES in SQL Reference.

2019-10-10 Thread GitBox

srowen commented on a change in pull request #25561: [SPARK-28810][DOC][SQL] 
Document SHOW TABLES in SQL Reference.
URL: https://github.com/apache/spark/pull/25561#discussion_r333672276
 
 

 ##
 File path: docs/sql-ref-syntax-aux-show-tables.md
 ##
 @@ -18,5 +18,90 @@ license: |
   See the License for the specific language governing permissions and
   limitations under the License.
 ---
+### Description
 
-**This page is under construction**
+The `SHOW TABLES` statement returns all the tables for an optionally specified 
database.
+Additionally, the output of this statement may be filtered by an optional 
matching
+pattern. If no database is specified then the tables are returned from the 
+current database.
+
+### Syntax
+{% highlight sql %}
+SHOW TABLES [{FROM|IN} database_name] [LIKE 'regex_pattern']
+{% endhighlight %}
+
+### Parameters
+
+  {FROM|IN} database_name
+  
+ Specifies the database name from which tables are listed.
+  
+  LIKE regex_pattern
+  
+ Specifies the regular expression pattern that is used to filter out 
unwanted tables. 
+  
+   Except for * and | character, the pattern works like a 
regex.
 
 Review comment:
   OK, I might put back the back-ticks or quotes around special chars here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 >

1 - 100 of 200 matches

Mail list logo