[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471479#comment-16471479
 ] 

Yinan Li commented on SPARK-24248:
--

I think it's both more robust and easier to implement with a periodic resync, 
which is what most of the core controllers use. With this setup, you can use a 
queue to hold the executor pod updates to be processed. The resync and the 
watcher both enqueue pod updates, while a single thread dequeues and processes 
each update sequentially. This avoids the need for explicit synchronization. 
The queue also serves as a cache.
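
For illustration, a minimal Scala sketch of the queue-plus-resync pattern 
described above. The names here ({{PodUpdate}}, {{pollSnapshot}}, 
{{ExecutorPodUpdateProcessor}}) are hypothetical, not existing Spark classes; 
in the real backend the producers would be the fabric8 watch callback and a 
scheduled resync task.

{code}
import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

// PodUpdate and pollSnapshot are hypothetical stand-ins; the real code would
// carry fabric8 Pod objects and list pods through the Kubernetes client.
case class PodUpdate(podName: String, phase: String)

class ExecutorPodUpdateProcessor(pollSnapshot: () => Seq[PodUpdate]) {
  // One queue fed by both the watcher and the periodic resync.
  private val updates = new LinkedBlockingQueue[PodUpdate]()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private val consumer = Executors.newSingleThreadExecutor()

  // Called from the watch callback thread.
  def onWatchEvent(update: PodUpdate): Unit = updates.put(update)

  def start(resyncIntervalSeconds: Long): Unit = {
    // Periodic resync: enqueue a full snapshot of the executor pods.
    scheduler.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = pollSnapshot().foreach(u => updates.put(u))
    }, resyncIntervalSeconds, resyncIntervalSeconds, TimeUnit.SECONDS)

    // A single consumer processes updates sequentially, so the scheduler
    // backend needs no explicit synchronization around its state changes.
    consumer.execute(new Runnable {
      override def run(): Unit = {
        while (!Thread.currentThread().isInterrupted) {
          handle(updates.take())
        }
      }
    })
  }

  private def handle(update: PodUpdate): Unit = update.phase match {
    case "Failed" | "Succeeded" => // e.g. ask the backend to remove the executor
    case _ => // Pending/Running: nothing to do
  }
}
{code}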

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24173) Flaky Test: VersionsSuite

2018-05-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24173:
--
Description: 
*BRANCH-2.2*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/

*BRANCH-2.3*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/369/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/

  was:
*BRANCH-2.2*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/

*BRANCH-2.3*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/


> Flaky Test: VersionsSuite
> -
>
> Key: SPARK-24173
> URL: https://issues.apache.org/jira/browse/SPARK-24173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *BRANCH-2.2*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/
> *BRANCH-2.3*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/369/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24211) Flaky test: StreamingOuterJoinSuite

2018-05-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24211:
--
Description: 
*windowed left outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/

*windowed right outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/

  was:
*windowed left outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/

*windowed right outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/


> Flaky test: StreamingOuterJoinSuite
> ---
>
> Key: SPARK-24211
> URL: https://issues.apache.org/jira/browse/SPARK-24211
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *windowed left outer join*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/
> *windowed right outer join*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24197) add array_sort function

2018-05-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24197.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/21294

> add array_sort function
> ---
>
> Key: SPARK-24197
> URL: https://issues.apache.org/jira/browse/SPARK-24197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Add a SparkR equivalent function to 
> [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11150) Dynamic partition pruning

2018-05-10 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471311#comment-16471311
 ] 

Henry Robinson commented on SPARK-11150:


The title of this JIRA is 'dynamic partition pruning', but the examples are a) 
not related to dynamic partition pruning and b) work as expected in Spark 2.3.

Spark will correctly infer, given {{t1.foo = t2.bar AND t2.bar = 1}}, that 
{{t1.foo = 1}}. It will prune partitions statically - at compile time - and 
that is reflected in the scan.

_Dynamic_ partition pruning is about pruning partitions based on information 
that can only be inferred at run time. A typical example is:

{{SELECT * FROM dim_table JOIN fact_table ON (dim_table.partcol = 
fact_table.partcol) WHERE dim_table.othercol > 10}}.

Little can be inferred from the query at compilation time about which 
partitions to scan in {{fact_table}} (except that only the intersection of 
{{fact_table}}'s and {{dim_table}}'s partitions should be scanned). 

However at run time, the set of partition keys produced by scanning 
{{dim_table}} with the filter predicate can be recorded - usually at the join 
node - and sent to the probe side of the join (in this case {{fact_table}}). 
The scan of {{fact_table}} can use that set to filter out any partitions that 
aren't in the build side of the join, because they wouldn't match any rows 
during the join. Hive and Impala both support this kind of partition filtering 
(and it doesn't only have to apply to partitions - you can filter the rows as 
well if evaluating the predicate isn't too expensive).
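
For illustration only (not the proposal here): until dynamic pruning exists, a 
common manual workaround is to collect the dim-side partition keys on the 
driver and turn them into a static IN filter on the fact table, which the 
planner can then prune at compile time. This sketch reuses the table and 
column names from the example query above and assumes a {{SparkSession}} named 
{{spark}}.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Collect the qualifying dim-side partition keys on the driver...
val partKeys = spark.table("dim_table")
  .filter($"othercol" > 10)
  .select($"partcol")
  .distinct()
  .collect()
  .map(_.get(0))

// ...then feed them back as a static IN filter on the fact table's
// partition column, which static partition pruning can act on.
val prunedJoin = spark.table("fact_table")
  .filter($"partcol".isin(partKeys: _*))
  .join(spark.table("dim_table").filter($"othercol" > 10), "partcol")
{code}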

The challenges are:

* making sure that the representation chosen for the filters is compact enough 
to be shuffled around all the executors that might be performing the scan task, 
while having a low false-positive rate
* adding the logic to the planner to detect these opportunities
* optionally disabling the filtering if it's not being selective enough
* coordinating between the build and probe sides to ensure that the latter 
waits for the former (this is a bit easier in Spark because it's not a 
pipelined execution model)

Do we agree that this JIRA should be made explicitly about dynamic partition 
pruning, or is that tracked elsewhere? If it's tracked elsewhere, I propose 
closing this one; otherwise I can edit this one's description.



> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Priority: Major
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select  from tab where partcol=1 will prune on value 1
> Select  from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are based on parquets.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471295#comment-16471295
 ] 

Matt Cheah commented on SPARK-24248:


I see - I suppose if the watch connection drops, we should try to re-establish 
the connection to the API server periodically, and once we re-establish it we 
can do a full sync then?
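
A rough sketch of that reconnect-then-resync flow, with hypothetical helpers 
({{openWatch}}, {{fullResync}}) standing in for the actual backend logic:

{code}
import java.util.concurrent.{Executors, TimeUnit}

class WatchReconnector(openWatch: () => Boolean, fullResync: () => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Called when the watch connection is closed or errors out.
  def onWatchClosed(): Unit = scheduleReconnect()

  private def scheduleReconnect(): Unit = {
    scheduler.schedule(new Runnable {
      override def run(): Unit = {
        if (openWatch()) {
          fullResync()        // reconcile anything missed while disconnected
        } else {
          scheduleReconnect() // API server still unreachable, try again later
        }
      }
    }, 5, TimeUnit.SECONDS)
  }
}
{code}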

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24233) union operation on read of dataframe does not produce correct result

2018-05-10 Thread smohr003 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

smohr003 updated SPARK-24233:
-
Description: 
I know that I can use the wildcard * to read all subfolders, but I am trying 
to use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
 Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
 Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
 Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
 Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p =>
    df = df.union(spark.read.schema(dfSchema).parquet(p))
      .select(df.columns.head, df.columns.tail: _*))
  df
}

val dfAll = readDir(absolutePath)
dfAll.count

The count of the produced dfAll is 4, but in this example it should be 10. 

  was:
I know that I can use wild card * to read all subfolders. But, I am trying to 
use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
 Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
 Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
 Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
 Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.\{FileSystem, Path}
 import java.net.URI
 def readDir(path: String): DataFrame =

{ val fs = FileSystem.get(new URI(path), new Configuration()) val subDir = 
fs.listStatus(new Path(path)).map(i => i.getPath.toString) var df = 
spark.read.parquet(subDir.head) val dfSchema = df.schema 
subDir.tail.par.foreach(p => df = 
df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, 
df.columns.tail:_*)) df }

val dfAll = readDir(absolutePath)
 dfAll.count

 The count of produced df is 4, which in this example should be 10. 


> union operation on read of dataframe does not produce correct result 
> -
>
> Key: SPARK-24233
> URL: https://issues.apache.org/jira/browse/SPARK-24233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: smohr003
>Priority: Major
>
> I know that I can use the wildcard * to read all subfolders, but I am trying 
> to use .par and .schema to speed up the read process. 
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "1")
>  Seq((11, "one"), (22, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "2")
>  Seq((111, "one"), (222, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "3")
>  Seq((, "one"), (, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "4")
>  Seq((2, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "5")
>  
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import java.net.URI
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p =>
>     df = df.union(spark.read.schema(dfSchema).parquet(p))
>       .select(df.columns.head, df.columns.tail: _*))
>   df
> }
> val dfAll = readDir(absolutePath)
> dfAll.count
> The count of the produced dfAll is 4, but in this example it should be 10. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471288#comment-16471288
 ] 

Yinan Li commented on SPARK-24248:
--

Just realized one thing: solely relying on the watcher poses the risk of losing 
executor pod updates. This can happen, for example, if the API server gets 
restarted or if the watch connection is interrupted temporarily while the pods 
are running. So periodic polling is still needed; this is referred to as a 
resync in controller terms. Enabling resync is almost always a good thing. 

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-10 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471283#comment-16471283
 ] 

Takeshi Yamamuro commented on SPARK-24204:
--

ok, I'll do it later. Thanks for the description update, too.

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> *SUMMARY*
> - CSV: raises an analysis exception
> - JSON: drops columns with null types
> - Parquet/ORC: raise runtime exceptions
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed:
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> is done for the CSV format:
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65
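
For illustration, a hedged sketch of the kind of driver-side check being 
proposed: reject unsupported types (here, {{NullType}}) before any write tasks 
launch, similar in spirit to the CSV verification linked above. The helper 
name and the exception type are illustrative, not the actual implementation.

{code}
import org.apache.spark.sql.types._

def verifyWriteSchema(format: String, schema: StructType): Unit = {
  def check(dt: DataType): Unit = dt match {
    case NullType =>
      throw new UnsupportedOperationException(
        s"$format data source does not support null data type.")
    case ArrayType(elementType, _) => check(elementType)
    case MapType(keyType, valueType, _) => check(keyType); check(valueType)
    case st: StructType => st.fields.foreach(f => check(f.dataType))
    case _ => // supported atomic type, nothing to do
  }
  schema.fields.foreach(f => check(f.dataType))
}
{code}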



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471280#comment-16471280
 ] 

Matt Cheah edited comment on SPARK-24248 at 5/10/18 11:18 PM:
--

I thought about it a bit more, and believe that we can do most if not all of 
our actions in the Watcher directly.

In other words, can we drive the entire lifecycle of all executors solely via 
watch events? This would be a pretty big rewrite, but you get a large number of 
benefits from this - namely, you remove any need for synchronization or local 
state management at all. {{KubernetesClusterSchedulerBackend}} becomes 
effectively stateless, apart from the parent class's fields.

This would imply at least the following changes, though I'm sure I'm missing 
some:
 * When the watcher receives a modified or error event, check the status of the 
executor, construct the exit reason, and call {{RemoveExecutor}} directly
 * The watcher keeps a running count of active executors and itself triggers 
rounds of creating new executors (instead of the periodic polling)
 * {{KubernetesDriverEndpoint::onDisconnected}} is a tricky one. What I'm 
thinking is that we can just disable the executor but not remove it, counting 
on the Watch to receive an event that would actually trigger removing the 
executor. The idea here is that the status of the pods as reported by the Watch 
should be fully reliable - e.g. whenever any error occurs in the executor such 
that it becomes unusable, the Kubernetes API should report such state. We could 
perhaps make the API's representation of the world more accurate by attaching 
liveness probes to the executor pod.
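
A minimal sketch of the watcher-driven handling in the first bullet above, 
assuming the fabric8 {{Watcher}} interface and Spark's executor-id pod label; 
the callback names ({{removeExecutorFromSpark}}, 
{{requestNewExecutorsIfNeeded}}) are hypothetical stand-ins for the scheduler 
backend's internals.

{code}
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}

class ExecutorPodsWatcherSketch(
    removeExecutorFromSpark: (String, String) => Unit,
    requestNewExecutorsIfNeeded: () => Unit) extends Watcher[Pod] {

  override def eventReceived(action: Watcher.Action, pod: Pod): Unit = {
    val executorId = pod.getMetadata.getLabels.get("spark-exec-id")
    action match {
      case Watcher.Action.MODIFIED | Watcher.Action.ERROR =>
        val phase = pod.getStatus.getPhase
        if (phase == "Failed" || phase == "Succeeded") {
          // Build the exit reason from the pod status and remove the executor.
          removeExecutorFromSpark(executorId,
            s"Pod ${pod.getMetadata.getName} reached phase $phase")
          requestNewExecutorsIfNeeded()
        }
      case Watcher.Action.DELETED =>
        removeExecutorFromSpark(executorId, "Executor pod was deleted")
        requestNewExecutorsIfNeeded()
      case _ => // ADDED: the pod is just starting up, nothing to do yet
    }
  }

  override def onClose(cause: KubernetesClientException): Unit = {
    // A dropped watch would be re-established here, followed by a full
    // resync, as discussed elsewhere in this thread.
  }
}
{code}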

 


was (Author: mcheah):
I thought about it a bit more, and believe that we can do most if not all of 
our actions in the Watcher directly.

In other words, can we drive the entire lifecycle of all executors solely from 
the perspective of watch events? This would be a pretty big rewrite, but you 
get a large number of benefits from this - namely, you remove any need for 
synchronization or local state management at all. 
{{KubernetesClusterSchedulerBackend}} becomes effectively stateless, apart from 
the parent class's fields.

This would imply at least the following changes, though I'm sure I'm missing 
some:
 * When the watcher receives a modified or error event, check the status of the 
executor, construct the exit reason, and call \{{RemoveExecutor}} directly
 * The watcher keeps a running count of active executors and itself triggers 
rounds of creating new executors (instead of the periodic polling)
 * {\{KubernetesDriverEndpoint::onDisconnected}} is a tricky one. What I'm 
thinking is that we can just disable the executor but not remove it, counting 
on the Watch to receive an event that would actually trigger removing the 
executor. The idea here is that the status of the pods as reported by the Watch 
should be fully reliable - e.g. whenever any error occurs in the executor such 
that it becomes unusable, the Kubernetes API should report such state. We could 
perhaps make the API's representation of the world more accurate by attaching 
liveness probes to the executor pod.

 

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471280#comment-16471280
 ] 

Matt Cheah commented on SPARK-24248:


I thought about it a bit more, and believe that we can do most if not all of 
our actions in the Watcher directly.

In other words, can we drive the entire lifecycle of all executors solely from 
the perspective of watch events? This would be a pretty big rewrite, but you 
get a large number of benefits from this - namely, you remove any need for 
synchronization or local state management at all. 
{{KubernetesClusterSchedulerBackend}} becomes effectively stateless, apart from 
the parent class's fields.

This would imply at least the following changes, though I'm sure I'm missing 
some:
 * When the watcher receives a modified or error event, check the status of the 
executor, construct the exit reason, and call \{{RemoveExecutor}} directly
 * The watcher keeps a running count of active executors and itself triggers 
rounds of creating new executors (instead of the periodic polling)
 * {\{KubernetesDriverEndpoint::onDisconnected}} is a tricky one. What I'm 
thinking is that we can just disable the executor but not remove it, counting 
on the Watch to receive an event that would actually trigger removing the 
executor. The idea here is that the status of the pods as reported by the Watch 
should be fully reliable - e.g. whenever any error occurs in the executor such 
that it becomes unusable, the Kubernetes API should report such state. We could 
perhaps make the API's representation of the world more accurate by attaching 
liveness probes to the executor pod.

 

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471259#comment-16471259
 ] 

Yinan Li commented on SPARK-24248:
--

Actually, even if the fabric8 client does not support caching, we can 
effectively achieve it and greatly simplify our code logic by doing the 
following:
 # Get rid of the existing in-memory data structures and replace them with a 
single in-memory cache of all live executor pod objects.
 # Update the cache on every watch event: a new pod event adds an entry to the 
cache, a modification event updates the existing object, and a deletion event 
removes the object.
 # Always get the status of an executor pod by retrieving the pod object from 
the cache, falling back to talking to the API server on a cache miss (due to 
the delay of the watch event).
Thoughts?
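
A minimal sketch of the cache in steps 1-3 above, keyed by pod name and 
assuming fabric8 {{Pod}} objects; {{apiServerLookup}} is a hypothetical 
fallback that fetches the pod from the API server on a miss (e.g. via the 
client's {{pods().withName(...).get()}}).

{code}
import java.util.concurrent.ConcurrentHashMap

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.Watcher

class ExecutorPodCache(apiServerLookup: String => Option[Pod]) {
  private val cache = new ConcurrentHashMap[String, Pod]()

  // Step 2: keep the cache in sync with watch events.
  def onWatchEvent(action: Watcher.Action, pod: Pod): Unit = {
    val name = pod.getMetadata.getName
    action match {
      case Watcher.Action.ADDED | Watcher.Action.MODIFIED => cache.put(name, pod)
      case Watcher.Action.DELETED => cache.remove(name)
      case _ => // ERROR events are handled by the lifecycle logic, not the cache
    }
  }

  // Step 3: read through the cache, falling back to the API server on a miss
  // (the watch event for a brand-new pod may not have arrived yet).
  def podStatus(name: String): Option[Pod] = {
    Option(cache.get(name)).orElse {
      val fetched = apiServerLookup(name)
      fetched.foreach(p => cache.put(name, p))
      fetched
    }
  }
}
{code}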

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471244#comment-16471244
 ] 

Yinan Li commented on SPARK-24248:
--

It's potentially possible to get rid of the in-memory state in favor of getting 
pod state from the pod objects directly if we are fine with the performance 
penalty of communicating with the API server for each state check. One 
optimization is to cache executor pod objects so retrieving them doesn't 
involve network communication. This is possible with the golang client library, 
but I'm not sure about the Java client we use.  

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24198) add slice function

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471215#comment-16471215
 ] 

Apache Spark commented on SPARK-24198:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/21298

> add slice function
> --
>
> Key: SPARK-24198
> URL: https://issues.apache.org/jira/browse/SPARK-24198
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add a SparkR equivalent function to 
> [SPARK-23930|https://issues.apache.org/jira/browse/SPARK-23930].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24198) add slice function

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24198:


Assignee: Apache Spark

> add slice function
> --
>
> Key: SPARK-24198
> URL: https://issues.apache.org/jira/browse/SPARK-24198
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Apache Spark
>Priority: Major
>
> Add a SparkR equivalent function to 
> [SPARK-23930|https://issues.apache.org/jira/browse/SPARK-23930].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24198) add slice function

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24198:


Assignee: (was: Apache Spark)

> add slice function
> --
>
> Key: SPARK-24198
> URL: https://issues.apache.org/jira/browse/SPARK-24198
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add a SparkR equivalent function to 
> [SPARK-23930|https://issues.apache.org/jira/browse/SPARK-23930].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471206#comment-16471206
 ] 

Matt Cheah commented on SPARK-24248:


[~foxish] [~liyinan926] curious as to what you think about this idea.

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-10 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-24248:
--

 Summary: [K8S] Use the Kubernetes cluster as the backing store for 
the state of pods
 Key: SPARK-24248
 URL: https://issues.apache.org/jira/browse/SPARK-24248
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Matt Cheah


We have a number of places in KubernetesClusterSchedulerBackend right now that 
maintain the state of pods in memory. However, the Kubernetes API can always 
give us the most up-to-date and correct view of what our executors are doing. 
We should consider moving away from in-memory state as much as we can in favor 
of using the Kubernetes cluster as the source of truth for pod status. 
Maintaining less state in memory lowers the chance that we accidentally miss 
updating one of these data structures and break the lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24247) [K8S] currentNodeToLocalTaskCount is unused in KubernetesClusterSchedulerBackend

2018-05-10 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-24247:
--

 Summary: [K8S] currentNodeToLocalTaskCount is unused in 
KubernetesClusterSchedulerBackend
 Key: SPARK-24247
 URL: https://issues.apache.org/jira/browse/SPARK-24247
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Matt Cheah


This variable isn't used: 
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L109]
 - we should either remove it or put it to good use. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10878) Race condition when resolving Maven coordinates via Ivy

2018-05-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10878.

   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.4.0
   2.3.1

I don't know what the JIRA username for victsm is, so assigning to Kazuaki.

> Race condition when resolving Maven coordinates via Ivy
> ---
>
> Key: SPARK-10878
> URL: https://issues.apache.org/jira/browse/SPARK-10878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Ryan Williams
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> I've recently been shell-scripting the creation of many concurrent 
> Spark-on-YARN apps and observing a fraction of them to fail with what I'm 
> guessing is a race condition in their Maven-coordinate resolution.
> For example, I might spawn an app for each path in file {{paths}} with the 
> following shell script:
> {code}
> cat paths | parallel "$SPARK_HOME/bin/spark-submit foo.jar {}"
> {code}
> When doing this, I observe some fraction of the spawned jobs to fail with 
> errors like:
> {code}
> :: retrieving :: org.apache.spark#spark-submit-parent
> confs: [default]
> Exception in thread "main" java.lang.RuntimeException: problem during 
> retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: 
> failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
> at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
> at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1006)
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.text.ParseException: failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
> at 
> org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
> ... 7 more
> Caused by: org.xml.sax.SAXParseException; Premature end of file.
> at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
> at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown 
> Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> {code}
> The more apps I try to launch simultaneously, the greater fraction of them 
> seem to fail with this or similar errors; a batch of ~10 will usually work 
> fine, a batch of 15 will see a few failures, and a batch of ~60 will have 
> dozens of failures.
> [This gist shows 11 recent failures I 
> observed|https://gist.github.com/ryan-williams/648bff70e518de0c7c84].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-24217:
---
Comment: was deleted

(was: Thanks for the clarification Joseph K. Bradley

Is it really required to append the result with the input dataframe? Because 
with the existing implementation, i can able to get the desired output with my 
fix.

 

For eg:

      id       neighbor          similarity                               

       1       [ 2, 3, 4, 5]    [ 1.0, 1.0, 1.0, 1.0]  

       6     [  7, 8 , 9, 10]   [1.0 1.0 1.0 1.0]  

 

Output in *spark.ml*  (With my fix)
Id   prediction
 1 0
 2 0
 3 0
 4 0
 5 0
 6 1
7  1
8   1
9  1
  10  1)

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471128#comment-16471128
 ] 

shahid edited comment on SPARK-24217 at 5/10/18 9:28 PM:
-

Thanks for the clarification, Joseph K. Bradley.

Is it really required to append the result to the input dataframe? With the 
existing implementation, I am able to get the desired output with my fix.

For example:

 id   neighbor         similarity
  1   [2, 3, 4, 5]     [1.0, 1.0, 1.0, 1.0]
  6   [7, 8, 9, 10]    [1.0, 1.0, 1.0, 1.0]

Output in *spark.ml* (with my fix):

 id   prediction
  1   0
  2   0
  3   0
  4   0
  5   0
  6   1
  7   1
  8   1
  9   1
 10   1


was (Author: shahid):
Thanks for the clarification Joseph K. Bradley

Is it really required to append the result with the input dataframe? Because 
with the existing implementation, i can able to get the desired output with my 
fix.

 

For eg:

      id       neighbor          similarity                               

       1       [ 2, 3, 4, 5]    [ 1.0, 1.0, 1.0, 1.0]  

       6     [  7, 8 , 9, 10]   [1.0 1.0 1.0 1.0]  

 

Output in *spark.ml*  (With my fix)

     id prediction
 1 0
 2 0
 3 0
 4 0
 5 0
 6 1
7  1
8   1
9  1
  10  1

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.

2018-05-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19181.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21280
[https://github.com/apache/spark/pull/21280]

> SparkListenerSuite.local metrics fails when average executorDeserializeTime 
> is too short.
> -
>
> Key: SPARK-19181
> URL: https://issues.apache.org/jira/browse/SPARK-19181
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>Assignee: Attila Zsolt Piros
>Priority: Minor
> Fix For: 2.4.0, 2.3.1
>
>
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249
> The "local metrics" test asserts that tasks should take more than 1ms on 
> average to complete, even though a code comment notes that this is a small 
> test and tasks may finish faster. I've been seeing some "failures" here on 
> fast systems that finish these tasks quite quickly.
> There are a few ways forward here:
> 1. Disable this test.
> 2. Relax this check.
> 3. Implement sub-millisecond granularity for task times throughout Spark.
> 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task 
> reference a partition that implements a custom Externalizable.readExternal, 
> which always waits 1ms before returning.
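
A hedged sketch of option 4 above: a partition whose deserialization always 
takes at least 1 ms, so the average {{executorDeserializeTime}} cannot round 
down to zero. This is illustrative only, not the change made in the linked 
pull request.

{code}
import java.io.{Externalizable, ObjectInput, ObjectOutput}

import org.apache.spark.Partition

class SlowDeserializingPartition(var index: Int) extends Partition with Externalizable {
  // Externalizable requires a public no-arg constructor.
  def this() = this(0)

  override def writeExternal(out: ObjectOutput): Unit = out.writeInt(index)

  override def readExternal(in: ObjectInput): Unit = {
    Thread.sleep(1)      // guarantee a measurable deserialize time
    index = in.readInt()
  }
}
{code}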



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19181) SparkListenerSuite.local metrics fails when average executorDeserializeTime is too short.

2018-05-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-19181:
--

Assignee: Attila Zsolt Piros

> SparkListenerSuite.local metrics fails when average executorDeserializeTime 
> is too short.
> -
>
> Key: SPARK-19181
> URL: https://issues.apache.org/jira/browse/SPARK-19181
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>Assignee: Attila Zsolt Piros
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala#L249
> The "local metrics" test asserts that tasks should take more than 1ms on 
> average to complete, even though a code comment notes that this is a small 
> test and tasks may finish faster. I've been seeing some "failures" here on 
> fast systems that finish these tasks quite quickly.
> There are a few ways forward here:
> 1. Disable this test.
> 2. Relax this check.
> 3. Implement sub-millisecond granularity for task times throughout Spark.
> 4. (Imran Rashid's suggestion) Add buffer time by, say, having the task 
> reference a partition that implements a custom Externalizable.readExternal, 
> which always waits 1ms before returning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471128#comment-16471128
 ] 

shahid edited comment on SPARK-24217 at 5/10/18 9:26 PM:
-

Thanks for the clarification Joseph K. Bradley

Is it really required to append the result with the input dataframe? Because 
with the existing implementation, i can able to get the desired output with my 
fix.

 

For eg:

      id       neighbor          similarity                               

       1       [ 2, 3, 4, 5]    [ 1.0, 1.0, 1.0, 1.0]  

       6     [  7, 8 , 9, 10]   [1.0 1.0 1.0 1.0]  

 

Output in *spark.ml*  (With my fix)

     id prediction
 1 0
 2 0
 3 0
 4 0
 5 0
 6 1
7  1
8   1
9  1
  10  1


was (Author: shahid):
Thanks for the clarification Joseph K. Bradley


Is it really required to append the result with the input dataframe? Because 
with the existing implementation, i can able to get the desired output with my 
fix.

 

For eg:

      id       neighbor          similarity                               

       1       [ 2, 3, 4, 5]    [ 1.0, 1.0, 1.0, 1.0]  

       6     [  7, 8 , 9, 10]   [1.0 1.0 1.0 1.0]  

 

Output in *spark.ml*  (With my fix)

     id   prediction  

       1     0
   2 0
   3  0
   4  0
   5 0
   6     1
   7 1
   8  1
   9  1
  10 1

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471128#comment-16471128
 ] 

shahid edited comment on SPARK-24217 at 5/10/18 9:23 PM:
-

Thanks for the clarification Joseph K. Bradley


Is it really required to append the result with the input dataframe? Because 
with the existing implementation, i can able to get the desired output with my 
fix.

 

For eg:

      id       neighbor          similarity                               

       1       [ 2, 3, 4, 5]    [ 1.0, 1.0, 1.0, 1.0]  

       6     [  7, 8 , 9, 10]   [1.0 1.0 1.0 1.0]  

 

Output in *spark.ml*  (With my fix)

     id   prediction  

       1     0
   2 0
   3  0
   4  0
   5 0
   6     1
   7 1
   8  1
   9  1
  10 1


was (Author: shahid):
Thanks for the clarification Joseph K. Bradley


Is it really required to append the result with the input dataframe? Because 
with the existing implementation, i can able to get the desired output with my 
fix.

 

For eg:

      id       neighbor          similarity                               

       1       [ 2, 3, 4, 5]    [ 1.0, 1.0, 1.0, 1.0]  

       6     [  7, 8 , 9, 10]   [1.0 1.0 1.0 1.0]  

 

Output in *spark.ml*  (With my fix)

     idprediction  

      1       0
  2   0
  3   0
  4   0
  5   0
   6      1
   7  1
   8  1
   9   1
  10 1

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471128#comment-16471128
 ] 

shahid edited comment on SPARK-24217 at 5/10/18 9:19 PM:
-

Thanks for the clarification, Joseph K. Bradley.

Is it really required to append the result to the input dataframe? With the 
existing implementation I am able to get the desired output with my fix.

For example, given the input:

      id    neighbor          similarity
       1    [2, 3, 4, 5]      [1.0, 1.0, 1.0, 1.0]
       6    [7, 8, 9, 10]     [1.0, 1.0, 1.0, 1.0]

Output in *spark.ml* (with my fix):

      id    prediction
       1    0
       2    0
       3    0
       4    0
       5    0
       6    1
       7    1
       8    1
       9    1
      10    1


was (Author: shahid):
Thanks for the clarification, Joseph K. Bradley.

Is it really required to append the result to the input dataframe? With the 
existing implementation I am able to get the desired output with my fix.

For example, given the input:

      id    neighbor          similarity
       1    [2, 3, 4, 5]      [1.0, 1.0, 1.0, 1.0]
       6    [7, 8, 9, 10]     [1.0, 1.0, 1.0, 1.0]

Output in *spark.ml* (with my fix):

      id    prediction
       1    0
       2    0
       3    0
       4    0
       5    0
       6    1
       7    1
       8    1
       9    1
      10    1

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24246) Improve AnalysisException by setting the cause when it's available

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24246:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Improve AnalysisException by setting the cause when it's available
> --
>
> Key: SPARK-24246
> URL: https://issues.apache.org/jira/browse/SPARK-24246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> If there is an exception, it's better to set it as the cause of 
> AnalysisException since the exception may contain useful debug information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24246) Improve AnalysisException by setting the cause when it's available

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24246:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Improve AnalysisException by setting the cause when it's available
> --
>
> Key: SPARK-24246
> URL: https://issues.apache.org/jira/browse/SPARK-24246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> If there is an exception, it's better to set it as the cause of 
> AnalysisException since the exception may contain useful debug information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24246) Improve AnalysisException by setting the cause when it's available

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471131#comment-16471131
 ] 

Apache Spark commented on SPARK-24246:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/21297

> Improve AnalysisException by setting the cause when it's available
> --
>
> Key: SPARK-24246
> URL: https://issues.apache.org/jira/browse/SPARK-24246
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> If there is an exception, it's better to set it as the cause of 
> AnalysisException since the exception may contain useful debug information.
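
As a rough illustration of the pattern being proposed (not Spark's actual code; 
AnalysisException's constructor is internal to the sql package, so a plain 
RuntimeException stands in for it here), chaining the original exception as the 
cause preserves its stack trace:

{code}
def resolveTable(name: String): Unit = {
  try {
    // Stand-in for whatever internal call actually fails during analysis.
    throw new IllegalStateException(s"metastore lookup failed for $name")
  } catch {
    case e: Exception =>
      // Passing `e` as the cause keeps the underlying debug information
      // visible in the reported stack trace instead of dropping it.
      throw new RuntimeException(s"cannot resolve table $name", e)
  }
}
{code}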



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-24217:
---
Comment: was deleted

(was: Thanks for the clarification.

Is it really required to append the result to the input dataframe? With the 
existing implementation I am able to get the desired output with my fix.

For example, given the input:

      id    neighbor          similarity
       1    [2, 3, 4, 5]      [1.0, 1.0, 1.0, 1.0]
       6    [7, 8, 9, 10]     [1.0, 1.0, 1.0, 1.0]

Output in *spark.ml* (with my fix):

      id    prediction
       1    0
       2    0
       3    0
       4    0
       5    0
       6    1
       7    1
       8    1
       9    1
      10    1)

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471128#comment-16471128
 ] 

shahid commented on SPARK-24217:


Thanks for the clarification, Joseph K. Bradley.

Is it really required to append the result to the input dataframe? With the 
existing implementation I am able to get the desired output with my fix.

For example, given the input:

      id    neighbor          similarity
       1    [2, 3, 4, 5]      [1.0, 1.0, 1.0, 1.0]
       6    [7, 8, 9, 10]     [1.0, 1.0, 1.0, 1.0]

Output in *spark.ml* (with my fix):

      id    prediction
       1    0
       2    0
       3    0
       4    0
       5    0
       6    1
       7    1
       8    1
       9    1
      10    1

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471040#comment-16471040
 ] 

shahid edited comment on SPARK-24217 at 5/10/18 9:15 PM:
-

Thanks for the clarification.

Is it really required to append the result to the input dataframe? With the 
existing implementation I am able to get the desired output with my fix.

For example, given the input:

      id    neighbor          similarity
       1    [2, 3, 4, 5]      [1.0, 1.0, 1.0, 1.0]
       6    [7, 8, 9, 10]     [1.0, 1.0, 1.0, 1.0]

Output in *spark.ml* (with my fix):

      id    prediction
       1    0
       2    0
       3    0
       4    0
       5    0
       6    1
       7    1
       8    1
       9    1
      10    1


was (Author: shahid):
Thanks for the clarification. 

Is it really required to append the result to the input dataframe? With the 
existing implementation I am able to get the desired output with my fix.

 

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24246) Improve AnalysisException by setting the cause when it's available

2018-05-10 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-24246:


 Summary: Improve AnalysisException by setting the cause when it's 
available
 Key: SPARK-24246
 URL: https://issues.apache.org/jira/browse/SPARK-24246
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


If there is an exception, it's better to set it as the cause of 
AnalysisException since the exception may contain useful debug information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471040#comment-16471040
 ] 

shahid edited comment on SPARK-24217 at 5/10/18 9:11 PM:
-

Thanks for the clarification. 

Is it really required to append the result to the input dataframe? With the 
existing implementation I am able to get the desired output with my fix.

 


was (Author: shahid):
Thanks for the clarification. I am closing the PR.

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: shahid
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23681) Switch OrcFileFormat to newer hadoop.mapreduce output classes

2018-05-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23681:
---
Target Version/s:   (was: 2.3.1)

> Switch OrcFileFormat to newer hadoop.mapreduce output classes
> -
>
> Key: SPARK-23681
> URL: https://issues.apache.org/jira/browse/SPARK-23681
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Steve Loughran
>Priority: Minor
>
> the classes in org.apache.spark.sql.execution.datasources.orc generate their 
> file output writer and bind to an output committer via the old, original, 
> barely maintained {{org.apache.hadoop.mapred.FileOutputFormat}} which is 
> inflexible & doesn't support pluggable committers a la 
> MAPREDUCE-6956/HADOOP-13786.
> Moving to the hadoop.mapreduce packages for this is compatible, & the Spark 
> layer switches over to the maintained codebase & lets you pick up the new 
> committers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24245) Flaky test: KafkaContinuousSinkSuite

2018-05-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471077#comment-16471077
 ] 

Marcelo Vanzin commented on SPARK-24245:


Lowering since it doesn't seem that flaky looking at previous failures 
(although those ended up failing before this test could run).

> Flaky test: KafkaContinuousSinkSuite
> 
>
> Key: SPARK-24245
> URL: https://issues.apache.org/jira/browse/SPARK-24245
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.3.1
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> This test seems to be broken or flaky. From jenkins:
> https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/367/
> {noformat}
> Caused by: org.scalatest.exceptions.TestFailedException: -1 did not equal 0
> at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> at 
> org.apache.spark.sql.kafka010.KafkaContinuousTest$$anonfun$afterEach$1.apply(KafkaContinuousTest.scala:76)
> at 
> org.apache.spark.sql.kafka010.KafkaContinuousTest$$anonfun$afterEach$1.apply(KafkaContinuousTest.scala:76)
> at 
> org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:395)
> at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:409)
> ... 54 more
> SUITE ABORTED - KafkaContinuousSinkSuite: The code passed to eventually never 
> returned normally. Attempted 1990 times over 30.007800773 seconds. Last 
> failure message: -1 did not equal 0.
> {noformat}
> We should fix it or disable it in 2.3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24245) Flaky test: KafkaContinuousSinkSuite

2018-05-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24245:
---
Target Version/s:   (was: 2.3.1)

> Flaky test: KafkaContinuousSinkSuite
> 
>
> Key: SPARK-24245
> URL: https://issues.apache.org/jira/browse/SPARK-24245
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.3.1
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This test seems to be broken or flaky. From jenkins:
> https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/367/
> {noformat}
> Caused by: org.scalatest.exceptions.TestFailedException: -1 did not equal 0
> at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> at 
> org.apache.spark.sql.kafka010.KafkaContinuousTest$$anonfun$afterEach$1.apply(KafkaContinuousTest.scala:76)
> at 
> org.apache.spark.sql.kafka010.KafkaContinuousTest$$anonfun$afterEach$1.apply(KafkaContinuousTest.scala:76)
> at 
> org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:395)
> at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:409)
> ... 54 more
> SUITE ABORTED - KafkaContinuousSinkSuite: The code passed to eventually never 
> returned normally. Attempted 1990 times over 30.007800773 seconds. Last 
> failure message: -1 did not equal 0.
> {noformat}
> We should fix it or disable it in 2.3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24245) Flaky test: KafkaContinuousSinkSuite

2018-05-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24245:
---
Priority: Major  (was: Critical)

> Flaky test: KafkaContinuousSinkSuite
> 
>
> Key: SPARK-24245
> URL: https://issues.apache.org/jira/browse/SPARK-24245
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.3.1
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This test seems to be broken or flaky. From jenkins:
> https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/367/
> {noformat}
> Caused by: org.scalatest.exceptions.TestFailedException: -1 did not equal 0
> at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> at 
> org.apache.spark.sql.kafka010.KafkaContinuousTest$$anonfun$afterEach$1.apply(KafkaContinuousTest.scala:76)
> at 
> org.apache.spark.sql.kafka010.KafkaContinuousTest$$anonfun$afterEach$1.apply(KafkaContinuousTest.scala:76)
> at 
> org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:395)
> at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:409)
> ... 54 more
> SUITE ABORTED - KafkaContinuousSinkSuite: The code passed to eventually never 
> returned normally. Attempted 1990 times over 30.007800773 seconds. Last 
> failure message: -1 did not equal 0.
> {noformat}
> We should fix it or disable it in 2.3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24245) Flaky test: KafkaContinuousSinkSuite

2018-05-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24245:
--

 Summary: Flaky test: KafkaContinuousSinkSuite
 Key: SPARK-24245
 URL: https://issues.apache.org/jira/browse/SPARK-24245
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming, Tests
Affects Versions: 2.3.1
Reporter: Marcelo Vanzin


This test seems to be broken or flaky. From jenkins:

https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/367/

{noformat}
Caused by: org.scalatest.exceptions.TestFailedException: -1 did not equal 0
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
at 
org.apache.spark.sql.kafka010.KafkaContinuousTest$$anonfun$afterEach$1.apply(KafkaContinuousTest.scala:76)
at 
org.apache.spark.sql.kafka010.KafkaContinuousTest$$anonfun$afterEach$1.apply(KafkaContinuousTest.scala:76)
at 
org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:395)
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:409)
... 54 more
SUITE ABORTED - KafkaContinuousSinkSuite: The code passed to eventually never 
returned normally. Attempted 1990 times over 30.007800773 seconds. Last failure 
message: -1 did not equal 0.
{noformat}

We should fix it or disable it in 2.3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread spark_user (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471040#comment-16471040
 ] 

spark_user commented on SPARK-24217:


Thanks for the clarification. I am closing the PR.

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display prediction and id corresponding to all the nodes.  
> Currently PIC is not returning the cluster indices of neighbour IDs which are 
> not there in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes

2018-05-10 Thread Yinan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yinan Li updated SPARK-24137:
-
Fix Version/s: (was: 2.3.1)

> [K8s] Mount temporary directories in emptydir volumes
> -
>
> Key: SPARK-24137
> URL: https://issues.apache.org/jira/browse/SPARK-24137
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the Spark local directories do not get any volumes and volume 
> mounts, which means we're writing Spark shuffle and cache contents to the 
> file system mounted by Docker. This can be terribly inefficient. We should 
> use emptydir volumes for these directories instead for significant 
> performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24230) With Parquet 1.10 upgrade has errors in the vectorized reader

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24230:


Assignee: Apache Spark

> With Parquet 1.10 upgrade has errors in the vectorized reader
> -
>
> Key: SPARK-24230
> URL: https://issues.apache.org/jira/browse/SPARK-24230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ian O Connell
>Assignee: Apache Spark
>Priority: Major
>
> When reading some parquet files, you can get an error like:
> java.io.IOException: expecting more rows but reached last block. Read 0 out 
> of 1194236
> This happens when looking for a needle that's pretty rare in a large haystack.
>  
> The issue here I believe is that the total row count is calculated at
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L229]
>  
> But we pass the blocks we filtered via 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups
> to the ParquetFileReader constructor.
>  
> However the ParquetFileReader constructor will filter the list of blocks 
> again using
>  
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L737]
>  
> If a block is filtered out by the latter method and not the former, the 
> vectorized reader will believe it should see more rows than it will.
> The fix I used locally is pretty straightforward:
> {code:java}
> for (BlockMetaData block : blocks) {
> this.totalRowCount += block.getRowCount();
> }
> {code}
> goes to
> {code:java}
> this.totalRowCount = this.reader.getRecordCount();
> {code}
> [~rdblue] do you know if this sounds right? The second filter method in the 
> ParquetFileReader might filter more blocks leading to the count being off? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24244) Parse only required columns of CSV file

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470947#comment-16470947
 ] 

Apache Spark commented on SPARK-24244:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21296

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> uniVocity parser allows specifying only the required column names or indexes 
> for parsing, like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> Need to modify *UnivocityParser* to extract only needed columns from 
> requiredSchema
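
For illustration, a minimal spark-shell sketch of the user-facing case this 
targets (the path and column names are made up); only the selected columns end 
up in requiredSchema, so only they would need to be parsed:

{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("score", DoubleType)))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/tmp/example.csv")   // hypothetical input file
  .select("id", "score")     // requiredSchema contains only these two columns

df.explain()  // ReadSchema in the plan reflects only the required columns
{code}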



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24230) With Parquet 1.10 upgrade has errors in the vectorized reader

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470948#comment-16470948
 ] 

Apache Spark commented on SPARK-24230:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21295

> With Parquet 1.10 upgrade has errors in the vectorized reader
> -
>
> Key: SPARK-24230
> URL: https://issues.apache.org/jira/browse/SPARK-24230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ian O Connell
>Priority: Major
>
> When reading some parquet files, you can get an error like:
> java.io.IOException: expecting more rows but reached last block. Read 0 out 
> of 1194236
> This happens when looking for a needle that's pretty rare in a large haystack.
>  
> The issue here I believe is that the total row count is calculated at
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L229]
>  
> But we pass the blocks we filtered via 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups
> to the ParquetFileReader constructor.
>  
> However the ParquetFileReader constructor will filter the list of blocks 
> again using
>  
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L737]
>  
> If a block is filtered out by the latter method and not the former, the 
> vectorized reader will believe it should see more rows than it will.
> The fix I used locally is pretty straightforward:
> {code:java}
> for (BlockMetaData block : blocks) {
> this.totalRowCount += block.getRowCount();
> }
> {code}
> goes to
> {code:java}
> this.totalRowCount = this.reader.getRecordCount();
> {code}
> [~rdblue] do you know if this sounds right? The second filter method in the 
> ParquetFileReader might filter more blocks leading to the count being off? 
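
A self-contained toy sketch of the mismatch described above (made-up block 
metadata, no Parquet APIs): counting rows over the pre-filtered block list 
overestimates what the reader will actually return once it filters the blocks a 
second time.

{code}
case class Block(rowCount: Long, min: Int, max: Int)

val allBlocks      = Seq(Block(100, 0, 9), Block(50, 10, 19))
val sparkFiltered  = allBlocks.filter(_.max >= 5)        // filtering done by Spark
val readerFiltered = sparkFiltered.filter(_.min >= 10)   // reader filters once more

val expectedRows = sparkFiltered.map(_.rowCount).sum     // 150: what the old code sums up
val actualRows   = readerFiltered.map(_.rowCount).sum    // 50: what the reader returns
// expectedRows != actualRows mirrors "expecting more rows but reached last block".
{code}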



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24244) Parse only required columns of CSV file

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24244:


Assignee: Apache Spark

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> uniVocity parser allows specifying only the required column names or indexes 
> for parsing, like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> Need to modify *UnivocityParser* to extract only needed columns from 
> requiredSchema



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24230) With Parquet 1.10 upgrade has errors in the vectorized reader

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24230:


Assignee: (was: Apache Spark)

> With Parquet 1.10 upgrade has errors in the vectorized reader
> -
>
> Key: SPARK-24230
> URL: https://issues.apache.org/jira/browse/SPARK-24230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ian O Connell
>Priority: Major
>
> When reading some parquet files, you can get an error like:
> java.io.IOException: expecting more rows but reached last block. Read 0 out 
> of 1194236
> This happens when looking for a needle that's pretty rare in a large haystack.
>  
> The issue here I believe is that the total row count is calculated at
> [https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L229]
>  
> But we pass the blocks we filtered via 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups
> to the ParquetFileReader constructor.
>  
> However the ParquetFileReader constructor will filter the list of blocks 
> again using
>  
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L737]
>  
> If a block is filtered out by the latter method and not the former, the 
> vectorized reader will believe it should see more rows than it will.
> The fix I used locally is pretty straightforward:
> {code:java}
> for (BlockMetaData block : blocks) {
> this.totalRowCount += block.getRowCount();
> }
> {code}
> goes to
> {code:java}
> this.totalRowCount = this.reader.getRecordCount();
> {code}
> [~rdblue] do you know if this sounds right? The second filter method in the 
> ParquetFileReader might filter more blocks leading to the count being off? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24244) Parse only required columns of CSV file

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24244:


Assignee: (was: Apache Spark)

> Parse only required columns of CSV file
> ---
>
> Key: SPARK-24244
> URL: https://issues.apache.org/jira/browse/SPARK-24244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> uniVocity parser allows specifying only the required column names or indexes 
> for parsing, like:
> {code}
> // Here we select only the columns by their indexes.
> // The parser just skips the values in other columns
> parserSettings.selectIndexes(4, 0, 1);
> CsvParser parser = new CsvParser(parserSettings);
> {code}
> Need to modify *UnivocityParser* to extract only needed columns from 
> requiredSchema



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24244) Parse only required columns of CSV file

2018-05-10 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24244:
--

 Summary: Parse only required columns of CSV file
 Key: SPARK-24244
 URL: https://issues.apache.org/jira/browse/SPARK-24244
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


uniVocity parser allows specifying only the required column names or indexes for 
parsing, like:
{code}
// Here we select only the columns by their indexes.
// The parser just skips the values in other columns
parserSettings.selectIndexes(4, 0, 1);
CsvParser parser = new CsvParser(parserSettings);
{code}

Need to modify *UnivocityParser* to extract only needed columns from 
requiredSchema



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24228) Fix the lint error

2018-05-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24228:
--
Priority: Minor  (was: Major)

> Fix the lint error
> --
>
> Key: SPARK-24228
> URL: https://issues.apache.org/jira/browse/SPARK-24228
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Minor
>
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8]
>  (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream.
> [ERROR] 
> src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8]
>  (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes

2018-05-10 Thread Yinan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yinan Li updated SPARK-24137:
-
Fix Version/s: 2.3.1

> [K8s] Mount temporary directories in emptydir volumes
> -
>
> Key: SPARK-24137
> URL: https://issues.apache.org/jira/browse/SPARK-24137
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 2.3.1, 3.0.0
>
>
> Currently the Spark local directories do not get any volumes and volume 
> mounts, which means we're writing Spark shuffle and cache contents to the 
> file system mounted by Docker. This can be terribly inefficient. We should 
> use emptydir volumes for these directories instead for significant 
> performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes

2018-05-10 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan reassigned SPARK-24137:
--

Assignee: Matt Cheah

> [K8s] Mount temporary directories in emptydir volumes
> -
>
> Key: SPARK-24137
> URL: https://issues.apache.org/jira/browse/SPARK-24137
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the Spark local directories do not get any volumes and volume 
> mounts, which means we're writing Spark shuffle and cache contents to the 
> file system mounted by Docker. This can be terribly inefficient. We should 
> use emptydir volumes for these directories instead for significant 
> performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24137) [K8s] Mount temporary directories in emptydir volumes

2018-05-10 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan resolved SPARK-24137.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21238
[https://github.com/apache/spark/pull/21238]

> [K8s] Mount temporary directories in emptydir volumes
> -
>
> Key: SPARK-24137
> URL: https://issues.apache.org/jira/browse/SPARK-24137
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the Spark local directories do not get any volumes and volume 
> mounts, which means we're writing Spark shuffle and cache contents to the 
> file system mounted by Docker. This can be terribly inefficient. We should 
> use emptydir volumes for these directories instead for significant 
> performance improvements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24225) Support closing AutoClosable objects in MemoryStore so Broadcast Variables can be released properly

2018-05-10 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24225:

Shepherd: DB Tsai
Priority: Major  (was: Minor)

> Support closing AutoClosable objects in MemoryStore so Broadcast Variables 
> can be released properly
> ---
>
> Key: SPARK-24225
> URL: https://issues.apache.org/jira/browse/SPARK-24225
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 1.6.3, 2.2.0, 2.2.1, 2.3.0
>Reporter: Doug Rohrer
>Assignee: Doug Rohrer
>Priority: Major
>
> When using Broadcast Variables, it would be beneficial if classes 
> implementing AutoClosable were closed when released. This would allow 
> broadcast variables to be used, for example, as shared resource pools across 
> multiple tasks within an executor without the use of static variables. 
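
A minimal spark-shell sketch of the pattern the ticket targets (DummyPool is 
made up for illustration): today destroying the broadcast only drops the stored 
blocks, while the proposal would have the MemoryStore also call close() on 
AutoCloseable values it releases.

{code}
class DummyPool extends AutoCloseable with Serializable {
  def lease(): String = "resource"                      // hand out a pooled resource
  override def close(): Unit = println("pool closed")   // release underlying resources
}

val pool = sc.broadcast(new DummyPool)
sc.parallelize(1 to 4).foreach(_ => pool.value.lease())

// Releases the broadcast blocks; with this change, close() would be invoked too.
pool.destroy()
{code}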



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24225) Support closing AutoClosable objects in MemoryStore so Broadcast Variables can be released properly

2018-05-10 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24225:

Issue Type: New Feature  (was: Improvement)

> Support closing AutoClosable objects in MemoryStore so Broadcast Variables 
> can be released properly
> ---
>
> Key: SPARK-24225
> URL: https://issues.apache.org/jira/browse/SPARK-24225
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager
>Affects Versions: 1.6.3, 2.2.0, 2.2.1, 2.3.0
>Reporter: Doug Rohrer
>Assignee: Doug Rohrer
>Priority: Major
>
> When using Broadcast Variables, it would be beneficial if classes 
> implementing AutoClosable were closed when released. This would allow 
> broadcast variables to be used, for example, as shared resource pools across 
> multiple tasks within an executor without the use of static variables. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24225) Support closing AutoClosable objects in MemoryStore so Broadcast Variables can be released properly

2018-05-10 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24225:
---

Assignee: Doug Rohrer

> Support closing AutoClosable objects in MemoryStore so Broadcast Variables 
> can be released properly
> ---
>
> Key: SPARK-24225
> URL: https://issues.apache.org/jira/browse/SPARK-24225
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 1.6.3, 2.2.0, 2.2.1, 2.3.0
>Reporter: Doug Rohrer
>Assignee: Doug Rohrer
>Priority: Minor
>
> When using Broadcast Variables, it would be beneficial if classes 
> implementing AutoClosable were closed when released. This would allow 
> broadcast variables to be used, for example, as shared resource pools across 
> multiple tasks within an executor without the use of static variables. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-05-10 Thread Franck Tago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470843#comment-16470843
 ] 

Franck Tago commented on SPARK-23519:
-

I do not agree with the 'typical database' claim.

MySQL, Oracle, and Hive support this syntax.

 

example

!image-2018-05-10-10-48-57-259.png!

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1. Create and populate a hive table. I did this in a hive CLI session [not 
> that this matters]:
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2. Create a view from the table. [These actions were performed from a spark 
> shell.]
> spark.sql("create view default.aview (int1, int2) as select col1, col1 
> from atable")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
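
A minimal spark-shell reproduction following the steps above; the aliased 
variant at the end is only a possible work-around under these assumptions, not 
a fix for this issue.

{code}
spark.sql("create table atable (col1 int)")
spark.sql("insert into atable values (10), (100)")

// Fails: "The view output (col1,col1) contains duplicate column name"
// spark.sql("create view default.aview (int1, int2) as select col1, col1 from atable")

// Giving the second occurrence a distinct alias makes the query output names
// unique, which may avoid the assertion while this issue is open.
spark.sql("create view default.aview (int1, int2) as select col1, col1 as col1_dup from atable")
{code}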



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2018-05-10 Thread Franck Tago (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago updated SPARK-23519:

Attachment: image-2018-05-10-10-48-57-259.png

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1. Create and populate a hive table. I did this in a hive CLI session [not 
> that this matters]:
> create table atable (col1 int);
> insert into atable values (10), (100);
> 2. Create a view from the table. [These actions were performed from a spark 
> shell.]
> spark.sql("create view default.aview (int1, int2) as select col1, col1 
> from atable")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-05-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470797#comment-16470797
 ] 

Marcelo Vanzin commented on SPARK-24243:


Sure. As long as the child process handle returns something meaningful.

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread, any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object  to the application rather 
> than logging it and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} / control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.
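
A sketch of what the caller side might look like: the listener wiring below is 
existing launcher API, but the error accessor shown in the comment is 
hypothetical and only illustrates what this ticket asks for.

{code}
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

val handle = new InProcessLauncher()
  .setMainClass("com.example.MyApp")   // hypothetical application class
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit = {
      if (h.getState == SparkAppHandle.State.FAILED) {
        // Proposed, hypothetical accessor exposing the underlying Throwable:
        // h.getError().ifPresent(t => println(t.getMessage))
      }
    }
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
{code}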



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-05-10 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470789#comment-16470789
 ] 

Edwina Lu commented on SPARK-23206:
---

[~irashid], I do not have the rest of the changes for 2.3/master yet. The 
original changes were done for 2.1 and are also somewhat different.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
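
To make the sizing arithmetic above concrete, here is a small illustrative calculation; 
the executor memory, peak JVM usage, and executor count below are assumed example values 
chosen to roughly match the quoted 35% figure, not real cluster measurements:
{code:scala}
// Illustrative only: assumed example numbers, not actual cluster data.
val executorMemoryGb = 16.0   // spark.executor.memory
val maxJvmUsedGb     = 5.6    // peak JVM used memory observed across executors
val numExecutors     = 57

val usedFraction = maxJvmUsedGb / executorMemoryGb                    // ~0.35
val unusedGb     = numExecutors * (executorMemoryGb - maxJvmUsedGb)   // ~593 GB

println(f"used: ${usedFraction * 100}%.0f%%, unused per application: $unusedGb%.0f GB")
{code}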



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23206) Additional Memory Tuning Metrics

2018-05-10 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470777#comment-16470777
 ] 

Felix Cheung edited comment on SPARK-23206 at 5/10/18 5:20 PM:
---

Yes, for us it is network and disk I/O stats. We have been discussing this with 
Edwina and her team.


was (Author: felixcheung):
yes, for use network and disk IO stats. We have been discussing with Edwina and 
her team.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-05-10 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470777#comment-16470777
 ] 

Felix Cheung commented on SPARK-23206:
--

yes, for use network and disk IO stats. We have been discussing with Edwina and 
her team.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23458) Flaky test: OrcQuerySuite

2018-05-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470735#comment-16470735
 ] 

Dongjoon Hyun edited comment on SPARK-23458 at 5/10/18 5:10 PM:


Oh, I missed your ping here, [~smilegator]. According to the given log, the 
remaining flakiness of HiveExternalCatalogVersionsSuite seems to be 
`Py4JJavaError`. It's weird.
{code}
2018-05-07 23:14:55.233 - stderr> SLF4J: Class path contains multiple SLF4J 
bindings.
2018-05-07 23:14:55.233 - stderr> SLF4J: Found binding in 
[jar:file:/tmp/test-spark/spark-2.0.2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2018-05-07 23:14:55.233 - stderr> SLF4J: Found binding in 
[jar:file:/home/sparkivy/per-executor-caches/4/.ivy2/cache/org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2018-05-07 23:14:55.233 - stderr> SLF4J: See 
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2018-05-07 23:14:55.233 - stderr> SLF4J: Actual binding is of type 
[org.slf4j.impl.Log4jLoggerFactory]
2018-05-07 23:14:55.532 - stdout> 23:14:55.532 WARN 
org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
2018-05-07 23:14:57.982 - stdout> 23:14:57.982 WARN DataNucleus.General: Plugin 
(Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple 
JAR versions of the same plugin in the classpath. The URL 
"file:/tmp/test-spark/spark-2.0.2/jars/datanucleus-core-3.2.10.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file:/home/sparkivy/per-executor-caches/4/.ivy2/cache/org.datanucleus/datanucleus-core/jars/datanucleus-core-3.2.10.jar."
2018-05-07 23:14:57.988 - stdout> 23:14:57.988 WARN DataNucleus.General: Plugin 
(Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
multiple JAR versions of the same plugin in the classpath. The URL 
"file:/tmp/test-spark/spark-2.0.2/jars/datanucleus-api-jdo-3.2.6.jar" is 
already registered, and you are trying to register an identical plugin located 
at URL 
"file:/home/sparkivy/per-executor-caches/4/.ivy2/cache/org.datanucleus/datanucleus-api-jdo/jars/datanucleus-api-jdo-3.2.6.jar."
2018-05-07 23:14:57.99 - stdout> 23:14:57.990 WARN DataNucleus.General: Plugin 
(Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont 
have multiple JAR versions of the same plugin in the classpath. The URL 
"file:/home/sparkivy/per-executor-caches/4/.ivy2/cache/org.datanucleus/datanucleus-rdbms/jars/datanucleus-rdbms-3.2.9.jar"
 is already registered, and you are trying to register an identical plugin 
located at URL 
"file:/tmp/test-spark/spark-2.0.2/jars/datanucleus-rdbms-3.2.9.jar."
2018-05-07 23:15:17.844 - stdout> 23:15:17.843 WARN 
org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so recording the 
schema version 1.2.0
2018-05-07 23:15:18.152 - stdout> 23:15:18.152 WARN 
org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, 
returning NoSuchObjectException
2018-05-07 23:15:22.32 - stdout> Traceback (most recent call last):
2018-05-07 23:15:22.32 - stdout>   File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test8334480132298691726.py",
 line 8, in <module>
2018-05-07 23:15:22.32 - stdout> spark.sql("create table data_source_tbl_{} 
using json as select 1 i".format(version_index))
2018-05-07 23:15:22.32 - stdout>   File 
"/tmp/test-spark/spark-2.0.2/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 543, in sql
2018-05-07 23:15:22.321 - stdout>   File 
"/tmp/test-spark/spark-2.0.2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
2018-05-07 23:15:22.321 - stdout>   File 
"/tmp/test-spark/spark-2.0.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
63, in deco
2018-05-07 23:15:22.321 - stdout>   File 
"/tmp/test-spark/spark-2.0.2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", 
line 319, in get_return_value
2018-05-07 23:15:22.322 - stdout> py4j.protocol.Py4JJavaError: An error 
occurred while calling o28.sql.
2018-05-07 23:15:22.322 - stdout> : java.lang.ExceptionInInitializerError
2018-05-07 23:15:22.322 - stdout>   at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
{code}

Previously, I've been monitoring [Spark QA 
Dashboard|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)].

Actually, `HiveExternalCatalogVersionsSuite` passes in those branches. Only the 
`spark-master-test-sbt-hadoop-2.7` branch dies for other reasons.

- 4439 Build timed out (after 275 minutes) during PySpark testing.
- 4438 terminated by signal 9
- 4437 Build timed out (after 275 minutes) during SparkR testing.
- 4436 Build timed out (after 275 

[jira] [Commented] (SPARK-23458) Flaky test: OrcQuerySuite

2018-05-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470735#comment-16470735
 ] 

Dongjoon Hyun commented on SPARK-23458:
---

Oh, I missed your ping here, [~smilegator]. According to the given log, the 
remaining flakiness of HiveExternalCatalogVersionsSuite seems to be 
`Py4JJavaError`. It's weird.
{code}
2018-05-07 23:14:55.233 - stderr> SLF4J: Class path contains multiple SLF4J 
bindings.
2018-05-07 23:14:55.233 - stderr> SLF4J: Found binding in 
[jar:file:/tmp/test-spark/spark-2.0.2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2018-05-07 23:14:55.233 - stderr> SLF4J: Found binding in 
[jar:file:/home/sparkivy/per-executor-caches/4/.ivy2/cache/org.slf4j/slf4j-log4j12/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2018-05-07 23:14:55.233 - stderr> SLF4J: See 
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2018-05-07 23:14:55.233 - stderr> SLF4J: Actual binding is of type 
[org.slf4j.impl.Log4jLoggerFactory]
2018-05-07 23:14:55.532 - stdout> 23:14:55.532 WARN 
org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
2018-05-07 23:14:57.982 - stdout> 23:14:57.982 WARN DataNucleus.General: Plugin 
(Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple 
JAR versions of the same plugin in the classpath. The URL 
"file:/tmp/test-spark/spark-2.0.2/jars/datanucleus-core-3.2.10.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file:/home/sparkivy/per-executor-caches/4/.ivy2/cache/org.datanucleus/datanucleus-core/jars/datanucleus-core-3.2.10.jar."
2018-05-07 23:14:57.988 - stdout> 23:14:57.988 WARN DataNucleus.General: Plugin 
(Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
multiple JAR versions of the same plugin in the classpath. The URL 
"file:/tmp/test-spark/spark-2.0.2/jars/datanucleus-api-jdo-3.2.6.jar" is 
already registered, and you are trying to register an identical plugin located 
at URL 
"file:/home/sparkivy/per-executor-caches/4/.ivy2/cache/org.datanucleus/datanucleus-api-jdo/jars/datanucleus-api-jdo-3.2.6.jar."
2018-05-07 23:14:57.99 - stdout> 23:14:57.990 WARN DataNucleus.General: Plugin 
(Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont 
have multiple JAR versions of the same plugin in the classpath. The URL 
"file:/home/sparkivy/per-executor-caches/4/.ivy2/cache/org.datanucleus/datanucleus-rdbms/jars/datanucleus-rdbms-3.2.9.jar"
 is already registered, and you are trying to register an identical plugin 
located at URL 
"file:/tmp/test-spark/spark-2.0.2/jars/datanucleus-rdbms-3.2.9.jar."
2018-05-07 23:15:17.844 - stdout> 23:15:17.843 WARN 
org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so recording the 
schema version 1.2.0
2018-05-07 23:15:18.152 - stdout> 23:15:18.152 WARN 
org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database default, 
returning NoSuchObjectException
2018-05-07 23:15:22.32 - stdout> Traceback (most recent call last):
2018-05-07 23:15:22.32 - stdout>   File 
"/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test8334480132298691726.py",
 line 8, in <module>
2018-05-07 23:15:22.32 - stdout> spark.sql("create table data_source_tbl_{} 
using json as select 1 i".format(version_index))
2018-05-07 23:15:22.32 - stdout>   File 
"/tmp/test-spark/spark-2.0.2/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 543, in sql
2018-05-07 23:15:22.321 - stdout>   File 
"/tmp/test-spark/spark-2.0.2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
2018-05-07 23:15:22.321 - stdout>   File 
"/tmp/test-spark/spark-2.0.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 
63, in deco
2018-05-07 23:15:22.321 - stdout>   File 
"/tmp/test-spark/spark-2.0.2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", 
line 319, in get_return_value
2018-05-07 23:15:22.322 - stdout> py4j.protocol.Py4JJavaError: An error 
occurred while calling o28.sql.
2018-05-07 23:15:22.322 - stdout> : java.lang.ExceptionInInitializerError
2018-05-07 23:15:22.322 - stdout>   at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
{code}

Previously, I've been monitoring [Spark QA 
Dashboard|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)].
 I'll take a look at test branches, too.

>  Flaky test: OrcQuerySuite
> --
>
> Key: SPARK-23458
> URL: https://issues.apache.org/jira/browse/SPARK-23458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
> Environment: AMPLab Jenkins
>Reporter: 

[jira] [Commented] (SPARK-24213) Power Iteration Clustering in SparkML throws an exception when the ID is IntType

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470705#comment-16470705
 ] 

Joseph K. Bradley commented on SPARK-24213:
---

On the topic of eating my words, please check out my new comment here: 
[SPARK-15784].  We may need to rework the API.

> Power Iteration Clustering in SparkML throws an exception when the ID is 
> IntType
> --
>
> Key: SPARK-24213
> URL: https://issues.apache.org/jira/browse/SPARK-24213
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> While running the code, PowerIterationClustering in Spark ML throws an exception.
> {code:scala}
> val data = spark.createDataFrame(Seq(
> (0, Array(1), Array(0.9)),
> (1, Array(2), Array(0.9)),
> (2, Array(3), Array(0.9)),
> (3, Array(4), Array(0.1)),
> (4, Array(5), Array(0.9))
> )).toDF("id", "neighbors", "similarities")
> val result = new PowerIterationClustering()
> .setK(2)
> .setMaxIter(10)
> .setInitMode("random")
> .transform(data)
> .select("id","prediction")
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given 
> input columns: [id, neighbors, similarities];;
> 'Project [id#215, 'prediction]
> +- AnalysisBarrier
>   +- Project [id#215, neighbors#216, similarities#217]
>  +- Join Inner, (id#215 = id#234)
> :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS 
> similarities#217]
> :  +- LocalRelation [_1#209, _2#210, _3#211]
> +- Project [cast(id#230L as int) AS id#234]
>+- LogicalRDD [id#230L, prediction#231], false
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470704#comment-16470704
 ] 

Joseph K. Bradley commented on SPARK-24217:
---

On the topic of eating my words, please check out my new comment here: 
[SPARK-15784].  We may need to rework the API.

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and id for all the nodes. Currently PIC is 
> not returning the cluster indices for neighbour IDs that are not present in 
> the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470701#comment-16470701
 ] 

Joseph K. Bradley edited comment on SPARK-15784 at 5/10/18 4:45 PM:


So... we originally agreed to make this a Transformer (in the discussion 
above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't 
have this be a Row -> Row Transformer:
* The input data needs to have one graph edge pair (i,j) for each edge, not 
duplicated ones (i,j) and (j,i).
* That means that there could be between 0 and numVertices/2 vertices which do 
not have corresponding Rows.

This greatly lessens the value of presenting this as a Transformer.  I 
recommend we rewrite the API before Spark 2.4 and make PIC a utility, not a 
Transformer.  We can have it inherit from Params but not make it a Transformer.

How does this sound?


was (Author: josephkb):
So... we originally agreed to make this a Transformer (in the discussion 
above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't 
have this be a Row -> Row Transformer:
* The input data needs to have one graph edge pair (i,j) for each edge, not 
duplicated ones (i,j) and (j,i).
* That means that there could be between 0 and numVertices/2 vertices which do 
not have corresponding Rows.

This greatly lessens the value of presenting this as a Transformer.  I 
recommend we rewrite the API before Spark 2.4 and make PIC a utility in 
spark.ml.stat.  We can have it inherit from Params but not make it a 
Transformer.

How does this sound?

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470701#comment-16470701
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

So... we originally agreed to make this a Transformer (in the discussion 
above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't 
have this be a Row -> Row Transformer:
* The input data needs to have one graph edge pair (i,j) for each edge, not 
duplicated ones (i,j) and (j,i).
* That means that there could be between 0 and numVertices/2 vertices which do 
not have corresponding Rows.

This greatly lessens the value of presenting this as a Transformer.  I 
recommend we rewrite the API before Spark 2.4 and make PIC a utility in 
spark.ml.stat.  We can have it inherit from Params but not make it a 
Transformer.

How does this sound?
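
For illustration, here is a minimal sketch of the input constraint described above 
(each undirected edge supplied exactly once). The flat src/dst/weight columns are 
assumptions for the example, not the proposed PIC schema, and a spark-shell session 
is assumed so that {{spark.implicits}} are in scope:
{code:scala}
// Sketch only: normalize an undirected edge list so each pair (i, j) appears once.
import org.apache.spark.sql.functions.{least, greatest}
import spark.implicits._   // provided automatically in spark-shell

val edges = Seq((0L, 1L, 0.9), (1L, 0L, 0.9), (1L, 2L, 0.5))
  .toDF("src", "dst", "weight")

val undirected = edges
  .select(least($"src", $"dst").as("i"), greatest($"src", $"dst").as("j"), $"weight")
  .distinct()   // one row per undirected edge, with i <= j

undirected.show()
{code}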

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469562#comment-16469562
 ] 

Joseph K. Bradley edited comment on SPARK-24217 at 5/10/18 4:37 PM:


Update: I'll eat my words!  I should have read the docs more carefully (where I 
missed the note that there should be exactly 1 reference from one node to 
another).  This is actually a major problem with our design for PIC, which 
can't really be a Row -> Row Transformer.  Will think more about this and 
re-post.


was (Author: josephkb):
But the reason that the IDs are missing from the "id" column is that the input 
is not symmetric.  If it were made symmetric, then there could not be any 
missing IDs.

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and id for all the nodes. Currently PIC is 
> not returning the cluster indices for neighbour IDs that are not present in 
> the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-05-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24068:
-
Fix Version/s: 2.3.1

> CSV schema inferring doesn't work for compressed files
> --
>
> Key: SPARK-24068
> URL: https://issues.apache.org/jira/browse/SPARK-24068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Here is a simple csv file compressed by lzo
> {code}
> $ cat ./test.csv
> col1,col2
> a,1
> $ lzop ./test.csv
> $ ls
> test.csv test.csv.lzo
> {code}
> Reading test.csv.lzo with LZO codec (see 
> https://github.com/twitter/hadoop-lzo, for example):
> {code:scala}
> scala> val ds = spark.read.option("header", true).option("inferSchema", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
> ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
> scala> ds.printSchema
> root
>  |-- �LZO: string (nullable = true)
> scala> ds.show
> +-+
> |�LZO|
> +-+
> |a|
> +-+
> {code}
> but the file can be read if the schema is specified:
> {code}
> scala> import org.apache.spark.sql.types._
> scala> val schema = new StructType().add("col1", StringType).add("col2", 
> IntegerType)
> scala> val ds = spark.read.schema(schema).option("header", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
> scala> ds.show
> +++
> |col1|col2|
> +++
> |   a|   1|
> +++
> {code}
> Just in case, schema inferring works for the original uncompressed file:
> {code:scala}
> scala> spark.read.option("header", true).option("inferSchema", 
> true).csv("test.csv").printSchema
> root
>  |-- col1: string (nullable = true)
>  |-- col2: integer (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470634#comment-16470634
 ] 

Dongjoon Hyun edited comment on SPARK-24204 at 5/10/18 4:29 PM:


Thank you for pinging me, [~maropu]. Could you make a PR with your patch?
We need a general patch for JSON/Parquet/ORC like CSV.
cc [~smilegator]


was (Author: dongjoon):
Thank you for pinging me, [~maropu]. Could you make a PR with your patch?
We need a general patch for JSON/Parquet/ORC like CSV.

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> *SUMMARY*
> - CSV: Raising analysis exception.
> - JSON: dropping columns with null types
> - Parquet/ORC: raising runtime exceptions
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed;
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> is done for the CSV format;
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65
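
For illustration, a rough sketch of the kind of driver-side verification being 
suggested; the helper name verifyWriteSchema and the exception type are assumptions 
for the example, not Spark's actual implementation:
{code:scala}
// Hypothetical driver-side check, sketched for illustration only: walk the write
// schema and fail fast on an unsupported type (NullType here) before tasks start.
import org.apache.spark.sql.types._

def verifyWriteSchema(schema: StructType): Unit = {
  def check(dt: DataType, path: String): Unit = dt match {
    case NullType =>
      throw new UnsupportedOperationException(
        s"Column '$path' has NullType, which this data source cannot write")
    case s: StructType => s.fields.foreach(f => check(f.dataType, s"$path.${f.name}"))
    case a: ArrayType  => check(a.elementType, s"$path.element")
    case m: MapType    => check(m.keyType, s"$path.key"); check(m.valueType, s"$path.value")
    case _             => ()  // supported leaf type, nothing to do
  }
  schema.fields.foreach(f => check(f.dataType, f.name))
}
{code}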



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24233) union operation on read of dataframe does not produce correct result

2018-05-10 Thread smohr003 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

smohr003 updated SPARK-24233:
-
Description: 
I know that I can use wild card * to read all subfolders. But, I am trying to 
use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
 Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
 Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
 Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
 Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.\{FileSystem, Path}
 import java.net.URI
def readDir(path: String): DataFrame = {
  val fs = FileSystem.get(new URI(path), new Configuration())
  val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
  var df = spark.read.parquet(subDir.head)
  val dfSchema = df.schema
  subDir.tail.par.foreach(p => df =
    df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
      df.columns.tail: _*))
  df
}

val dfAll = readDir(absolutePath)
 dfAll.count

 The count of produced df is 4, which in this example should be 10. 

  was:
I know that I can use wild card * to read all subfolders. But, I am trying to 
use .par and .schema to speed up the read process. 

val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"

Seq((1, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "1")
Seq((11, "one"), (22, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "2")
Seq((111, "one"), (222, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "3")
Seq((, "one"), (, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "4")
Seq((2, "one"), (2, "two")).toDF("k", 
"v").write.mode("overwrite").parquet(absolutePath + "5")

 

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.\{FileSystem, Path}
import java.net.URI
def readDir(path: String): DataFrame = {
 val fs = FileSystem.get(new URI(path), new Configuration())
 val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
 var df = spark.read.parquet(subDir.head)
 val dfSchema = df.schema
 subDir.tail.par.foreach(p => df = 
df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head, 
df.columns.tail:_*))
 df
}
val dfAll = readDir(absolutePath)
dfAll.count

 


> union operation on read of dataframe does not produce correct result 
> -
>
> Key: SPARK-24233
> URL: https://issues.apache.org/jira/browse/SPARK-24233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: smohr003
>Priority: Major
>
> I know that I can use wild card * to read all subfolders. But, I am trying to 
> use .par and .schema to speed up the read process. 
> val absolutePath = "adl://datalakename.azuredatalakestore.net/testU/"
> Seq((1, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "1")
>  Seq((11, "one"), (22, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "2")
>  Seq((111, "one"), (222, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "3")
>  Seq((, "one"), (, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "4")
>  Seq((2, "one"), (2, "two")).toDF("k", 
> "v").write.mode("overwrite").parquet(absolutePath + "5")
>  
> import org.apache.hadoop.conf.Configuration
>  import org.apache.hadoop.fs.\{FileSystem, Path}
>  import java.net.URI
> def readDir(path: String): DataFrame = {
>   val fs = FileSystem.get(new URI(path), new Configuration())
>   val subDir = fs.listStatus(new Path(path)).map(i => i.getPath.toString)
>   var df = spark.read.parquet(subDir.head)
>   val dfSchema = df.schema
>   subDir.tail.par.foreach(p => df =
>     df.union(spark.read.schema(dfSchema).parquet(p)).select(df.columns.head,
>       df.columns.tail: _*))
>   df
> }
> val dfAll = readDir(absolutePath)
>  dfAll.count
>  The count of produced df is 4, which in this example should be 10. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470634#comment-16470634
 ] 

Dongjoon Hyun edited comment on SPARK-24204 at 5/10/18 4:28 PM:


Thank you for pinging me, [~maropu]. Could you make a PR with your patch?
We need a general patch for JSON/Parquet/ORC like CSV.


was (Author: dongjoon):
Thank you for pinging me, [~maropu]. Could you make a PR with your patch?

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> *SUMMARY*
> - CSV: Raising analysis exception.
> - JSON: dropping columns with null types
> - Parquet/ORC: raising runtime exceptions
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed;
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> is done for the CSV format;
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24204:
--
Description: 
*SUMMARY*
- CSV: Raising analysis exception.
- JSON: dropping columns with null types
- Parquet/ORC: raising runtime exceptions

The native ORC file format throws an exception with a meaningless message on 
the executor side when unsupported types are passed;
{code}

scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
null)))
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("b", NullType) :: Nil)
scala> val df = spark.createDataFrame(rdd, schema)
scala> df.write.orc("/tmp/orc")
java.lang.IllegalArgumentException: Can't parse category at 
'struct'
at 
org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
er.scala:226)
at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
at 
org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
(FileFormatWriter.scala:278)
{code}
It seems better to verify the write schema on the driver side for users, as is 
done for the CSV format;
https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65

  was:
The native ORC file format throws an exception with a meaningless message on 
the executor side when unsupported types are passed;
{code}

scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
null)))
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("b", NullType) :: Nil)
scala> val df = spark.createDataFrame(rdd, schema)
scala> df.write.orc("/tmp/orc")
java.lang.IllegalArgumentException: Can't parse category at 
'struct'
at 
org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
er.scala:226)
at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
at 
org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
(FileFormatWriter.scala:278)
{code}
It seems better to verify the write schema on the driver side for users, as is 
done for the CSV format;
https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65


> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>

[jira] [Updated] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24204:
--
Summary: Verify a write schema in Json/Orc/ParquetFileFormat  (was: Verify 
a write schema in Orc/ParquetFileFormat)

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed;
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> is done for the CSV format;
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24204) Verify a write schema in Orc/ParquetFileFormat

2018-05-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24204:
--
Summary: Verify a write schema in Orc/ParquetFileFormat  (was: Verify a 
write schema in OrcFileFormat)

> Verify a write schema in Orc/ParquetFileFormat
> --
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed;
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> is done for the CSV format;
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24204) Verify a write schema in OrcFileFormat

2018-05-10 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470634#comment-16470634
 ] 

Dongjoon Hyun commented on SPARK-24204:
---

Thank you for pinging me, [~maropu]. Could you make a PR with your patch?

> Verify a write schema in OrcFileFormat
> --
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed;
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> is done for the CSV format;
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24197) add array_sort function

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24197:


Assignee: Apache Spark

> add array_sort function
> ---
>
> Key: SPARK-24197
> URL: https://issues.apache.org/jira/browse/SPARK-24197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Apache Spark
>Priority: Major
>
> Add a SparkR equivalent function to 
> [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24197) add array_sort function

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470616#comment-16470616
 ] 

Apache Spark commented on SPARK-24197:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/21294

> add array_sort function
> ---
>
> Key: SPARK-24197
> URL: https://issues.apache.org/jira/browse/SPARK-24197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add a SparkR equivalent function to 
> [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].
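For reference, a minimal illustration of the semantics the SparkR wrapper would 
mirror, via the Scala/SQL function added by SPARK-23921 (assumes a spark-shell 
session on a build that already includes that change; the expected result is 
shown as a comment):

{code:scala}
// array_sort sorts in ascending order and, unlike sort_array, places null
// elements at the end of the returned array.
spark.sql("SELECT array_sort(array(3, 1, null, 2)) AS sorted").show(false)
// expected: [1, 2, 3, null]
{code}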



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24197) add array_sort function

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24197:


Assignee: (was: Apache Spark)

> add array_sort function
> ---
>
> Key: SPARK-24197
> URL: https://issues.apache.org/jira/browse/SPARK-24197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add a SparkR equivalent function to 
> [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-05-10 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470552#comment-16470552
 ] 

Jose Torres commented on SPARK-24036:
-

My concern isn't that we'll have to write more code, but that changing 
scheduler internals expands the surface area of interactions that need to be 
considered. For example, can we confidently enumerate all the ways in which the 
scheduler assumes a Dependency defines a stage boundary? If so, can we change 
all of them in a way that doesn't impact non-continuous-processing code at all? 
We'd have to consider a lot of questions like that, and I don't see any large 
benefit we'd get from doing so.

 

Glad to take a look at your preview PR.

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24237) continuous shuffle dependency

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470546#comment-16470546
 ] 

Apache Spark commented on SPARK-24237:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/21293

> continuous shuffle dependency
> -
>
> Key: SPARK-24237
> URL: https://issues.apache.org/jira/browse/SPARK-24237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> We might not need this to be an actual org.apache.spark.Dependency. We need 
> to somehow register with MapOutputTracker, or write our own custom tracker if 
> this ends up being infeasible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24237) continuous shuffle dependency

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24237:


Assignee: Apache Spark

> continuous shuffle dependency
> -
>
> Key: SPARK-24237
> URL: https://issues.apache.org/jira/browse/SPARK-24237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> We might not need this to be an actual org.apache.spark.Dependency. We need 
> to somehow register with MapOutputTracker, or write our own custom tracker if 
> this ends up being infeasible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24237) continuous shuffle dependency

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24237:


Assignee: (was: Apache Spark)

> continuous shuffle dependency
> -
>
> Key: SPARK-24237
> URL: https://issues.apache.org/jira/browse/SPARK-24237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> We might not need this to be an actual org.apache.spark.Dependency. We need 
> to somehow register with MapOutputTracker, or write our own custom tracker if 
> this ends up being infeasible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24227) Not able to submit spark job to kubernetes on 2.3

2018-05-10 Thread Felipe Cavalcanti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470493#comment-16470493
 ] 

Felipe Cavalcanti commented on SPARK-24227:
---

Solved it. spark-submit was using the basic-auth credentials from the current 
cluster set in ~/.kube/config. There were two problems:

1 - the current cluster is not the only one I have, and I was trying to submit 
the job to a different one;

2 - I do not use basic auth in my cluster (I have disabled it), so I had to 
delete the basic-auth info from the config file so that spark-submit would not 
send it.

After that it worked well.

> Not able to submit spark job to kubernetes on 2.3
> -
>
> Key: SPARK-24227
> URL: https://issues.apache.org/jira/browse/SPARK-24227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.3.0
>Reporter: Felipe Cavalcanti
>Priority: Major
>  Labels: kubernetes, spark
>
> Hi, I'm trying to submit a spark job to kubernetes with no success, I 
> followed the steps @ 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html] with no 
> success, when I run:
>  
> {code:java}
> bin/spark-submit \
>   --master k8s://https://${host}:${port} \
>   --deploy-mode cluster \ 
>   --name jaeger-spark \
>   --class io.jaegertracing.spark.dependencies.DependenciesSparkJob \
>   --conf spark.executor.instances=5 \
>   --conf spark.kubernetes.container.image=bla/jaeger-deps-spark:latest\
>   --conf spark.kubernetes.namespace=spark \
>   local:///opt/spark/jars/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar
> {code}
>  
> Im getting the following stack trace:
> {code:java}
> 2018-05-09 17:06:02 WARN WatchConnectionManager:192 - Exec Failure 
> javax.net.ssl.SSLHandshakeException: 
> sun.security.validator.ValidatorException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target at 
> sun.security.ssl.Alerts.getSSLException(Alerts.java:192) at 
> sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949) at 
> sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302) at 
> sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296) at 
> sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1514)
>  at 
> sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216) 
> at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026) at 
> sun.security.ssl.Handshaker.process_record(Handshaker.java:961) at 
> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062) at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
>  at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403) at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387) at 
> okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281)
>  at 
> okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251)
>  at 
> okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151) 
> at 
> okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
>  at 
> okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
>  at 
> okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
>  at 
> okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
>  at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>  at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>  at 
> okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) 
> at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>  at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>  at 
> okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) 
> at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>  at 
> okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
>  at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>  at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>  at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:90)
>  at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
>  at 
> okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
>  at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185) at 
> okhttp3.RealCall$AsyncCall.execute(RealCall.java:135) at 
> 

[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470361#comment-16470361
 ] 

Apache Spark commented on SPARK-24068:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21292

> CSV schema inferring doesn't work for compressed files
> --
>
> Key: SPARK-24068
> URL: https://issues.apache.org/jira/browse/SPARK-24068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Here is a simple csv file compressed by lzo
> {code}
> $ cat ./test.csv
> col1,col2
> a,1
> $ lzop ./test.csv
> $ ls
> test.csv test.csv.lzo
> {code}
> Reading test.csv.lzo with LZO codec (see 
> https://github.com/twitter/hadoop-lzo, for example):
> {code:scala}
> scala> val ds = spark.read.option("header", true).option("inferSchema", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
> ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
> scala> ds.printSchema
> root
>  |-- �LZO: string (nullable = true)
> scala> ds.show
> +-+
> |�LZO|
> +-+
> |a|
> +-+
> {code}
> but the file can be read if the schema is specified:
> {code}
> scala> import org.apache.spark.sql.types._
> scala> val schema = new StructType().add("col1", StringType).add("col2", 
> IntegerType)
> scala> val ds = spark.read.schema(schema).option("header", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
> scala> ds.show
> +++
> |col1|col2|
> +++
> |   a|   1|
> +++
> {code}
> Just in case, schema inferring works for the original uncompressed file:
> {code:scala}
> scala> spark.read.option("header", true).option("inferSchema", 
> true).csv("test.csv").printSchema
> root
>  |-- col1: string (nullable = true)
>  |-- col2: integer (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-10 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470333#comment-16470333
 ] 

Stavros Kontopoulos edited comment on SPARK-24232 at 5/10/18 12:43 PM:
---

Check also what we use on Mesos for naming conventions for env secrets: 
[https://spark.apache.org/docs/latest/running-on-mesos.html]

I will give it a shot; it seems straightforward to add. Is there any reason why 
this shouldn't be added?


was (Author: skonto):
Check also what we use on mesos for naming conventions for env secrets: 
[https://spark.apache.org/docs/latest/running-on-mesos.html]

I might work on this; it seems straightforward to add.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to Kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secrets without leaking them in 
> the code, and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates, etc. while 
> talking to other services (JDBC passwords, storage keys, etc.).
> So, at deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified, 
> which will make [EnvName].[key] available as an environment variable, and in 
> the code it is always referred to as the env variable [key].
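If this is added, the driver code would not need any Kubernetes-specific API to 
read the value; a minimal sketch, assuming the proposed conf is supplied at 
submit time (the secret name, key, and env name below are made-up examples):

{code:scala}
// Supplied at submit time, e.g.
//   --conf spark.kubernetes.driver.secretKeyRef.DB_PASSWORD=db-secret:password
// (hypothetical secret name/key). Inside the driver, the value then shows up
// as an ordinary environment variable.
val dbPassword: Option[String] = sys.env.get("DB_PASSWORD")
{code}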



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-05-10 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470339#comment-16470339
 ] 

Sahil Takiar commented on SPARK-24243:
--

[~vanzin] would adding something like this be possible? Perhaps we can add a 
{{getThrowable}} method to {{SparkAppHandle}}?

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Major
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread; any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object to the application rather 
> than logging and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} or control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.
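A sketch of how an application might consume such an accessor; {{getThrowable}} 
does not exist yet, so the call is only indicated in a comment, and the launcher 
settings below are placeholders:

{code:scala}
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

// Launch in-process and watch for failure; the commented-out line marks where
// the proposed accessor (name and signature assumed) would be used.
val handle: SparkAppHandle = new InProcessLauncher()
  .setMainClass("com.example.MyJob")   // hypothetical application class
  .setMaster("local[*]")
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit = {
      if (h.getState == SparkAppHandle.State.FAILED) {
        // Option(h.getThrowable).foreach(t => handleFailure(t))  // proposed API
      }
    }
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
{code}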



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11150) Dynamic partition pruning

2018-05-10 Thread tim geary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469551#comment-16469551
 ] 

tim geary edited comment on SPARK-11150 at 5/10/18 12:42 PM:
-

I have a customer that is asking about the status of this; it has been open for 
a couple of years.

Can I get an update on when this will be addressed?

Is partition pruning available?


was (Author: tge...@cloudera.com):
Ice/nyse is asking about the status of this; it has been open for a couple of years.

Can I get an update on when this will be addressed?

 

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Priority: Major
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select  from tab where partcol=1 will prune on value 1
> Select  from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are based on parquets.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-05-10 Thread Sahil Takiar (JIRA)
Sahil Takiar created SPARK-24243:


 Summary: Expose exceptions from InProcessAppHandle
 Key: SPARK-24243
 URL: https://issues.apache.org/jira/browse/SPARK-24243
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Sahil Takiar


{{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread; any 
exceptions thrown are logged and then the state is set to {{FAILED}}. It would 
be nice to expose the {{Throwable}} object to the application rather than 
logging and dropping it. Applications may want to manipulate the underlying 
{{Throwable}} or control its logging at a finer granularity. For example, the 
app might want to call {{Throwables.getRootCause(throwable).getMessage()}} and 
expose the message to the app users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24232) Allow referring to kubernetes secrets as env variable

2018-05-10 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470333#comment-16470333
 ] 

Stavros Kontopoulos commented on SPARK-24232:
-

Check also what we use on Mesos for naming conventions for env secrets: 
[https://spark.apache.org/docs/latest/running-on-mesos.html]

I might work on this; it seems straightforward to add.

> Allow referring to kubernetes secrets as env variable
> -
>
> Key: SPARK-24232
> URL: https://issues.apache.org/jira/browse/SPARK-24232
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> Allow referring to Kubernetes secrets in the driver process via environment 
> variables. This will allow developers to use secrets without leaking them in 
> the code, and at the same time secrets can be decoupled and managed 
> separately. This can be used to refer to passwords, certificates, etc. while 
> talking to other services (JDBC passwords, storage keys, etc.).
> So, at deployment time, something like 
> ``spark.kubernetes.driver.secretKeyRef.[EnvName]=`` can be specified, 
> which will make [EnvName].[key] available as an environment variable, and in 
> the code it is always referred to as the env variable [key].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread spark_user (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469858#comment-16469858
 ] 

spark_user edited comment on SPARK-24217 at 5/10/18 12:22 PM:
--

 
 
Hi Joseph K Bradley,

For the same input, spark.mllib gives a cluster id for all the vertices, while 
spark.ml does not; a spark.mllib reproduction is sketched below.

For eg., input in spark.ml:

    id   neighbor          similarity
    1    [2, 3, 4, 5]      [1.0, 1.0, 1.0, 1.0]
    6    [7, 8, 9, 10]     [1.0, 1.0, 1.0, 1.0]

Output in *spark.ml*:

    id   prediction
    1    0
    6    1

Input in spark.mllib:

    id   neighbor   similarity
    1    2          1.0
    1    3          1.0
    1    4          1.0
    1    5          1.0
    6    7          1.0
    6    8          1.0
    6    9          1.0
    6    10         1.0

Output in *spark.mllib*:

    id   prediction
    1    0
    2    0
    3    0
    4    0
    5    0
    6    1
    7    1
    8    1
    9    1
    10   1
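A minimal spark-shell reproduction of the spark.mllib behaviour above, using the 
pairwise form of the same affinity matrix (assumes an active {{sc}}):

{code:scala}
import org.apache.spark.mllib.clustering.PowerIterationClustering

// Pairwise edges for the example above: vertex 1 ~ {2, 3, 4, 5} and
// vertex 6 ~ {7, 8, 9, 10}, all with similarity 1.0.
val similarities = sc.parallelize(Seq(
  (1L, 2L, 1.0), (1L, 3L, 1.0), (1L, 4L, 1.0), (1L, 5L, 1.0),
  (6L, 7L, 1.0), (6L, 8L, 1.0), (6L, 9L, 1.0), (6L, 10L, 1.0)))

val model = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(10)
  .run(similarities)

// spark.mllib returns an assignment for every vertex, including the
// neighbour-only ids 2-5 and 7-10.
model.assignments.collect().sortBy(_.id).foreach(a => println(s"${a.id} -> ${a.cluster}"))
{code}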

 

 


was (Author: shahid):
Hi Joseph K Bradley,

For the same input in spark.ml and spark.mllib, spark.mllib giving cluster id 
for all the vertices.

For eg:

    id   neighbor          similarity
    1    [2, 3, 4, 5]      [1.0, 1.0, 1.0, 1.0]
    6    [7, 8, 9, 10]     [1.0, 1.0, 1.0, 1.0]

Output in *spark.ml*:

    id   prediction
    1    0
    6    1

Output in *spark.mllib*:

    id   prediction
    1    0
    2    0
    3    0
    4    0
    5    0
    6    1
    7    1
    8    1
    9    1
    10   1

 

 

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and id corresponding to all the nodes. 
> Currently, PIC does not return the cluster indices of neighbour IDs that are 
> not present in the ID column.
> As per the definition of PIC clustering, given in the code,
> PIC takes an affinity matrix between items (or vertices) as input. An 
> affinity matrix
>  is a symmetric matrix whose entries are non-negative similarities between 
> items.
>  PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each 
> input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between 
> the vertex
>  in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a 
> new column {{predictionCol}}
>  containing the cluster assignment in {{[0,k)}} for each row (vertex).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23907) Support regr_* functions

2018-05-10 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23907.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21054
[https://github.com/apache/spark/pull/21054]

> Support regr_* functions
> 
>
> Key: SPARK-23907
> URL: https://issues.apache.org/jira/browse/SPARK-23907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/HIVE-15978
> {noformat}
> Support the standard regr_* functions, regr_slope, regr_intercept, regr_r2, 
> regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. SQL reference 
> section 10.9
> {noformat}
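A minimal usage sketch against a session that includes the functions from the 
linked pull request (the table and column names below are made up):

{code:scala}
// y = 3x + 7, so regr_slope should be ~3.0 and regr_intercept ~7.0.
spark.range(0, 100)
  .selectExpr("CAST(id AS DOUBLE) AS x", "3 * id + 7 AS y")
  .createOrReplaceTempView("points")

spark.sql(
  """SELECT regr_slope(y, x)     AS slope,
    |       regr_intercept(y, x) AS intercept,
    |       regr_count(y, x)     AS n
    |  FROM points""".stripMargin).show()
{code}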



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23907) Support regr_* functions

2018-05-10 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23907:
-

Assignee: Marco Gaido

> Support regr_* functions
> 
>
> Key: SPARK-23907
> URL: https://issues.apache.org/jira/browse/SPARK-23907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-15978
> {noformat}
> Support the standard regr_* functions, regr_slope, regr_intercept, regr_r2, 
> regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. SQL reference 
> section 10.9
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24242) RangeExec should have correct outputOrdering

2018-05-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470201#comment-16470201
 ] 

Apache Spark commented on SPARK-24242:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21291

> RangeExec should have correct outputOrdering
> 
>
> Key: SPARK-24242
> URL: https://issues.apache.org/jira/browse/SPARK-24242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> The logical Range node was recently given an outputOrdering, which is used to 
> eliminate redundant Sorts during optimization. However, this outputOrdering 
> info does not propagate to the physical Range node. We should use the 
> outputOrdering from the logical Range node so that parent nodes of Range can 
> correctly know the output ordering.
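A rough sketch of the idea (illustrative only, not the actual RangeExec code): 
the physical node can report an ascending or descending ordering on its single 
output attribute based on the sign of the step, mirroring what the logical Range 
node already exposes.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, Descending, SortOrder}

// Derive the ordering a physical Range node could report to its parent:
// ascending for a positive step, descending for a negative one.
def rangeOutputOrdering(output: Seq[Attribute], step: Long): Seq[SortOrder] = {
  val direction = if (step > 0) Ascending else Descending
  Seq(SortOrder(output.head, direction))
}
{code}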



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24242) RangeExec should have correct outputOrdering

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24242:


Assignee: Apache Spark

> RangeExec should have correct outputOrdering
> 
>
> Key: SPARK-24242
> URL: https://issues.apache.org/jira/browse/SPARK-24242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> The logical Range node was recently given an outputOrdering, which is used to 
> eliminate redundant Sorts during optimization. However, this outputOrdering 
> info does not propagate to the physical Range node. We should use the 
> outputOrdering from the logical Range node so that parent nodes of Range can 
> correctly know the output ordering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24242) RangeExec should have correct outputOrdering

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24242:


Assignee: (was: Apache Spark)

> RangeExec should have correct outputOrdering
> 
>
> Key: SPARK-24242
> URL: https://issues.apache.org/jira/browse/SPARK-24242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> The logical Range node was recently given an outputOrdering, which is used to 
> eliminate redundant Sorts during optimization. However, this outputOrdering 
> info does not propagate to the physical Range node. We should use the 
> outputOrdering from the logical Range node so that parent nodes of Range can 
> correctly know the output ordering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24242) RangeExec should have correct outputOrdering

2018-05-10 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-24242:
---

 Summary: RangeExec should have correct outputOrdering
 Key: SPARK-24242
 URL: https://issues.apache.org/jira/browse/SPARK-24242
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


The logical Range node was recently given an outputOrdering, which is used to 
eliminate redundant Sorts during optimization. However, this outputOrdering info 
does not propagate to the physical Range node. We should use the outputOrdering 
from the logical Range node so that parent nodes of Range can correctly know the 
output ordering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24241) Do not fail fast when dynamic resource allocation enabled with 0 executor

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24241:


Assignee: (was: Apache Spark)

> Do not fail fast when dynamic resource allocation enabled with 0 executor
> -
>
> Key: SPARK-24241
> URL: https://issues.apache.org/jira/browse/SPARK-24241
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Kent Yao
>Priority: Minor
>
> {code:java}
> ~/spark-2.3.0-bin-hadoop2.7$ bin/spark-sql --num-executors 0 --conf 
> spark.dynamicAllocation.enabled=true
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=1024m; 
> support was removed in 8.0
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; 
> support was removed in 8.0
> Error: Number of executors must be a positive number
> Run with --help for usage help or --verbose for debug output
> {code}
> Actually, we could start up with a minimum executor number of 0 before 
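A sketch of the relaxed check (illustrative, not the actual SparkSubmitArguments 
code): only reject a non-positive --num-executors when dynamic allocation is 
disabled.

{code:scala}
// With dynamic allocation enabled, 0 is a legal starting point because
// spark.dynamicAllocation.minExecutors governs the lower bound instead.
def validateNumExecutors(numExecutors: Option[Int], dynamicAllocationEnabled: Boolean): Unit = {
  numExecutors.foreach { n =>
    if (n <= 0 && !dynamicAllocationEnabled) {
      throw new IllegalArgumentException("Number of executors must be a positive number")
    }
  }
}
{code}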



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24241) Do not fail fast when dynamic resource allocation enabled with 0 executor

2018-05-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24241:


Assignee: Apache Spark

> Do not fail fast when dynamic resource allocation enabled with 0 executor
> -
>
> Key: SPARK-24241
> URL: https://issues.apache.org/jira/browse/SPARK-24241
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> {code:java}
> ~/spark-2.3.0-bin-hadoop2.7$ bin/spark-sql --num-executors 0 --conf 
> spark.dynamicAllocation.enabled=true
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=1024m; 
> support was removed in 8.0
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; 
> support was removed in 8.0
> Error: Number of executors must be a positive number
> Run with --help for usage help or --verbose for debug output
> {code}
> Actually, we could start up with a minimum executor number of 0 before 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


