[jira] [Issue Comment Deleted] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahesh Ambule updated SPARK-21711:
--
Comment: was deleted

(was: Here, by spark client, I meant the java client process started by 
spark-submit. I want to provide Java options to this client process. 
I am talking about the java client which invokes the yarn client process and 
launches the driver and executor processes.)

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in the log4j.properties file in 
> Spark's 'conf' directory.
> The spark-submit command can override these log4j properties for the driver and 
> executors, but it cannot override them for the *spark client* application.
> The user should be able to pass log4j properties for the spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124457#comment-16124457
 ] 

Mahesh Ambule commented on SPARK-21711:
---

My Application ---> spark-submit ---> spark client ---> yarn client ---> 
driver ---> executors

I want to configure log4j Java options for the spark client and the yarn client, 
but spark-submit does not provide an option to configure them.


> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in the log4j.properties file in 
> Spark's 'conf' directory.
> The spark-submit command can override these log4j properties for the driver and 
> executors, but it cannot override them for the *spark client* application.
> The user should be able to pass log4j properties for the spark client using the 
> spark-submit command.






[jira] [Comment Edited] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-08-11 Thread duyanghao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124421#comment-16124421
 ] 

duyanghao edited comment on SPARK-18085 at 8/12/17 3:11 AM:


[~vanzin] Could you give a summary of your progress (solved problems and 
unsolved problems)? We all have the same problems with loading large event 
logs (100 GB+) and with loading the history summary page when the event log 
directory is large.

We would all like to know how well your solution works (of course, based on your 
test results) and what your plans are for completing it.


was (Author: duyanghao):
[~vanzin] Could you give a summary of your progress (solved problems and 
unsolved problems)? We all have the same problems with loading large event 
logs (100 GB+) and with loading the history summary page when the event log 
directory is large.

We would all like to know how well your solution works (of course, based on your 
test results).

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.






[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-08-11 Thread duyanghao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124421#comment-16124421
 ] 

duyanghao commented on SPARK-18085:
---

[~vanzin] Could you give a summary of your progress (solved problems and 
unsolved problems)? We all have the same problems with loading large event 
logs (100 GB+) and with loading the history summary page when the event log 
directory is large.

We would all like to know how well your solution works (of course, based on your 
test results).

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.






[jira] [Closed] (SPARK-21689) Spark submit will not get kerberos token when hbase class not found

2017-08-11 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang closed SPARK-21689.

Resolution: Won't Fix

> Spark submit will not get kerberos token when hbase class not found
> -
>
> Key: SPARK-21689
> URL: https://issues.apache.org/jira/browse/SPARK-21689
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
>
> When using yarn cluster mode and the job needs to scan hbase, there is a case 
> that does not work:
> If we put the user jar on hdfs, the local classpath has no hbase classes, so 
> fetching the hbase token fails. Later, when the job is submitted to yarn, it 
> fails because it has no token to access the hbase table. I mocked the following 
> cases:
> 1: user jar is on the classpath and includes hbase
> {code:java}
> 17/08/10 13:48:03 INFO security.HadoopFSDelegationTokenProvider: Renewal 
> interval is 86400050 for token HDFS_DELEGATION_TOKEN
> 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hive
> 17/08/10 13:48:03 INFO security.HadoopDelegationTokenManager: Service hbase
> 17/08/10 13:48:05 INFO security.HBaseDelegationTokenProvider: Attempting to 
> fetch HBase security token.
> {code}
> The logs show that the token is fetched normally.
> 2: user jar is on hdfs
> {code:java}
> 17/08/10 13:43:58 WARN security.HBaseDelegationTokenProvider: Class 
> org.apache.hadoop.hbase.HBaseConfiguration not found.
> 17/08/10 13:43:58 INFO security.HBaseDelegationTokenProvider: Failed to get 
> token from service hbase
> java.lang.ClassNotFoundException: 
> org.apache.hadoop.hbase.security.token.TokenUtil
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:41)
>   at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:112)
>   at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:109)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> {code}
> The logs show that fetching the token fails with ClassNotFoundException.
> If we download the user jar from the remote location first, things work 
> correctly.
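
To illustrate the mechanism behind case 2: the HBase token classes are resolved 
by name on the submitting host, so whether they are visible on the local 
classpath decides whether a token can be fetched at all. Below is a minimal 
Scala sketch of such a visibility check (illustrative only, not Spark's actual 
HBaseDelegationTokenProvider code):

{code}
object HBaseTokenClasspathCheck {
  // Returns true only if the HBase token classes are visible on the local classpath.
  def hbaseClassesPresent(): Boolean =
    try {
      Class.forName("org.apache.hadoop.hbase.HBaseConfiguration")
      Class.forName("org.apache.hadoop.hbase.security.token.TokenUtil")
      true
    } catch {
      case _: ClassNotFoundException =>
        // This is what happens in case 2 above: the user jar (and HBase) only exists
        // on HDFS, so nothing HBase-related is on the spark-submit classpath and no
        // delegation token is fetched before the job reaches YARN.
        false
    }

  def main(args: Array[String]): Unit =
    println(s"HBase token classes visible on the local classpath: ${hbaseClassesPresent()}")
}
{code}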






[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2017-08-11 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-18278:
---
Attachment: SPARK-18278 Spark on Kubernetes Design Proposal Revision 2 
(1).pdf

Revision 2 of the Design Proposal

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf, 
> SPARK-18278 Spark on Kubernetes Design Proposal Revision 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster. The submitted application runs 
> in a driver executing on a kubernetes pod, and the executors' lifecycles are 
> also managed as pods.






[jira] [Resolved] (SPARK-12559) Cluster mode doesn't work with --packages

2017-08-11 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12559.

   Resolution: Fixed
 Assignee: Stavros Kontopoulos
Fix Version/s: 2.3.0

> Cluster mode doesn't work with --packages
> -
>
> Key: SPARK-12559
> URL: https://issues.apache.org/jira/browse/SPARK-12559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Stavros Kontopoulos
> Fix For: 2.3.0
>
>
> From the mailing list:
> {quote}
> Another problem I ran into, which you also might, is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> {quote}
> The problem is that we currently don't upload jars to the cluster. It seems 
> that to fix this we should either (1) upload the jars, or (2) just run the 
> packages code on the driver side. I slightly prefer (2) because it's simpler.






[jira] [Commented] (SPARK-21710) ConsoleSink causes OOM crashes with large inputs.

2017-08-11 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124200#comment-16124200
 ] 

Shixiong Zhu commented on SPARK-21710:
--

`collect` is a workaround for https://issues.apache.org/jira/browse/SPARK-16264

> ConsoleSink causes OOM crashes with large inputs.
> -
>
> Key: SPARK-21710
> URL: https://issues.apache.org/jira/browse/SPARK-21710
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: affects all environments
>Reporter: Gerard Maas
>  Labels: easyfix
>
> ConsoleSink does a full collect of the streaming dataset in order to show a few 
> lines on screen. This is problematic with large inputs, such as a kafka backlog 
> or a file source with files larger than the driver's memory.
> Here's an example:
> {code:java}
> import spark.implicits._
> import org.apache.spark.sql.functions
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.types._
> val schema = StructType(StructField("text", StringType, true) :: Nil)
> val lines = spark
>   .readStream
>   .format("text")
>   .option("path", "/tmp/data")
>   .schema(schema)
>   .load()
> val base = lines.writeStream
>   .outputMode("append")
>   .format("console")
>   .start()
> {code}
> When a file larger than the available driver memory is fed through this 
> streaming job, we get:
> {code:java}
> ---
> Batch: 0
> ---
> [Stage 0:>(0 + 8) / 
> 111]17/08/11 15:10:45 ERROR Executor: Exception in task 6.0 in stage 0.0 (TID 
> 6)
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3236)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>   at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at 
> net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
>   at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> 17/08/11 15:10:45 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker for task 6,5,main]
> java.lang.OutOfMemoryError: Java heap space
> {code}
> This issue can be traced back to a `collect` on the source `DataFrame`:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala#L52
> A fairly simple solution would be to do a `take(numRows)` instead of the 
> collect. (PR in progress)
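
To make the suggested fix concrete, here is a rough Scala sketch of the 
take-based approach (illustrative only; the helper name and signature below are 
assumptions, not the actual ConsoleSink source):

{code}
import org.apache.spark.sql.DataFrame

// Illustrative helper: print a bounded preview of a streaming batch instead of
// collecting the entire batch onto the driver.
def showBatch(batchId: Long, data: DataFrame, numRows: Int = 20, truncate: Boolean = true): Unit = {
  println("-" * 45)
  println(s"Batch: $batchId")
  println("-" * 45)
  // collect() pulls every row of the batch into driver memory and can OOM;
  // take(numRows) only materializes the rows that will actually be printed.
  val preview = data.take(numRows)
  val bounded = data.sparkSession.createDataFrame(
    data.sparkSession.sparkContext.parallelize(preview.toSeq), data.schema)
  bounded.show(numRows, truncate)
}
{code}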






[jira] [Resolved] (SPARK-19122) Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order

2017-08-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19122.
-
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.3.0

> Unnecessary shuffle+sort added if join predicates ordering differ from 
> bucketing and sorting order
> --
>
> Key: SPARK-19122
> URL: https://issues.apache.org/jira/browse/SPARK-19122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3.0
>
>
> `table1` and `table2` are sorted and bucketed on columns `j` and `k` (in that 
> order).
> This is how they are generated:
> {code}
> val df = (0 until 16).map(i => (i % 8, i * 2, i.toString)).toDF("i", "j", 
> "k").coalesce(1)
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table1")
> df.write.format("org.apache.spark.sql.hive.orc.OrcFileFormat").bucketBy(8, 
> "j", "k").sortBy("j", "k").saveAsTable("table2")
> {code}
> Now, if the join predicates are specified in the query in the *same* order as 
> the bucketing and sort order, there is no shuffle and sort.
> {code}
> scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.j=b.j AND 
> a.k=b.k").explain(true)
> == Physical Plan ==
> *SortMergeJoin [j#61, k#62], [j#100, k#101], Inner
> :- *Project [i#60, j#61, k#62]
> :  +- *Filter (isnotnull(k#62) && isnotnull(j#61))
> : +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, Format: 
> ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct<i:int,j:int,k:string>
> +- *Project [i#99, j#100, k#101]
>+- *Filter (isnotnull(j#100) && isnotnull(k#101))
>   +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct<i:int,j:int,k:string>
> {code}
> The same query with join predicates in a *different* order from the bucketing 
> and sort order leads to an extra shuffle and sort being introduced:
> {code}
> scala> hc.sql("SET spark.sql.autoBroadcastJoinThreshold=1")
> scala> hc.sql("SELECT * FROM table1 a JOIN table2 b ON a.k=b.k AND a.j=b.j 
> ").explain(true)
> == Physical Plan ==
> *SortMergeJoin [k#62, j#61], [k#101, j#100], Inner
> :- *Sort [k#62 ASC NULLS FIRST, j#61 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(k#62, j#61, 200)
> : +- *Project [i#60, j#61, k#62]
> :+- *Filter (isnotnull(k#62) && isnotnull(j#61))
> :   +- *FileScan orc default.table1[i#60,j#61,k#62] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(k), IsNotNull(j)], ReadSchema: 
> struct<i:int,j:int,k:string>
> +- *Sort [k#101 ASC NULLS FIRST, j#100 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k#101, j#100, 200)
>   +- *Project [i#99, j#100, k#101]
>  +- *Filter (isnotnull(j#100) && isnotnull(k#101))
> +- *FileScan orc default.table2[i#99,j#100,k#101] Batched: false, 
> Format: ORC, Location: InMemoryFileIndex[file:/table2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(j), IsNotNull(k)], ReadSchema: 
> struct<i:int,j:int,k:string>
> {code}






[jira] [Resolved] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21698.
-
Resolution: Won't Fix

> write.partitionBy() is giving me garbage data
> -
>
> Key: SPARK-21698
> URL: https://issues.apache.org/jira/browse/SPARK-21698
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
> Environment: Linux Ubuntu 17.04.  Python 3.5.
>Reporter: Luis
>
> Spark partitionBy is causing some data corruption. I am doing three super 
> simple writes. Below is the code to reproduce the problem.
> {code:title=Program Output|borderStyle=solid}
> 17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
> /usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
> schema from dict is deprecated,please use pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> +---++-+  
>   
> | id|name|count|
> +---++-+
> |  1|   1|1|
> |  2|   2|2|
> |  3|   3|3|
> +---++-+
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> +---++-+
> | id|name|count|
> +---++-+
> |  1|   1|1|
> |  2|   2|2|
> |  3|   3|3|
> |  4|   4|4|
> |  5|   5|5|
> |  6|   6|6|
> +---++-+
> +---++-+
> | id|name|count|
> +---++-+
> |  9|   4| null|
> | 10|   6| null|
> |  7|   1| null|
> |  8|   2| null|
> |  1|   1|1|
> |  2|   2|2|
> |  3|   3|3|
> |  4|   4|4|
> |  5|   5|5|
> |  6|   6|6|
> +---++-+
> {code}
> In the last show(), I see that the data is null.
> {code:title=spark init|borderStyle=solid}
> self.spark = SparkSession \
> .builder \
> .master("spark://localhost:7077") \
> .enableHiveSupport() \
> .getOrCreate()
> {code}
> {code:title=Code for the test case|borderStyle=solid}
> def test_clean_insert_table(self):
> table_name = "data"
> data0 = [
> {"id": 1, "name":"1", "count": 1},
> {"id": 2, "name":"2", "count": 2},
> {"id": 3, "name":"3", "count": 3},
> ]
> df_data0 = self.spark.createDataFrame(data0)
> 
> df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> data1 = [
> {"id": 4, "name":"4", "count": 4},
> {"id": 5, "name":"5", "count": 5},
> {"id": 6, "name":"6", "count": 6},
> ]
> df_data1 = self.spark.createDataFrame(data1)
> df_data1.write.insertInto(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> data3 = [
> {"id": 1, "name":"one", "count":7},
> {"id": 2, "name":"two", "count": 8},
> {"id": 4, "name":"three", "count": 9},
> {"id": 6, "name":"six", "count":10}
> ]
> data3 = self.spark.createDataFrame(data3)
> data3.write.insertInto(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> {code}






[jira] [Commented] (SPARK-21698) write.partitionBy() is giving me garbage data

2017-08-11 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124161#comment-16124161
 ] 

Xiao Li commented on SPARK-21698:
-

{{insertInto}} is resolved by position. You can print out the schema of the 
DataFrame {{data3}} and the table schema. You will find that they do not match. 
Thus, the data are inserted into the wrong columns, which have different data 
types.

The simplest workaround is to use saveAsTable with append mode.
Change this line
{{data3.write.insertInto(table_name)}}
to
{{data3.write.partitionBy("count").mode("append").saveAsTable(table_name)}}
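
To make the position-versus-name behavior concrete, here is a small Scala sketch 
mirroring the PySpark test quoted below (the table and column names follow the 
report; the local, Hive-enabled session setup is an assumption for illustration):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").enableHiveSupport().getOrCreate()
import spark.implicits._

// saveAsTable moves the partition column to the end, so the stored schema is (id, name, count).
Seq((1, "1", 1L), (2, "2", 2L)).toDF("id", "name", "count")
  .write.partitionBy("count").mode("overwrite").saveAsTable("data")
spark.table("data").printSchema()

// insertInto resolves columns BY POSITION: a DataFrame ordered (count, id, name)
// is written into (id, name, count), so the string "one" is cast into the bigint
// count column and becomes null -- the "garbage" rows in the output above.
val data3 = Seq((7L, 1, "one"), (8L, 2, "two")).toDF("count", "id", "name")
// data3.write.insertInto("data")   // reproduces the nulls

// saveAsTable with append mode resolves columns BY NAME, which is the suggested workaround.
data3.write.partitionBy("count").mode("append").saveAsTable("data")
spark.table("data").show()
{code}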




> write.partitionBy() is giving me garbage data
> -
>
> Key: SPARK-21698
> URL: https://issues.apache.org/jira/browse/SPARK-21698
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
> Environment: Linux Ubuntu 17.04.  Python 3.5.
>Reporter: Luis
>
> Spark partitionBy is causing some data corruption. I am doing three super 
> simple writes. Below is the code to reproduce the problem.
> {code:title=Program Output|borderStyle=solid}
> 17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
> /usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring 
> schema from dict is deprecated,please use pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> +---++-+  
>   
> | id|name|count|
> +---++-+
> |  1|   1|1|
> |  2|   2|2|
> |  3|   3|3|
> +---++-+
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> +---++-+
> | id|name|count|
> +---++-+
> |  1|   1|1|
> |  2|   2|2|
> |  3|   3|3|
> |  4|   4|4|
> |  5|   5|5|
> |  6|   6|6|
> +---++-+
> +---++-+
> | id|name|count|
> +---++-+
> |  9|   4| null|
> | 10|   6| null|
> |  7|   1| null|
> |  8|   2| null|
> |  1|   1|1|
> |  2|   2|2|
> |  3|   3|3|
> |  4|   4|4|
> |  5|   5|5|
> |  6|   6|6|
> +---++-+
> {code}
> In the last show(), I see that the data is null.
> {code:title=spark init|borderStyle=solid}
> self.spark = SparkSession \
> .builder \
> .master("spark://localhost:7077") \
> .enableHiveSupport() \
> .getOrCreate()
> {code}
> {code:title=Code for the test case|borderStyle=solid}
> def test_clean_insert_table(self):
> table_name = "data"
> data0 = [
> {"id": 1, "name":"1", "count": 1},
> {"id": 2, "name":"2", "count": 2},
> {"id": 3, "name":"3", "count": 3},
> ]
> df_data0 = self.spark.createDataFrame(data0)
> 
> df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> data1 = [
> {"id": 4, "name":"4", "count": 4},
> {"id": 5, "name":"5", "count": 5},
> {"id": 6, "name":"6", "count": 6},
> ]
> df_data1 = self.spark.createDataFrame(data1)
> df_data1.write.insertInto(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> data3 = [
> {"id": 1, "name":"one", "count":7},
> {"id": 2, "name":"two", "count": 8},
> {"id": 4, "name":"three", "count": 9},
> {"id": 6, "name":"six", "count":10}
> ]
> data3 = self.spark.createDataFrame(data3)
> data3.write.insertInto(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> {code}






[jira] [Issue Comment Deleted] (SPARK-21200) Spark REST API is not working or Spark documentation is wrong.

2017-08-11 Thread Rahul Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul Gupta updated SPARK-21200:

Comment: was deleted

(was: sample of API working in spark 1.4)

> Spark REST API is not working or Spark documentation is wrong.
> --
>
> Key: SPARK-21200
> URL: https://issues.apache.org/jira/browse/SPARK-21200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Srinivasarao Daruna
>
> Unable to access the Spark REST API.
> I was able to access it as per the documentation in an older version of Spark, 
> but with Spark 2.1.1, when I tried to do the same, it did not work.
> Either there is a code bug or the documentation has to be updated.






[jira] [Updated] (SPARK-21200) Spark REST API is not working or Spark documentation is wrong.

2017-08-11 Thread Rahul Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul Gupta updated SPARK-21200:

Attachment: (was: Screen Shot 2017-08-11 at 2.29.38 PM.png)

> Spark REST API is not working or Spark documentation is wrong.
> --
>
> Key: SPARK-21200
> URL: https://issues.apache.org/jira/browse/SPARK-21200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Srinivasarao Daruna
>
> Unable to access the Spark REST API.
> I was able to access it as per the documentation in an older version of Spark, 
> but with Spark 2.1.1, when I tried to do the same, it did not work.
> Either there is a code bug or the documentation has to be updated.






[jira] [Updated] (SPARK-21200) Spark REST API is not working or Spark documentation is wrong.

2017-08-11 Thread Rahul Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul Gupta updated SPARK-21200:

Attachment: Screen Shot 2017-08-11 at 2.29.38 PM.png

sample of API working in spark 1.4

> Spark REST API is not working or Spark documentation is wrong.
> --
>
> Key: SPARK-21200
> URL: https://issues.apache.org/jira/browse/SPARK-21200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Srinivasarao Daruna
>
> Unable to access the Spark REST API.
> I was able to access it as per the documentation in an older version of Spark, 
> but with Spark 2.1.1, when I tried to do the same, it did not work.
> Either there is a code bug or the documentation has to be updated.






[jira] [Commented] (SPARK-21200) Spark REST API is not working or Spark documentation is wrong.

2017-08-11 Thread Rahul Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124116#comment-16124116
 ] 

Rahul Gupta commented on SPARK-21200:
-

[~sowen] I was facing a similar issue and wanted to know the best way to report 
it.
Tracing through the code: in Spark 1.6 the REST API was exposed via the Master UI 
(./core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala):
attachHandler(ApiRootResource.getServletHandler(this))

From Spark 2.0 onwards, I see this was removed and there is no way to access 
api/v1/applications from the master URL;
hitting the URL 
http://localhost:8080/api/v1/applications renders the same page as 
http://localhost:8080

Thanks,
Rahul
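
For reference, in Spark 2.x the api/v1 endpoints are served by the running 
application's UI (default port 4040) and by the history server (default port 
18080) rather than by the standalone Master UI. A minimal Scala sketch, assuming 
a running SparkSession with the UI enabled (the session setup here is 
illustrative):

{code}
import org.apache.spark.sql.SparkSession

// In Spark 2.x the REST API lives on the application UI (and the history server),
// not on the standalone Master UI at port 8080.
val spark = SparkSession.builder().master("local[*]").appName("rest-api-check").getOrCreate()

// uiWebUrl points at this application's UI, e.g. http://driver-host:4040
val uiBase = spark.sparkContext.uiWebUrl.getOrElse(sys.error("the Spark UI is disabled"))
val endpoint = s"$uiBase/api/v1/applications"

// Plain stdlib fetch to avoid extra dependencies.
val json = scala.io.Source.fromURL(endpoint).mkString
println(json)
{code}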

> Spark REST API is not working or Spark documentation is wrong.
> --
>
> Key: SPARK-21200
> URL: https://issues.apache.org/jira/browse/SPARK-21200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Srinivasarao Daruna
>
> Unable to access the Spark REST API.
> I was able to access it as per the documentation in an older version of Spark, 
> but with Spark 2.1.1, when I tried to do the same, it did not work.
> Either there is a code bug or the documentation has to be updated.






[jira] [Commented] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread PJ Fanning (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124012#comment-16124012
 ] 

PJ Fanning commented on SPARK-21708:


[~srowen] Your point about IDEs is valid. IntelliJ IDEA has support 
(https://blog.jetbrains.com/scala/2017/07/19/intellij-idea-scala-plugin-2017-2-sbt-1-0-improved-sbt-shell-play-2-6-and-better-implicits-management/)
 and hopefully the sbteclipse plugin for generating eclipse workspaces from sbt 
files will be updated soon. All in all, there are quite a number of sbt plugins 
to upgrade before the sbt version can be raised to 1.0.0.
So, by the time we are in a position to switch to 1.0.0, it should be easier 
for developers to adapt. 

> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14






[jira] [Updated] (SPARK-21715) History Server responds with history page html content multiple times for only one http request

2017-08-11 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21715:

Attachment: ResponseContent.png

> History Server responds with history page html content multiple times for 
> only one http request
> ---
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Performance.png, ResponseContent.png
>
>
> The UI looks fine for the home page. But when we checked the performance of 
> each individual component, we found three image download requests that take 
> much longer than expected: favicon.ico, sort_both.png, sort_desc.png.
> These are the request addresses: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, 
> http://hostname:port/images/sort_desc.png. Later, if the user clicks on the 
> table header to sort a column, another request for 
> http://hostname:port/images/sort_asc.png is sent.
> Browsers request favicon.ico by default, and all three sort_xxx.png images are 
> the default behavior of the dataTables jQuery plugin.
> The Spark history server starts several handlers to handle http requests, but 
> none of these requests is handled correctly; they all trigger the history 
> server to respond with the history page html content. As the screenshot shows, 
> the response content types are all "text/html".
> To solve this problem, we need to download the images directory from 
> https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images, 
> put the folder under "core/src/main/resources/org/apache/spark/ui/static/", and 
> modify dataTables.bootstrap.css to point at the correct image locations. For 
> the favicon.ico request, we need to add one line to the html header to disable 
> the download.
> I can post a pull request if this is the correct way to fix this; I have tried 
> it and it works fine.
> !https://issues.apache.org/jira/secure/attachment/12881534/Performance.png!






[jira] [Updated] (SPARK-21715) History Server responds with history page html content multiple times for only one http request

2017-08-11 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21715:

Description: 
The UI looks fine for the home page. But when we checked the performance of each 
individual component, we found three image download requests that take much 
longer than expected: favicon.ico, sort_both.png, sort_desc.png.
These are the request addresses: http://hostname:port/favicon.ico, 
http://hostname:port/images/sort_both.png, 
http://hostname:port/images/sort_desc.png. Later, if the user clicks on the table 
header to sort a column, another request for 
http://hostname:port/images/sort_asc.png is sent.

Browsers request favicon.ico by default, and all three sort_xxx.png images are 
the default behavior of the dataTables jQuery plugin.

The Spark history server starts several handlers to handle http requests, but 
none of these requests is handled correctly; they all trigger the history server 
to respond with the history page html content. As the screenshot shows, the 
response content types are all "text/html".

To solve this problem, we need to download the images directory from 
https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images, 
put the folder under "core/src/main/resources/org/apache/spark/ui/static/", and 
modify dataTables.bootstrap.css to point at the correct image locations. For the 
favicon.ico request, we need to add one line to the html header to disable the 
download.

I can post a pull request if this is the correct way to fix this; I have tried it 
and it works fine.
!https://issues.apache.org/jira/secure/attachment/12881534/Performance.png!
!https://issues.apache.org/jira/secure/attachment/12881535/ResponseContent.png!

  was:
The UI looks fine for the home page. But when we checked the performance of each 
individual component, we found three image download requests that take much 
longer than expected: favicon.ico, sort_both.png, sort_desc.png.
These are the request addresses: http://hostname:port/favicon.ico, 
http://hostname:port/images/sort_both.png, 
http://hostname:port/images/sort_desc.png. Later, if the user clicks on the table 
header to sort a column, another request for 
http://hostname:port/images/sort_asc.png is sent.

Browsers request favicon.ico by default, and all three sort_xxx.png images are 
the default behavior of the dataTables jQuery plugin.

The Spark history server starts several handlers to handle http requests, but 
none of these requests is handled correctly; they all trigger the history server 
to respond with the history page html content. As the screenshot shows, the 
response content types are all "text/html".

To solve this problem, we need to download the images directory from 
https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images, 
put the folder under "core/src/main/resources/org/apache/spark/ui/static/", and 
modify dataTables.bootstrap.css to point at the correct image locations. For the 
favicon.ico request, we need to add one line to the html header to disable the 
download.

I can post a pull request if this is the correct way to fix this; I have tried it 
and it works fine.
!https://issues.apache.org/jira/secure/attachment/12881534/Performance.png!


> History Server responds with history page html content multiple times for 
> only one http request
> ---
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Performance.png, ResponseContent.png
>
>
> UI looks fine for the home page. But we check the performance for each 
> individual components, we found that there are three picture downloading 
> requests which takes much longer time than expected: favicon.ico, 
> sort_both.png, sort_desc.png. 
> These are the list of the request address: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, 
> http://hostname:port/images/sort_desc.png. Later if user clicks on the head 
> of the table to sort the column, another request for 
> http://hostname:port/images/sort_asc.png will be sent.
> Browsers will send request for favicon.ico in default. And all these three 
> sort_xxx.png are the default behavior in dataTables jQuery plugin.
> Spark history server will start several handlers to handle http request. But 
> none of these requests are getting correctly handled and they are all 
> triggering the history 

[jira] [Updated] (SPARK-21715) History Server responds with history page html content multiple times for only one http request

2017-08-11 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21715:

Description: 
The UI looks fine for the home page. But when we checked the performance of each 
individual component, we found three image download requests that take much 
longer than expected: favicon.ico, sort_both.png, sort_desc.png.
These are the request addresses: http://hostname:port/favicon.ico, 
http://hostname:port/images/sort_both.png, 
http://hostname:port/images/sort_desc.png. Later, if the user clicks on the table 
header to sort a column, another request for 
http://hostname:port/images/sort_asc.png is sent.

Browsers request favicon.ico by default, and all three sort_xxx.png images are 
the default behavior of the dataTables jQuery plugin.

The Spark history server starts several handlers to handle http requests, but 
none of these requests is handled correctly; they all trigger the history server 
to respond with the history page html content. As the screenshot shows, the 
response content types are all "text/html".

To solve this problem, we need to download the images directory from 
https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images, 
put the folder under "core/src/main/resources/org/apache/spark/ui/static/", and 
modify dataTables.bootstrap.css to point at the correct image locations. For the 
favicon.ico request, we need to add one line to the html header to disable the 
download.

I can post a pull request if this is the correct way to fix this; I have tried it 
and it works fine.
!https://issues.apache.org/jira/secure/attachment/12881534/Performance.png!

  was:
The UI looks fine for the home page. But when we checked the performance of each 
individual component, we found three image download requests that take much 
longer than expected: favicon.ico, sort_both.png, sort_desc.png.
These are the request addresses: http://hostname:port/favicon.ico, 
http://hostname:port/images/sort_both.png, 
http://hostname:port/images/sort_desc.png. Later, if the user clicks on the table 
header to sort a column, another request for 
http://hostname:port/images/sort_asc.png is sent.

Browsers request favicon.ico by default, and all three sort_xxx.png images are 
the default behavior of the dataTables jQuery plugin.

The Spark history server starts several handlers to handle http requests, but 
none of these requests is handled correctly; they all trigger the history server 
to respond with the history page html content. As the screenshot shows, the 
response content types are all "text/html".

To solve this problem, we need to download the images directory from 
https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images, 
put the folder under "core/src/main/resources/org/apache/spark/ui/static/", and 
modify dataTables.bootstrap.css to point at the correct image locations. For the 
favicon.ico request, we need to add one line to the html header to disable the 
download.

I can post a pull request if this is the correct way to fix this; I have tried it 
and it works fine.



> History Server responds with history page html content multiple times for 
> only one http request
> ---
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Performance.png
>
>
> UI looks fine for the home page. But we check the performance for each 
> individual components, we found that there are three picture downloading 
> requests which takes much longer time than expected: favicon.ico, 
> sort_both.png, sort_desc.png. 
> These are the list of the request address: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, 
> http://hostname:port/images/sort_desc.png. Later if user clicks on the head 
> of the table to sort the column, another request for 
> http://hostname:port/images/sort_asc.png will be sent.
> Browsers will send request for favicon.ico in default. And all these three 
> sort_xxx.png are the default behavior in dataTables jQuery plugin.
> Spark history server will start several handlers to handle http request. But 
> none of these requests are getting correctly handled and they are all 
> triggering the history server to respond the history page html content. As we 
> can find from the screenshot, the response data type are all "text/html".
> To solve this problem, We need to download 

[jira] [Updated] (SPARK-21715) History Server responds with history page html content multiple times for only one http request

2017-08-11 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21715:

Description: 
The UI looks fine for the home page. But when we checked the performance of each 
individual component, we found three image download requests that take much 
longer than expected: favicon.ico, sort_both.png, sort_desc.png.
These are the request addresses: http://hostname:port/favicon.ico, 
http://hostname:port/images/sort_both.png, 
http://hostname:port/images/sort_desc.png. Later, if the user clicks on the table 
header to sort a column, another request for 
http://hostname:port/images/sort_asc.png is sent.

Browsers request favicon.ico by default, and all three sort_xxx.png images are 
the default behavior of the dataTables jQuery plugin.

The Spark history server starts several handlers to handle http requests, but 
none of these requests is handled correctly; they all trigger the history server 
to respond with the history page html content. As the screenshot shows, the 
response content types are all "text/html".

To solve this problem, we need to download the images directory from 
https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images, 
put the folder under "core/src/main/resources/org/apache/spark/ui/static/", and 
modify dataTables.bootstrap.css to point at the correct image locations. For the 
favicon.ico request, we need to add one line to the html header to disable the 
download.

I can post a pull request if this is the correct way to fix this; I have tried it 
and it works fine.


  was:
The UI looks fine for the home page. But when we checked the performance of each 
individual component, we found three image download requests that take much 
longer than expected: favicon.ico, sort_both.png, sort_desc.png.
These are the request addresses: http://hostname:port/favicon.ico, 
http://hostname:port/images/sort_both.png, 
http://hostname:port/images/sort_desc.png. Later, if the user clicks on the table 
header to sort a column, another request for 
http://hostname:port/images/sort_asc.png is sent.

Browsers request favicon.ico by default, and all three sort_xxx.png images are 
the default behavior of the dataTables jQuery plugin.

The Spark history server starts several handlers to handle http requests, but 
none of these requests is handled correctly; they all trigger the history server 
to respond with the history page html content. As the screenshot shows, the 
response content types are all "text/html".

To solve this problem, we need to download the images directory from 
https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images, 
put the folder under "core/src/main/resources/org/apache/spark/ui/static/", and 
modify dataTables.bootstrap.css to point at the correct image locations. For the 
favicon.ico request, we need to add one line to the html header to disable the 
download.

I can post a pull request if this is the correct way to fix this; I have tried it 
and it works fine.


> History Server responds with history page html content multiple times for 
> only one http request
> ---
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Performance.png
>
>
> UI looks fine for the home page. But we check the performance for each 
> individual components, we found that there are three picture downloading 
> requests which takes much longer time than expected: favicon.ico, 
> sort_both.png, sort_desc.png. 
> These are the list of the request address: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, 
> http://hostname:port/images/sort_desc.png. Later if user clicks on the head 
> of the table to sort the column, another request for 
> http://hostname:port/images/sort_asc.png will be sent.
> Browsers will send request for favicon.ico in default. And all these three 
> sort_xxx.png are the default behavior in dataTables jQuery plugin.
> Spark history server will start several handlers to handle http request. But 
> none of these requests are getting correctly handled and they are all 
> triggering the history server to respond the history page html content. As we 
> can find from the screenshot, the response data type are all "text/html".
> To solve this problem, We need to download those images dir from here: 
> 

[jira] [Comment Edited] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123967#comment-16123967
 ] 

Mahesh Ambule edited comment on SPARK-21711 at 8/11/17 8:18 PM:


Here, by spark client, I meant the java client process started by spark-submit. I 
want to provide Java options to this client process. 
I am talking about the java client which invokes the yarn client process and 
launches the driver and executor processes.


was (Author: mahesh_ambule):
Here, by spark client, I meant the java client process started by spark-submit. I 
want to provide Java options to that process. 
I am talking about the java client which invokes the yarn client process and 
launches the driver and executor processes.

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in the log4j.properties file in 
> Spark's 'conf' directory.
> The spark-submit command can override these log4j properties for the driver and 
> executors, but it cannot override them for the *spark client* application.
> The user should be able to pass log4j properties for the spark client using the 
> spark-submit command.






[jira] [Updated] (SPARK-21715) History Server responds with history page html content multiple times for only one http request

2017-08-11 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21715:

Attachment: Performance.png

> History Server responds with history page html content multiple times for 
> only one http request
> ---
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Performance.png
>
>
> The UI looks fine for the home page. But when we checked the performance of 
> each individual component, we found three image download requests that take 
> much longer than expected: favicon.ico, sort_both.png, sort_desc.png.
> These are the request addresses: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, 
> http://hostname:port/images/sort_desc.png. Later, if the user clicks on the 
> table header to sort a column, another request for 
> http://hostname:port/images/sort_asc.png is sent.
> Browsers request favicon.ico by default, and all three sort_xxx.png images are 
> the default behavior of the dataTables jQuery plugin.
> The Spark history server starts several handlers to handle http requests, but 
> none of these requests is handled correctly; they all trigger the history 
> server to respond with the history page html content. As the screenshot shows, 
> the response content types are all "text/html".
> To solve this problem, we need to download the images directory from 
> https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images, 
> put the folder under "core/src/main/resources/org/apache/spark/ui/static/", and 
> modify dataTables.bootstrap.css to point at the correct image locations. For 
> the favicon.ico request, we need to add one line to the html header to disable 
> the download.
> I can post a pull request if this is the correct way to fix this; I have tried 
> it and it works fine.






[jira] [Created] (SPARK-21715) History Server responds with history page html content multiple times for only one http request

2017-08-11 Thread Ye Zhou (JIRA)
Ye Zhou created SPARK-21715:
---

 Summary: History Server responds with history page html content 
multiple times for only one http request
 Key: SPARK-21715
 URL: https://issues.apache.org/jira/browse/SPARK-21715
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.0, 2.3.0
Reporter: Ye Zhou
Priority: Minor
 Fix For: 2.3.0


The UI looks fine for the home page. But when we checked the performance of 
each individual component, we found three image download requests that take 
much longer than expected: favicon.ico, sort_both.png, sort_desc.png. 
These are the request addresses: http://hostname:port/favicon.ico, 
http://hostname:port/images/sort_both.png, 
http://hostname:port/images/sort_desc.png. Later, if the user clicks on the 
head of the table to sort a column, another request for 
http://hostname:port/images/sort_asc.png is sent.

Browsers request favicon.ico by default, and all three sort_xxx.png images are 
part of the default behavior of the dataTables jQuery plugin.

The Spark history server starts several handlers to serve http requests, but 
none of these requests are handled correctly; they all trigger the history 
server to respond with the history page html content. As the screenshot shows, 
the response content type is "text/html" in every case.

To solve this problem, we need to download the images directory from 
https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images 
and put it under "core/src/main/resources/org/apache/spark/ui/static/". We also 
need to modify dataTables.bootstrap.css so that it points to the correct images 
location. For the favicon.ico request, we need to add one line to the html 
header to disable the download. 

I can post a pull request if this is the correct way to fix this. I have tried 
it and it works fine.
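
A quick way to confirm the diagnosis above is to look at the Content-Type the 
history server returns for those image URLs. The sketch below is illustrative 
only: it assumes the requests library and a history server reachable at a 
placeholder host and port.

{code:python}
# Check whether the image requests described above really come back as HTML.
# HOST_PORT is a placeholder for your history server address.
import requests

HOST_PORT = "http://hostname:18080"

for path in ("/favicon.ico",
             "/images/sort_both.png",
             "/images/sort_desc.png"):
    resp = requests.get(HOST_PORT + path)
    # A correctly served image would report image/x-icon or image/png;
    # the bug shows up as "text/html" (the history page itself).
    print(path, resp.status_code, resp.headers.get("Content-Type"))
{code}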



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-08-11 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option for the JDBC data source, tentatively called 
"sessionInitStatement", to implement the session initialization functionality 
present, for example, in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements).
After each database session is opened to the remote DB, and before starting to 
read data, this option executes a custom SQL statement (or a PL/SQL block in 
the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name")
   .option("driver", "oracle.jdbc.driver.OracleDriver")
   .option("dbtable", "(select 1, sysdate, systimestamp, 
current_timestamp, localtimestamp from dual)")
   .option("user", "MYUSER")
   .option("password", "MYPASSWORD").option("fetchsize",1000)
   .option("sessionInitStatement", preambleSQL)
   .load()

df.show(5,false)
{code}

*Comments:* This proposal has been developed and tested for connecting the 
Spark JDBC data source to Oracle databases; however, I believe it can be useful 
for other target DBs too, as it is quite generic.
The code executed by the option "sessionInitStatement" is just the 
user-provided string fed through the execute method of the JDBC connection, so 
it can use the features of the target database's language/syntax. When using 
sessionInitStatement to query Oracle, for example, the user-provided command 
can be a SQL statement or a PL/SQL block grouping multiple commands and logic.
Note that the proposed code allows injecting SQL into the target database. This 
is not a security concern as such, since it requires password authentication; 
however, be aware of the possibilities that injecting user-provided SQL (and 
PL/SQL) opens.

  was:
This proposes an option to the JDBC datasource, tentatively called 
"sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements)
 . After each database session is opened to the remote DB, and before starting 
to read data, this option executes a custom SQL statement (or a PL/SQL block in 
the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read
   .format("jdbc")
   .option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name")
   .option("driver", "oracle.jdbc.driver.OracleDriver")
   .option("dbtable", "(select 1, sysdate, systimestamp, 
current_timestamp, localtimestamp from dual)")
   .option("user", "MYUSER")
   .option("password", "MYPASSWORD").option("fetchsize",1000)
   .option("sessionInitStatement", preambleSQL)
   .load()

df.show(5,false)
{code}

*Comments:* This proposal has been developed and tested for connecting the 
Spark JDBC data source to Oracle databases, however I believe it can be useful 
for other target DBs too, as it is quite generic.   
The code executed by the option "sessionInitStatement" is just the 
user-provided string fed through the execute method of the JDBC connection, so 
it can use the features of the target database language/syntax. When using 
sessionInitStatement for querying Oracle, for example, the user-provided 
command can be a SQL statement or a PL/SQL block grouping multiple commands and 
logic.   
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities of injecting user-provided SQL (and PL/SQL) that 
this opens.  


> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New 

[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-08-11 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option for the JDBC data source, tentatively called 
"sessionInitStatement", to implement the session initialization functionality 
present, for example, in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements).
After each database session is opened to the remote DB, and before starting to 
read data, this option executes a custom SQL statement (or a PL/SQL block in 
the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read
   .format("jdbc")
   .option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name")
   .option("driver", "oracle.jdbc.driver.OracleDriver")
   .option("dbtable", "(select 1, sysdate, systimestamp, 
current_timestamp, localtimestamp from dual)")
   .option("user", "MYUSER")
   .option("password", "MYPASSWORD").option("fetchsize",1000)
   .option("sessionInitStatement", preambleSQL)
   .load()

df.show(5,false)
{code}

*Comments:* This proposal has been developed and tested for connecting the 
Spark JDBC data source to Oracle databases; however, I believe it can be useful 
for other target DBs too, as it is quite generic.
The code executed by the option "sessionInitStatement" is just the 
user-provided string fed through the execute method of the JDBC connection, so 
it can use the features of the target database's language/syntax. When using 
sessionInitStatement to query Oracle, for example, the user-provided command 
can be a SQL statement or a PL/SQL block grouping multiple commands and logic.
Note that the proposed code allows injecting SQL into the target database. This 
is not a security concern as such, since it requires password authentication; 
however, be aware of the possibilities that injecting user-provided SQL (and 
PL/SQL) opens.

  was:
This proposes an option to the JDBC datasource, tentatively called 
"sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements)
 . After each database session is opened to the remote DB, and before starting 
to read data, this option executes a custom SQL statement (or a PL/SQL block in 
the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases, however I believe it can be useful for 
other target DBs too, as it is quite generic.  
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities of injecting user-provided SQL (and PL/SQL) that 
this opens.



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 2.3.0
>
>
> This proposes an option to the JDBC datasource, tentatively called 
> "sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> 

[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123967#comment-16123967
 ] 

Mahesh Ambule commented on SPARK-21711:
---

Here by spark client, I meant java client process started by spark-submit. I 
want to provide Java options to that process. 
I am talking about java client which invokes yarn client process and launches 
the driver and executor processes.

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-08-11 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option for the JDBC data source, tentatively called 
"sessionInitStatement", to implement the session initialization functionality 
present, for example, in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements).
After each database session is opened to the remote DB, and before starting to 
read data, this option executes a custom SQL statement (or a PL/SQL block in 
the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases; however, I believe it can be useful for 
other target DBs too, as it is quite generic.
Note that the proposed code allows injecting SQL into the target database. This 
is not a security concern as such, since it requires password authentication; 
however, be aware of the possibilities that injecting user-provided SQL (and 
PL/SQL) opens.


  was:
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell –jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases, however I believe it can be useful for 
other target DBs too, as it is quite generic.  
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities of injecting user-provided SQL (and PL/SQL) that 
this opens.



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 2.3.0
>
>
> This proposes an option to the JDBC datasource, tentatively called 
> "sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements)
>  . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {code}
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set 

[jira] [Resolved] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

2017-08-11 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-21595.
---
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.3.0
   2.2.1

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 
> breaks existing workflow
> -
>
> Key: SPARK-21595
> URL: https://issues.apache.org/jira/browse/SPARK-21595
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.2.0
> Environment: pyspark on linux
>Reporter: Stephan Reiling
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: documentation, regression
> Fix For: 2.2.1, 2.3.0
>
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
> 'association_idx',
> sqlf.row_number().over(
> Window.orderBy('uid1', 'uid2')
> )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates 
> one large window for the whole dataframe to sort over.
> In Spark 2.1 this works without problems; in Spark 2.2 it fails with either an 
> out-of-memory exception or a too-many-open-files exception, depending on 
> memory settings (which is what I tried tuning first to fix this).
> Monitoring the blockmgr, I see that Spark 2.1 creates 152 files, while Spark 
> 2.2 creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I 
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
> threshold of 4096 rows, switching to 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> This allowed me to track down the issue. 
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold   2097152
> {code}
> I got it to work again with the same performance as Spark 2.1.
> I have workflows that use windowing functions which do not fail, but they took 
> a performance hit due to the excessive spilling when using the default of 4096.
> I think that, to make it easier to track down these issues, this config 
> variable should be included in the configuration documentation. 
> Maybe 4096 is too small a default value?
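
For reference, here is a minimal PySpark sketch of the workaround described 
above, assuming spark.sql.windowExec.buffer.spill.threshold can be set like any 
other SQL conf when the session is built; the threshold value and the toy 
DataFrame are illustrative only.

{code:python}
# Raise the window spill threshold before running the global-window query.
from pyspark.sql import SparkSession
from pyspark.sql import functions as sqlf
from pyspark.sql.window import Window

spark = (SparkSession.builder
         .appName("window-spill-threshold-demo")
         .config("spark.sql.windowExec.buffer.spill.threshold", "2097152")
         .getOrCreate())

# Toy stand-in for the long, skinny dataframe from the report.
df = (spark.range(1000)
      .withColumnRenamed("id", "uid1")
      .withColumn("uid2", sqlf.col("uid1") % 7))

# Same pattern as in the report: one window over the whole dataframe.
df = df.withColumn(
    "association_idx",
    sqlf.row_number().over(Window.orderBy("uid1", "uid2"))
)
df.show(5)
{code}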



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123957#comment-16123957
 ] 

Sean Owen commented on SPARK-21711:
---

Oh, well that's your own application. You configure your own logging however 
you want. I didn't think that's what you're asking, because it's not a Spark 
issue.

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123941#comment-16123941
 ] 

Mahesh Ambule commented on SPARK-21711:
---

I don't want to configure executor or driver Java options. I want to configure 
the Spark client Java options.

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123911#comment-16123911
 ] 

Sean Owen commented on SPARK-21711:
---

It does, I'm referring to the spark.executor.extraJavaOptions config parameter.

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123889#comment-16123889
 ] 

Mahesh Ambule edited comment on SPARK-21711 at 8/11/17 7:36 PM:


@Sean Owen: Thanks for the reply. I tried to pass '-Dlog4j.configuration=' 
param to spark-submit command
but it did not configure spark client logging. Log4j configuration parameters 
passed through spark-submit 
command are getting configured for driver and executor JVMs but not for the 
spark client JVM. Here I am 
more interested in spark client application logs. For more clarity, please find 
attached the sample spark client logs file.

I went through spark-submit and related script files and found that 
spark-submit does not provide the option
to pass parameters to client application JVM. Below is the code snippet from 
'$spark_home/bin/spark-class' file.
This 'spark-class' file gets invoked by spark-submit.
_
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"_

Here $RUNNER is 'java' command and '$@' are arguments passed to spark-submit. 
There is no way to pass parameters to this java JVM.

Is my understanding correct? And can the option be provided to pass parameters 
to client Java JVM?



was (Author: mahesh_ambule):
Sean Owen: Thanks for the reply. I tried to pass '-Dlog4j.configuration=' param 
to spark-submit command
but it did not configure spark client logging. Log4j configuration parameters 
passed through spark-submit 
command are getting configured for driver and executor JVMs but not for the 
spark client JVM. Here I am 
more interested in spark client application logs. For more clarity, please find 
attached the sample spark client logs file.

I went through spark-submit and related script files and found that 
spark-submit does not provide the option
to pass parameters to client application JVM. Below is the code snippet from 
'$spark_home/bin/spark-class' file.
This 'spark-class' file gets invoked by spark-submit.
_
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"_

Here $RUNNER is 'java' command and '$@' are arguments passed to spark-submit. 
There is no way to pass parameters to this java JVM.

Is my understanding correct? And can the option be provided to pass parameters 
to client Java JVM?


> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123889#comment-16123889
 ] 

Mahesh Ambule edited comment on SPARK-21711 at 8/11/17 7:27 PM:


Sean Owen: Thanks for the reply. I tried to pass '-Dlog4j.configuration=' param 
to spark-submit command
but it did not configure spark client logging. Log4j configuration parameters 
passed through spark-submit 
command are getting configured for driver and executor JVMs but not for the 
spark client JVM. Here I am 
more interested in spark client application logs. For more clarity, please find 
attached the sample spark client logs file.

I went through spark-submit and related script files and found that 
spark-submit does not provide the option
to pass parameters to client application JVM. Below is the code snippet from 
'$spark_home/bin/spark-class' file.
This 'spark-class' file gets invoked by spark-submit.
_
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"_

Here $RUNNER is 'java' command and '$@' are arguments passed to spark-submit. 
There is no way to pass parameters to this java JVM.

Is my understanding correct? And can the option be provided to pass parameters 
to client Java JVM?



was (Author: mahesh_ambule):
Sean Owen: Thanks for the reply. I tried to pass '-Dlog4j.configuration=' param 
to spark-submit command
but it did not configure spark client logging. Log4j configuration parameters 
passed through spark-submit 
command are getting configured for driver and executor JVMs but not for the 
spark client JVM. Here I am 
more interested in spark client application logs. For more clarity, please find 
attached the sample spark client logs file.

I went through spark-submit and related script files and found that 
spark-submit does not provide the option
to pass parameters to client application JVM. Below is the code snippet from 
'$spark_home/bin/spark-class' file.
This 'spark-class' file gets invoked by spark-submit.
_
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"_

Here $RUNNER is 'java' command and '$@' are arguments passed to spark-submit. 
There is no way to pass parameters to client java JVM.

Is my understanding correct? And can the option be provided to pass parameters 
to client Java JVM?


> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123889#comment-16123889
 ] 

Mahesh Ambule edited comment on SPARK-21711 at 8/11/17 7:22 PM:


Sean Owen: Thanks for the reply. I tried to pass '-Dlog4j.configuration=' param 
to spark-submit command
but it did not configure spark client logging. Log4j configuration parameters 
passed through spark-submit 
command are getting configured for driver and executor JVMs but not for the 
spark client JVM. Here I am 
more interested in spark client application logs. For more clarity, please find 
attached the sample spark client logs file.

I went through spark-submit and related script files and found that 
spark-submit does not provide the option
to pass parameters to client application JVM. Below is the code snippet from 
'$spark_home/bin/spark-class' file.
This 'spark-class' file gets invoked by spark-submit.
_
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"_

Here $RUNNER is 'java' command and '$@' are arguments passed to spark-submit. 
There is no way to pass parameters to client java JVM.

Is my understanding correct? And can the option be provided to pass parameters 
to client Java JVM?



was (Author: mahesh_ambule):
sean owen: Thanks for the reply. I tried to pass '-Dlog4j.configuration=' param 
to spark-submit command
but it did not configure spark client logging. Log4j configuration parameters 
passed through spark-submit 
command are getting configured for driver and executor JVMs but not for the 
spark client JVM. Here I am 
more interested in spark client application logs. For more clarity, please find 
attached the sample spark client logs file.

I went through spark-submit and related script files and found that 
spark-submit does not provide the option
to pass parameters to client application JVM. Below is the code snippet from 
'$spark_home/bin/spark-class' file.
This 'spark-class' file gets invoked by spark-submit.
_
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"_

Here $RUNNER is 'java' command and '$@' are arguments passed to spark-submit. 
There is no way to pass parameters to client java JVM.

Is my understanding correct? And can the option be provided to pass parameters 
to client Java JVM?


> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahesh Ambule updated SPARK-21711:
--
Attachment: spark-submit client logs.txt

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123889#comment-16123889
 ] 

Mahesh Ambule commented on SPARK-21711:
---

sean owen: Thanks for the reply. I tried to pass '-Dlog4j.configuration=' param 
to spark-submit command
but it did not configure spark client logging. Log4j configuration parameters 
passed through spark-submit 
command are getting configured for driver and executor JVMs but not for the 
spark client JVM. Here I am 
more interested in spark client application logs. For more clarity, please find 
attached the sample spark client logs file.

I went through spark-submit and related script files and found that 
spark-submit does not provide the option
to pass parameters to client application JVM. Below is the code snippet from 
'$spark_home/bin/spark-class' file.
This 'spark-class' file gets invoked by spark-submit.
_
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"_

Here $RUNNER is 'java' command and '$@' are arguments passed to spark-submit. 
There is no way to pass parameters to client java JVM.

Is my understanding correct? And can the option be provided to pass parameters 
to client Java JVM?
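
One possible workaround, noted here as an assumption to verify against your 
Spark version rather than a documented feature: the launcher scripts read the 
SPARK_SUBMIT_OPTS environment variable when building the client java command, 
so JVM options such as -Dlog4j.configuration can be passed to the client 
process that way. A small illustrative sketch (the log4j file path and the 
application jar are placeholders):

{code:python}
# Launch spark-submit with extra JVM options for the client process,
# passed through the SPARK_SUBMIT_OPTS environment variable.
import os
import subprocess

env = dict(os.environ)
env["SPARK_SUBMIT_OPTS"] = (
    "-Dlog4j.configuration=file:/path/to/client-log4j.properties"
)

subprocess.run(
    ["spark-submit",
     "--master", "yarn",
     "--deploy-mode", "cluster",
     "/path/to/my-app.jar"],
    env=env,
    check=True,
)
{code}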


> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-08-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21519.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 2.3.0
>
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {code}
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell –jars ojdb6.jar
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
> "oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
> systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
> "MYUSER").option("password", 
> "MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
> preambleSQL).load()
> df.show(5,false)
> {code}
> Comments: This proposal has been developed and tested for connecting the 
> Spark JDBC data source to Oracle databases, however I believe it can be 
> useful for other target DBs too, as it is quite generic.  
> Note the proposed code allows to inject SQL into the target database. This is 
> not a security concern as such, as it requires password authentication, 
> however beware of the possibilities of injecting user-provided SQL (and 
> PL/SQL) that this opens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-08-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-21519:
---

Assignee: Luca Canali

> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 2.3.0
>
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {code}
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell –jars ojdb6.jar
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
> "oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
> systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
> "MYUSER").option("password", 
> "MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
> preambleSQL).load()
> df.show(5,false)
> {code}
> Comments: This proposal has been developed and tested for connecting the 
> Spark JDBC data source to Oracle databases, however I believe it can be 
> useful for other target DBs too, as it is quite generic.  
> Note the proposed code allows to inject SQL into the target database. This is 
> not a security concern as such, as it requires password authentication, 
> however beware of the possibilities of injecting user-provided SQL (and 
> PL/SQL) that this opens.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again

2017-08-11 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-21714:
-

 Summary: SparkSubmit in Yarn Client mode downloads remote files 
and then reuploads them again
 Key: SPARK-21714
 URL: https://issues.apache.org/jira/browse/SPARK-21714
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.2.0
Reporter: Thomas Graves
Priority: Critical


SPARK-10643 added the ability for spark-submit to download remote files in 
client mode.

However, in yarn mode this introduced a bug: the files are downloaded for the 
client, but the yarn client then just reuploads them to HDFS and uses them 
again. This should not happen when the remote file is already on HDFS. It 
wastes resources and defeats the distributed cache, because if the original 
object was public it would have been shared by many users; by downloading and 
reuploading it, we make it private.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2017-08-11 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123803#comment-16123803
 ] 

Valeriy Avanesov commented on SPARK-5564:
-

I am considering working on this issue. The question is whether there should be 
a separate EMLDAOptimizerVorontsov or whether the existing EMLDAOptimizer 
should be rewritten.



> Support sparse LDA solutions
> 
>
> Key: SPARK-5564
> URL: https://issues.apache.org/jira/browse/SPARK-5564
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
> concentration parameters be > 1.0.  It should support values > 0.0, which 
> should encourage sparser topics (phi) and document-topic distributions 
> (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov 
> and Potapenko. "Tutorial on Probabilistic Topic Modeling : Additive 
> Regularization for Stochastic Matrix Factorization." 2014.
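
As background on why concentration parameters below 1.0 encourage sparsity: the 
Dirichlet prior over a topic distribution theta has density

{noformat}
p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)}
                        \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}
{noformat}

For alpha_k > 1 the exponents are positive and the mass concentrates in the 
interior of the simplex; for 0 < alpha_k < 1 the exponents are negative, so the 
density grows as individual theta_k approach zero and most of the mass sits 
near the sparse corners of the simplex, which is the regime this issue asks to 
support.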



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123770#comment-16123770
 ] 

Sean Owen commented on SPARK-21711:
---

How about configuring the log4j config with -Dlog4j.configuration=... as a 
command line arg to the JVM? or packaging it in your app? I thought it picked 
up what you package in your application.

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21713) Replace LogicalPlan.isStreaming with OutputMode

2017-08-11 Thread Jose Torres (JIRA)
Jose Torres created SPARK-21713:
---

 Summary: Replace LogicalPlan.isStreaming with OutputMode
 Key: SPARK-21713
 URL: https://issues.apache.org/jira/browse/SPARK-21713
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Jose Torres


The isStreaming bit in LogicalPlan is based on an old model. Switching to 
OutputMode will allow us to more easily integrate with things that require 
specific OutputModes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark

2017-08-11 Thread Ratan Rai Sur (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123738#comment-16123738
 ] 

Ratan Rai Sur commented on SPARK-21685:
---

The python wrapper is generated so I've pasted it here so you don't have to 
build it:


{code:java}
# Copyright (C) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See LICENSE in the project root for 
information.


import sys
if sys.version >= '3':
basestring = str

from pyspark.ml.param.shared import *
from pyspark import keyword_only
from pyspark.ml.util import JavaMLReadable, JavaMLWritable
from pyspark.ml.wrapper import JavaTransformer, JavaEstimator, JavaModel
from pyspark.ml.common import inherit_doc
from mmlspark.Utils import *

@inherit_doc
class _CNTKModel(ComplexParamsMixin, JavaMLReadable, JavaMLWritable, 
JavaTransformer):
"""
The ``CNTKModel`` evaluates a pre-trained CNTK model in parallel.  The
``CNTKModel`` takes a path to a model and automatically loads and
distributes the model to workers for parallel evaluation using CNTK's
java bindings.

The ``CNTKModel`` loads the pretrained model into the ``Function`` class
of CNTK.  One can decide which node of the CNTK Function computation
graph to evaluate by passing in the name of the output node with the
output node parameter.  Currently the ``CNTKModel`` supports single
input single output models.

The ``CNTKModel`` takes an input column which should be a column of
spark vectors and returns a column of spark vectors representing the
activations of the selected node.  By default, the CNTK model defaults
to using the model's first input and first output node.

Args:
inputCol (str): The name of the input column (undefined)
inputNode (int): index of the input node (default: 0)
miniBatchSize (int): size of minibatches (default: 10)
model (object): Array of bytes containing the serialized CNTKModel 
(undefined)
outputCol (str): The name of the output column (undefined)
outputNodeIndex (int): index of the output node (default: 0)
outputNodeName (str): name of the output node (undefined)
"""

@keyword_only
def __init__(self, inputCol=None, inputNode=0, miniBatchSize=10, 
model=None, outputCol=None, outputNodeIndex=0, outputNodeName=None):
super(_CNTKModel, self).__init__()
self._java_obj = self._new_java_obj("com.microsoft.ml.spark.CNTKModel")
self.inputCol = Param(self, "inputCol", "inputCol: The name of the 
input column (undefined)")
self.inputNode = Param(self, "inputNode", "inputNode: index of the 
input node (default: 0)")
self._setDefault(inputNode=0)
self.miniBatchSize = Param(self, "miniBatchSize", "miniBatchSize: size 
of minibatches (default: 10)")
self._setDefault(miniBatchSize=10)
self.model = Param(self, "model", "model: Array of bytes containing the 
serialized CNTKModel (undefined)")
self.outputCol = Param(self, "outputCol", "outputCol: The name of the 
output column (undefined)")
self.outputNodeIndex = Param(self, "outputNodeIndex", "outputNodeIndex: 
index of the output node (default: 0)")
self._setDefault(outputNodeIndex=0)
self.outputNodeName = Param(self, "outputNodeName", "outputNodeName: 
name of the output node (undefined)")
if hasattr(self, "_input_kwargs"):
kwargs = self._input_kwargs
else:
kwargs = self.__init__._input_kwargs
self.setParams(**kwargs)

@keyword_only
def setParams(self, inputCol=None, inputNode=0, miniBatchSize=10, 
model=None, outputCol=None, outputNodeIndex=0, outputNodeName=None):
"""
Set the (keyword only) parameters

Args:
inputCol (str): The name of the input column (undefined)
inputNode (int): index of the input node (default: 0)
miniBatchSize (int): size of minibatches (default: 10)
model (object): Array of bytes containing the serialized CNTKModel 
(undefined)
outputCol (str): The name of the output column (undefined)
outputNodeIndex (int): index of the output node (default: 0)
outputNodeName (str): name of the output node (undefined)
"""
if hasattr(self, "_input_kwargs"):
kwargs = self._input_kwargs
else:
kwargs = self.__init__._input_kwargs
return self._set(**kwargs)

def setInputCol(self, value):
"""

Args:
inputCol (str): The name of the input column (undefined)

"""
self._set(inputCol=value)
return self


def getInputCol(self):
"""

Returns:
str: The name of the input column (undefined)
"""
return self.getOrDefault(self.inputCol)


def setInputNode(self, value):
"""

   

[jira] [Created] (SPARK-21712) Clarify PySpark Column.substr() type checking error message

2017-08-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-21712:


 Summary: Clarify PySpark Column.substr() type checking error 
message
 Key: SPARK-21712
 URL: https://issues.apache.org/jira/browse/SPARK-21712
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Nicholas Chammas
Priority: Trivial


https://github.com/apache/spark/blob/f0169a1c6a1ac06045d57f8aaa2c841bb39e23ac/python/pyspark/sql/column.py#L408-L409

"Can not mix the type" is really unclear.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahesh Ambule updated SPARK-21711:
--
Description: 
Currently, log4j properties can be specified in spark 'conf' directory in 
log4j.properties file.
The spark-submit command can override these log4j properties for driver and 
executors. 
But it can not override these log4j properties for *spark client * application.

The user should be able to pass log4j properties for spark client using the 
spark-submit command.

  was:
Currently, log4j properties can be specified in spark 'conf' directory in 
log4j.properties file.
The spark-submit command can override these log4j properties for driver and 
executors logs. But it can not override log4j properties for spark and yarn 
clients in yarn-cluster mode.

The user should be able to pass log4j properties for spark and yarn clients 
using the spark-submit command.


> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors. 
> But it can not override these log4j properties for *spark client * 
> application.
> The user should be able to pass log4j properties for spark client using the 
> spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mahesh Ambule updated SPARK-21711:
--
Description: 
Currently, log4j properties can be specified in spark 'conf' directory in 
log4j.properties file.
The spark-submit command can override these log4j properties for driver and 
executors logs. But it can not override log4j properties for spark and yarn 
clients in yarn-cluster mode.

The user should be able to pass log4j properties for spark and yarn clients 
using the spark-submit command.

  was:
Currently, log4j properties can be specified in spark 'conf' directory in 
log4j.properties file.
The spark-submit command can override these log4j properties for driver and 
executors logs. But it can not override log4j properties for spark and yarn 
clients in yarn-cluster mode.

The user should be able to pass log4j properties for spark and yarn client 
using spark-submit command.


> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
>
> Currently, log4j properties can be specified in spark 'conf' directory in 
> log4j.properties file.
> The spark-submit command can override these log4j properties for driver and 
> executors logs. But it can not override log4j properties for spark and yarn 
> clients in yarn-cluster mode.
> The user should be able to pass log4j properties for spark and yarn clients 
> using the spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-11 Thread Mahesh Ambule (JIRA)
Mahesh Ambule created SPARK-21711:
-

 Summary: spark-submit command should accept log4j configuration 
parameters for spark client logging.
 Key: SPARK-21711
 URL: https://issues.apache.org/jira/browse/SPARK-21711
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.1.0, 1.6.0
Reporter: Mahesh Ambule
Priority: Minor


Currently, log4j properties can be specified in spark 'conf' directory in 
log4j.properties file.
The spark-submit command can override these log4j properties for driver and 
executors logs. But it can not override log4j properties for spark and yarn 
clients in yarn-cluster mode.

The user should be able to pass log4j properties for spark and yarn client 
using spark-submit command.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21710) ConsoleSink causes OOM crashes with large inputs.

2017-08-11 Thread Gerard Maas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123485#comment-16123485
 ] 

Gerard Maas commented on SPARK-21710:
-

PR: https://github.com/apache/spark/pull/18923

> ConsoleSink causes OOM crashes with large inputs.
> -
>
> Key: SPARK-21710
> URL: https://issues.apache.org/jira/browse/SPARK-21710
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: affects all environments
>Reporter: Gerard Maas
>  Labels: easyfix
>
> ConsoleSink does a full collect of the streaming dataset in order to show few 
> lines on screen. This is problematic with large inputs, like a kafka backlog 
> or a file source with files larger than the driver's memory.
> Here's an example:
> {code:java}
> import spark.implicits._
> import org.apache.spark.sql.functions
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.types._
> val schema = StructType(StructField("text", StringType, true) :: Nil)
> val lines = spark
>   .readStream
>   .format("text")
>   .option("path", "/tmp/data")
>   .schema(schema)
>   .load()
> val base = lines.writeStream
>   .outputMode("append")
>   .format("console")
>   .start()
> {code}
> When a large file larger than the available driver memory is fed through this 
> streaming job, we get:
> {code:java}
> ---
> Batch: 0
> ---
> [Stage 0:>(0 + 8) / 
> 111]17/08/11 15:10:45 ERROR Executor: Exception in task 6.0 in stage 0.0 (TID 
> 6)
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3236)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>   at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at 
> net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
>   at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
>   at java.io.DataOutputStream.write(DataOutputStream.java:107)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> 17/08/11 15:10:45 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker for task 6,5,main]
> java.lang.OutOfMemoryError: Java heap space
> {code}
> This issue can be traced back to a `collect` on the source `DataFrame`:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala#L52
> A fairly simple solution would be to do a `take(numRows)` instead of the 
> collect. (PR in progress)
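
For readers skimming the thread, a minimal sketch of the suggested direction 
(illustration only, not the actual PR; the object and method names are invented 
for the example):

{code:scala}
import org.apache.spark.sql.{DataFrame, Row}

// ConsoleSink receives each micro-batch as a DataFrame. take(numRows) fetches at
// most numRows rows to the driver, whereas collect() materializes the whole
// batch there and can OOM the driver on large inputs.
object ConsoleSinkSketch {
  def printBatch(batchId: Long, data: DataFrame, numRows: Int = 20): Unit = {
    println(s"Batch: $batchId")
    val rows: Array[Row] = data.take(numRows) // bounded, unlike data.collect()
    rows.foreach(println)
  }
}
{code}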



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21692) Modify PythonUDF to support nullability

2017-08-11 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123461#comment-16123461
 ] 

Michael Styles commented on SPARK-21692:


https://github.com/apache/spark/pull/18906

> Modify PythonUDF to support nullability
> ---
>
> Key: SPARK-21692
> URL: https://issues.apache.org/jira/browse/SPARK-21692
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Michael Styles
>
> When creating or registering Python UDFs, a user may know whether null values 
> can be returned by the function. PythonUDF and related classes should be 
> modified to support nullability.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread PJ Fanning (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated SPARK-21708:
---
Description: 
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html

According to https://github.com/sbt/sbt/issues/3424, we will need to change the 
HTTP location where we get the sbt-launch jar.

Other related issues:
SPARK-14401
https://github.com/typesafehub/sbteclipse/issues/343
https://github.com/jrudolph/sbt-dependency-graph/issues/134
https://github.com/AlpineNow/junit_xml_listener/issues/6
https://github.com/spray/sbt-revolver/issues/62
https://github.com/ihji/sbt-antlr4/issues/14


  was:
I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released.
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html

Other related issues:
SPARK-14401
https://github.com/sbt/sbt/issues/3424
https://github.com/typesafehub/sbteclipse/issues/343
https://github.com/jrudolph/sbt-dependency-graph/issues/134
https://github.com/AlpineNow/junit_xml_listener/issues/6
https://github.com/spray/sbt-revolver/issues/62
https://github.com/ihji/sbt-antlr4/issues/14



> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123454#comment-16123454
 ] 

Sean Owen commented on SPARK-21708:
---

Yeah, I'm thinking about people that use IDEs with built-in SBT integration, or 
who will use `sbt` directly, etc. It's not a blocker, just something to consider.

> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
> released.
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> Other related issues:
> SPARK-14401
> https://github.com/sbt/sbt/issues/3424
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread PJ Fanning (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123451#comment-16123451
 ] 

PJ Fanning commented on SPARK-21708:


[~srowen] the build/sbt scripting will download the preferred sbt version. With 
a good internet connection, it takes a couple of minutes.

> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
> released.
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> Other related issues:
> SPARK-14401
> https://github.com/sbt/sbt/issues/3424
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123443#comment-16123443
 ] 

Sean Owen commented on SPARK-21656:
---

In the end I still don't quite agree with how you frame it here. It's making 
some jobs use more resources to let _other_ jobs move faster when bumping up 
against timeout limits. Accepting the downside of this change just so that 
someone doesn't have to tune their job is not compelling, so I'd discard that 
argument. It is compelling to solve the "busy driver" and "0 executor" problems. 
I'd have preferred to frame it that way from the get-go. This discussion isn't 
going to get farther, and agreeing on an outcome while disagreeing about why is 
close enough.

> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless either an executor dies, is explicitly killed by yarn or it goes 
> to the next stage.  The dynamic allocation manager has the concept of idle 
> timeout. Currently this says if a task hasn't been scheduled on that executor 
> for a configurable amount of time (60 seconds by default), then let that 
> executor go.  Note when it lets that executor go due to the idle timeout it 
> never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1 . Things can happen in the system that are not expected that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc we could idle timeout executors even though there are 
> tasks to run on them its just the scheduler hasn't had time to start those 
> tasks.  Note that in the worst case this allows the number of executors to go 
> to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality, the dynamic allocation doesn't know about 
> this and if it lets the executors go it hurts the scheduler from doing what 
> it was designed to do.  For example the scheduler first tries to schedule 
> node local, during this time it can skip scheduling on some executors.  After 
> a while though the scheduler falls back from node local to scheduler on rack 
> local, and then eventually on any node.  So during when the scheduler is 
> doing node local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go and it can't 
> scheduler all the tasks it could which can have a huge negative impact on job 
> run time.
>  
> In both of these cases when the executors idle timeout we never go back to 
> check to see if we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and deadlock, but generally this shows itself by 
> just going down to very few executors when you could have 10's of thousands 
> of tasks to run on them, which causes the job to take way more time (in my 
> case I've seen it should take minutes and it takes hours due to only been 
> left a few executors).  
> We should handle these situations in Spark.   The most straight forward 
> approach would be to not allow the executors to idle timeout when there are 
> tasks that could run on those executors. This would allow the scheduler to do 
> its job with locality scheduling.  In doing this it also fixes number 1 above 
> because you never can go into a deadlock as it will keep enough executors to 
> run all the tasks on. 
> There are other approaches to fix this, like explicitly prevent it from going 
> to 0 executors, that prevents a deadlock but can still cause the job to 
> slowdown greatly.  We could also change it at some point to just re-check to 
> see if we should get more executors, but this adds extra logic, we would have 
> to decide when to check, its also just overhead in letting them go and then 
> re-acquiring them again and this would cause some slowdown in the job as the 
> executors aren't immediately there for the scheduler to place things on. 
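
For context, these are the two settings in tension in the scenario described 
above. The values are purely illustrative of the tuning workaround discussed in 
this thread (not recommended defaults), and the application class and jar are 
placeholders.

{code}
# spark.dynamicAllocation.executorIdleTimeout (default 60s): how long an executor
#   may sit without a scheduled task before it is released.
# spark.locality.wait (default 3s): how long the scheduler waits for a node-local
#   slot before falling back to rack-local and then any executor.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorIdleTimeout=120s \
  --conf spark.locality.wait=1s \
  --class com.example.MyApp my-app.jar
{code}

Raising the idle timeout or lowering the locality wait narrows the window in 
which idle executors get released, which is exactly the tuning the proposed fix 
aims to make unnecessary.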



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Comment Edited] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123439#comment-16123439
 ] 

Thomas Graves edited comment on SPARK-21656 at 8/11/17 2:43 PM:


Note, I've never said there is no counterpart scenario, and if you read what I 
said in the PR you will see that:

| It doesn't hurt the common case, the common case is all your executors have 
tasks on them as long as there are tasks to run. Normally scheduler can fill up 
the executors. It will use more resources if the scheduler takes time to put 
tasks on them, but that versus the time wasted in jobs that don't have enough 
executors to run on is hard to quantify because its going to be so application 
dependent. yes it is a behavior change but a behavior change that is fixing an 
issue.


was (Author: tgraves):
Note, I've never said there is no counter part scenarioNote and you read what I 
said in the pr you will see that:

| It doesn't hurt the common case, the common case is all your executors have 
tasks on them as long as there are tasks to run. Normally scheduler can fill up 
the executors. It will use more resources if the scheduler takes time to put 
tasks on them, but that versus the time wasted in jobs that don't have enough 
executors to run on is hard to quantify because its going to be so application 
dependent. yes it is a behavior change but a behavior change that is fixing an 
issue.

> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless either an executor dies, is explicitly killed by yarn or it goes 
> to the next stage.  The dynamic allocation manager has the concept of idle 
> timeout. Currently this says if a task hasn't been scheduled on that executor 
> for a configurable amount of time (60 seconds by default), then let that 
> executor go.  Note when it lets that executor go due to the idle timeout it 
> never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1 . Things can happen in the system that are not expected that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc we could idle timeout executors even though there are 
> tasks to run on them its just the scheduler hasn't had time to start those 
> tasks.  Note that in the worst case this allows the number of executors to go 
> to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality, the dynamic allocation doesn't know about 
> this and if it lets the executors go it hurts the scheduler from doing what 
> it was designed to do.  For example the scheduler first tries to schedule 
> node local, during this time it can skip scheduling on some executors.  After 
> a while though the scheduler falls back from node local to scheduler on rack 
> local, and then eventually on any node.  So during when the scheduler is 
> doing node local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go and it can't 
> scheduler all the tasks it could which can have a huge negative impact on job 
> run time.
>  
> In both of these cases when the executors idle timeout we never go back to 
> check to see if we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and deadlock, but generally this shows itself by 
> just going down to very few executors when you could have 10's of thousands 
> of tasks to run on them, which causes the job to take way more time (in my 
> case I've seen it should take minutes and it takes hours due to only been 
> left a few executors).  
> We should handle these situations in Spark.   The most straight forward 
> approach would be to not allow the executors to idle timeout when there are 
> tasks that could run on those executors. This would allow the scheduler to do 
> its job with locality scheduling.  In doing this it also fixes number 1 above 
> because you never can go into a deadlock as it will keep enough executors to 
> run all the tasks on. 
> 

[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123439#comment-16123439
 ] 

Thomas Graves commented on SPARK-21656:
---

Note, I've never said there is no counterpart scenario, and if you read what I 
said in the PR you will see that:

| It doesn't hurt the common case, the common case is all your executors have 
tasks on them as long as there are tasks to run. Normally scheduler can fill up 
the executors. It will use more resources if the scheduler takes time to put 
tasks on them, but that versus the time wasted in jobs that don't have enough 
executors to run on is hard to quantify because its going to be so application 
dependent. yes it is a behavior change but a behavior change that is fixing an 
issue.

> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless either an executor dies, is explicitly killed by yarn or it goes 
> to the next stage.  The dynamic allocation manager has the concept of idle 
> timeout. Currently this says if a task hasn't been scheduled on that executor 
> for a configurable amount of time (60 seconds by default), then let that 
> executor go.  Note when it lets that executor go due to the idle timeout it 
> never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1 . Things can happen in the system that are not expected that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc we could idle timeout executors even though there are 
> tasks to run on them its just the scheduler hasn't had time to start those 
> tasks.  Note that in the worst case this allows the number of executors to go 
> to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality, the dynamic allocation doesn't know about 
> this and if it lets the executors go it hurts the scheduler from doing what 
> it was designed to do.  For example the scheduler first tries to schedule 
> node local, during this time it can skip scheduling on some executors.  After 
> a while though the scheduler falls back from node local to scheduler on rack 
> local, and then eventually on any node.  So during when the scheduler is 
> doing node local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go and it can't 
> scheduler all the tasks it could which can have a huge negative impact on job 
> run time.
>  
> In both of these cases when the executors idle timeout we never go back to 
> check to see if we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and deadlock, but generally this shows itself by 
> just going down to very few executors when you could have 10's of thousands 
> of tasks to run on them, which causes the job to take way more time (in my 
> case I've seen it should take minutes and it takes hours due to only been 
> left a few executors).  
> We should handle these situations in Spark.   The most straight forward 
> approach would be to not allow the executors to idle timeout when there are 
> tasks that could run on those executors. This would allow the scheduler to do 
> its job with locality scheduling.  In doing this it also fixes number 1 above 
> because you never can go into a deadlock as it will keep enough executors to 
> run all the tasks on. 
> There are other approaches to fix this, like explicitly prevent it from going 
> to 0 executors, that prevents a deadlock but can still cause the job to 
> slowdown greatly.  We could also change it at some point to just re-check to 
> see if we should get more executors, but this adds extra logic, we would have 
> to decide when to check, its also just overhead in letting them go and then 
> re-acquiring them again and this would cause some slowdown in the job as the 
> executors aren't immediately there for the scheduler to place things on. 
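
A minimal sketch of the "most straightforward approach" described above, with 
invented names; this is not the actual ExecutorAllocationManager code or the PR, 
just an illustration of the condition being proposed.

{code:scala}
// An executor is only eligible for idle timeout when the executors that would
// remain could still run every pending task in parallel; otherwise keep it.
object IdleTimeoutSketch {
  case class AllocationSnapshot(
      pendingTasks: Int,     // tasks with no running attempt yet
      liveExecutors: Int,    // currently registered executors
      tasksPerExecutor: Int) // slots per executor (cores / cpus per task)

  def canReleaseIdleExecutor(s: AllocationSnapshot): Boolean =
    s.pendingTasks <= (s.liveExecutors - 1) * s.tasksPerExecutor
}
{code}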



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123435#comment-16123435
 ] 

Thomas Graves commented on SPARK-21656:
---

Yes, there is a trade-off here: use somewhat more resources, or have your job 
run time be really slow and possibly deadlock. I completely understand the 
scenario where some executors may stay up when they aren't being used; if you 
have a better solution that handles both, please state it. As I've stated, 
changing the config is to me a workaround and not a solution. This case is 
handled by many other big data frameworks (Pig, Tez, MapReduce) and I believe 
Spark should handle it as well.

I would much rather lean towards having as many jobs as possible run as fast as 
possible without the user having to tune things, even at the expense of possibly 
using more resources. I've described two scenarios in which this problem can 
occur; there is also the extreme case where it goes to 0 that you keep 
mentioning. The fix provided addresses both of them.





> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless either an executor dies, is explicitly killed by yarn or it goes 
> to the next stage.  The dynamic allocation manager has the concept of idle 
> timeout. Currently this says if a task hasn't been scheduled on that executor 
> for a configurable amount of time (60 seconds by default), then let that 
> executor go.  Note when it lets that executor go due to the idle timeout it 
> never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1 . Things can happen in the system that are not expected that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc we could idle timeout executors even though there are 
> tasks to run on them its just the scheduler hasn't had time to start those 
> tasks.  Note that in the worst case this allows the number of executors to go 
> to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality, the dynamic allocation doesn't know about 
> this and if it lets the executors go it hurts the scheduler from doing what 
> it was designed to do.  For example the scheduler first tries to schedule 
> node local, during this time it can skip scheduling on some executors.  After 
> a while though the scheduler falls back from node local to scheduler on rack 
> local, and then eventually on any node.  So during when the scheduler is 
> doing node local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go and it can't 
> scheduler all the tasks it could which can have a huge negative impact on job 
> run time.
>  
> In both of these cases when the executors idle timeout we never go back to 
> check to see if we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and deadlock, but generally this shows itself by 
> just going down to very few executors when you could have 10's of thousands 
> of tasks to run on them, which causes the job to take way more time (in my 
> case I've seen it should take minutes and it takes hours due to only been 
> left a few executors).  
> We should handle these situations in Spark.   The most straight forward 
> approach would be to not allow the executors to idle timeout when there are 
> tasks that could run on those executors. This would allow the scheduler to do 
> its job with locality scheduling.  In doing this it also fixes number 1 above 
> because you never can go into a deadlock as it will keep enough executors to 
> run all the tasks on. 
> There are other approaches to fix this, like explicitly prevent it from going 
> to 0 executors, that prevents a deadlock but can still cause the job to 
> slowdown greatly.  We could also change it at some point to just re-check to 
> see if we should get more executors, but this adds extra logic, we would have 
> to decide when to check, its also just overhead in letting them go and then 
> re-acquiring them again and 

[jira] [Updated] (SPARK-21710) ConsoleSink causes OOM crashes with large inputs.

2017-08-11 Thread Gerard Maas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Maas updated SPARK-21710:

Description: 
ConsoleSink does a full collect of the streaming dataset in order to show a few 
lines on screen. This is problematic with large inputs, like a Kafka backlog or 
a file source with files larger than the driver's memory.

Here's an example:

{code:java}
import spark.implicits._
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._

val schema = StructType(StructField("text", StringType, true) :: Nil)

val lines = spark
  .readStream
  .format("text")
  .option("path", "/tmp/data")
  .schema(schema)
  .load()

val base = lines.writeStream
  .outputMode("append")
  .format("console")
  .start()
{code}

When a file larger than the available driver memory is fed through this 
streaming job, we get:

{code:java}
---
Batch: 0
---

[Stage 0:>(0 + 8) / 
111]17/08/11 15:10:45 ERROR Executor: Exception in task 6.0 in stage 0.0 (TID 6)
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:3236)
  at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
  at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
  at 
net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
  at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
  at java.io.DataOutputStream.write(DataOutputStream.java:107)
  at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
17/08/11 15:10:45 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Executor task launch worker for task 6,5,main]
java.lang.OutOfMemoryError: Java heap space
{code}

This issue can be traced back to a `collect` on the source `DataFrame`:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala#L52

A fairly simple solution would be to do a `take(numRows)` instead of the 
collect. (PR in progress)

  was:
ConsoleSink does a full collect of the streaming dataset in order to show few 
lines on screen. This is problematic with large inputs, like a kafka backlog or 
a file source with files larger than the driver's memory.

Here's an example:

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._

val schema = StructType(StructField("text", StringType, true) :: Nil)

val lines = spark
  .readStream
  .format("text")
  .option("path", "/tmp/data")
  .schema(schema)
  .load()

val base = lines.writeStream
  .outputMode("append")
  .format("console")
  .start()
{code}

When a large file larger than the available driver memory is fed through this 
streaming job, we get:

{code:java}
---
Batch: 0
---

[Stage 0:>(0 + 8) / 
111]17/08/11 15:10:45 ERROR Executor: Exception in task 6.0 in stage 0.0 (TID 6)
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:3236)
  at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
  at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
  at 
net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
  at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
  at java.io.DataOutputStream.write(DataOutputStream.java:107)
  at 

[jira] [Created] (SPARK-21710) ConsoleSink causes OOM crashes with large inputs.

2017-08-11 Thread Gerard Maas (JIRA)
Gerard Maas created SPARK-21710:
---

 Summary: ConsoleSink causes OOM crashes with large inputs.
 Key: SPARK-21710
 URL: https://issues.apache.org/jira/browse/SPARK-21710
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
 Environment: affects all environments
Reporter: Gerard Maas


ConsoleSink does a full collect of the streaming dataset in order to show a few 
lines on screen. This is problematic with large inputs, like a Kafka backlog or 
a file source with files larger than the driver's memory.

Here's an example:

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._

val schema = StructType(StructField("text", StringType, true) :: Nil)

val lines = spark
  .readStream
  .format("text")
  .option("path", "/tmp/data")
  .schema(schema)
  .load()

val base = lines.writeStream
  .outputMode("append")
  .format("console")
  .start()
{code}

When a file larger than the available driver memory is fed through this 
streaming job, we get:

{code:java}
---
Batch: 0
---

[Stage 0:>(0 + 8) / 
111]17/08/11 15:10:45 ERROR Executor: Exception in task 6.0 in stage 0.0 (TID 6)
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:3236)
  at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
  at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
  at 
net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
  at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
  at java.io.DataOutputStream.write(DataOutputStream.java:107)
  at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.writeToStream(UnsafeRow.java:554)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:237)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
17/08/11 15:10:45 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Executor task launch worker for task 6,5,main]
java.lang.OutOfMemoryError: Java heap space
{code}

This issue can be traced back to a `collect` on the source `DataFrame`:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala#L52

A fairly simple solution would be to do a `take(numRows)` instead of the 
collect. (PR in progress)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123411#comment-16123411
 ] 

Sean Owen commented on SPARK-21656:
---

I'm referring to the same issue you cite repeatedly, including:
https://github.com/apache/spark/pull/18874#issuecomment-321313616
https://issues.apache.org/jira/browse/SPARK-21656?focusedCommentId=16117200=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16117200
 

When something like long GC pauses keeps the driver busy, dynamic allocation 
doesn't keep up with the fact that executors are non-idle and removes them. Its 
conclusion is incorrect, and that's what we're trying to fix. All the more 
because going to 0 executors stops the stage.

Right? I thought we finally had it clear that this was the problem being fixed.

Now you're just describing a job that needs a lower locality timeout. (Or else, 
describing a different problem with different solution, as in 
https://github.com/apache/spark/pull/18874#issuecomment-321625808 -- why do 
they take so much longer than 3s to fall back to other executors?) That 
scenario is not a reason to make this change.

[~tgraves] please read 
https://github.com/apache/spark/pull/18874#issuecomment-321683515 . You're 
saying there's no counterpart scenario that is harmed even a bit by this change, 
and I think there is. We need to get on the same page.



> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless either an executor dies, is explicitly killed by yarn or it goes 
> to the next stage.  The dynamic allocation manager has the concept of idle 
> timeout. Currently this says if a task hasn't been scheduled on that executor 
> for a configurable amount of time (60 seconds by default), then let that 
> executor go.  Note when it lets that executor go due to the idle timeout it 
> never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1 . Things can happen in the system that are not expected that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc we could idle timeout executors even though there are 
> tasks to run on them its just the scheduler hasn't had time to start those 
> tasks.  Note that in the worst case this allows the number of executors to go 
> to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality, the dynamic allocation doesn't know about 
> this and if it lets the executors go it hurts the scheduler from doing what 
> it was designed to do.  For example the scheduler first tries to schedule 
> node local, during this time it can skip scheduling on some executors.  After 
> a while though the scheduler falls back from node local to scheduler on rack 
> local, and then eventually on any node.  So during when the scheduler is 
> doing node local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go and it can't 
> scheduler all the tasks it could which can have a huge negative impact on job 
> run time.
>  
> In both of these cases when the executors idle timeout we never go back to 
> check to see if we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and deadlock, but generally this shows itself by 
> just going down to very few executors when you could have 10's of thousands 
> of tasks to run on them, which causes the job to take way more time (in my 
> case I've seen it should take minutes and it takes hours due to only been 
> left a few executors).  
> We should handle these situations in Spark.   The most straight forward 
> approach would be to not allow the executors to idle timeout when there are 
> tasks that could run on those executors. This would allow the scheduler to do 
> its job with locality scheduling.  In doing this it also fixes number 1 above 
> because you never can go into a deadlock as it will keep enough executors to 
> run all the tasks on. 
> There are other approaches to fix this, like explicitly prevent it from 

[jira] [Comment Edited] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-11 Thread Xu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123096#comment-16123096
 ] 

Xu Zhang edited comment on SPARK-21701 at 8/11/17 2:23 PM:
---

Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are TCP options, and on 
the server side the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If the server side specifies a bigger SO_RCVBUF size while 
the client has no place to set the corresponding SO_SNDBUF param (the value can 
be set at the OS level; if it is not set, the default is used), then the peers 
may end up using a relatively small "sliding window" for later communication. 
Thus the param set in SparkConf does not take effect in the transport phase as 
expected. To achieve consistency on both the client and server side, enabling 
the client to read these params from SparkConf would make sense. Moreover, since 
the Spark RPC module is not meant to be a high-throughput, performance-critical 
client/server system, this should not be a big problem, so I labeled the ticket 
as an improvement.

In short, my point is that it would be better to expose a way to set these 
params on the client side, consistent with how the server side reads them from 
SparkConf.

I have already created a PR for this update and some refinement work, for more 
details please visit https://github.com/apache/spark/pull/18922.

Thanks


was (Author: xu.zhang):
Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding SO_SNDBUF param (although we 
can set the value in OS, if no set, use default value), then peers may use a 
relatively small "sliding window" for later communication. Thus the param set 
in SparkConf does not take effect in transport phase as expected. To achieve 
consistency in both client and server side, enable client to get these params 
from SparkConf would make sense. Moreover, due to the fact that spark RPC 
module is not for high throughput and performant C/S service system, it should 
not to be a big problem, so I set the ticket to improvement label.

In a word, my point is it would be better to keep an entrance to the outside 
world to set these params in client side and keep consistent with how server 
side specifies these params.

I have already created a PR for this update and some refinement work, for more 
details please visit https://github.com/apache/spark/pull/18922.

Thanks

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xu Zhang
>Priority: Trivial
>
> For TransportClientFactory class, there are no params derived from SparkConf 
> to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. 
> Increasing the receive buffer size can increase the I/O performance for 
> high-volume transport.
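
A hedged sketch of the kind of client-side change being discussed (illustrative 
only, not the patch in the PR; the buffer sizes stand in for values that would 
be read from SparkConf):

{code:scala}
import io.netty.bootstrap.Bootstrap
import io.netty.channel.ChannelOption
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.nio.NioSocketChannel

object ClientBufferSketch {
  // Roughly mirror what TransportServer does for its ServerBootstrap: only set
  // the socket buffer options when a positive size has been configured.
  def configuredBootstrap(receiveBuf: Int, sendBuf: Int): Bootstrap = {
    val bootstrap = new Bootstrap()
      .group(new NioEventLoopGroup())
      .channel(classOf[NioSocketChannel])
    if (receiveBuf > 0) {
      bootstrap.option[Integer](ChannelOption.SO_RCVBUF, receiveBuf)
    }
    if (sendBuf > 0) {
      bootstrap.option[Integer](ChannelOption.SO_SNDBUF, sendBuf)
    }
    bootstrap
  }
}
{code}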



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-11 Thread Xu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123096#comment-16123096
 ] 

Xu Zhang edited comment on SPARK-21701 at 8/11/17 2:22 PM:
---

Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding SO_SNDBUF param (although we 
can set the value in OS, if no set, use default value), then peers may use a 
relatively small "sliding window" for later communication. Thus the param set 
in SparkConf does not take effect in transport phase as expected. To achieve 
consistency in both client and server side, enable client to get these params 
from SparkConf would make sense. Moreover, due to the fact that spark RPC 
module is not for high throughput and performant C/S service system, it should 
not to be a big problem, so I set the ticket to improvement label.

In a word, my point is it would be better to keep an entrance to the outside 
world to set these params in client side and keep consistent with how server 
side specifies these params.

I have already created a PR for this update and some refinement work, for more 
details please visit https://github.com/apache/spark/pull/18922.

Thanks


was (Author: xu.zhang):
Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding SO_SNDBUF param (although we 
can set the value in OS, if no set, use default value), then peers may use a 
relatively small "sliding window" for later communication. Thus the param set 
in SparkConf does not take effect in transport phase as expected. To achieve 
consistency in both client and server side, enable client to get these params 
from SparkConf would make sense. Moreover, due to the fact that spark RPC 
module is not for high throughput and performant C/S service system, it should 
not to be a big problem, so I set the ticket to improvement label.

In a word, my point is it would be better to keep an entrance to the outside 
world to set these params in client side and keep consistent with how server 
side specifies these params.

Thanks

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xu Zhang
>Priority: Trivial
>
> For TransportClientFactory class, there are no params derived from SparkConf 
> to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. 
> Increasing the receive buffer size can increase the I/O performance for 
> high-volume transport.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21708:
--
Priority: Minor  (was: Major)

Just wondering if this will force end users to move to SBT 1.0, or will it just 
enable it? It's not out of the question, as it only affects developers, but it's 
worth keeping in mind.

> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
> released.
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> Other related issues:
> SPARK-14401
> https://github.com/sbt/sbt/issues/3424
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21709) use sbt 0.13.16 and update sbt plugins

2017-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21709:
--
Priority: Minor  (was: Major)

OK, not sure if I'd treat these separately, as we have no burning need to get 
onto 0.13.16. But please feel free to make it happen.

> use sbt 0.13.16 and update sbt plugins
> --
>
> Key: SPARK-21709
> URL: https://issues.apache.org/jira/browse/SPARK-21709
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> A preliminary step to SPARK-21708.
> Quite a lot of sbt plugin changes needed to get to full sbt 1.0.0 support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123393#comment-16123393
 ] 

Thomas Graves commented on SPARK-21656:
---

I don't know what you mean by busy driver.  The example test results show this 
is fixing the issue.  The issue is as I've described in the description of the 
JIRA.  In this case it's due to the scheduler and the fact that it doesn't 
immediately use the executors because of the locality settings; as long as you 
keep those executors around (don't idle-timeout them) they do get used, and it 
has a huge impact on the run time.  The executors only eventually get tasks 
because of the scheduler locality delay.  

I don't know what you mean by the flip-side of the situation and how this gets 
worse.

If you want something to compare to, go see how other frameworks do this same 
thing, Tez for instance. This fix changes it so it acts very similarly to those.



> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation Spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless either an executor dies, is explicitly killed by YARN, or it goes 
> to the next stage.  The dynamic allocation manager has the concept of an idle 
> timeout. Currently this says that if a task hasn't been scheduled on an 
> executor for a configurable amount of time (60 seconds by default), then let 
> that executor go.  Note that when it lets that executor go due to the idle 
> timeout, it never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1. Things can happen in the system that are not expected and that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc., we could idle timeout executors even though there are 
> tasks to run on them; it's just that the scheduler hasn't had time to start 
> those tasks.  Note that in the worst case this allows the number of executors 
> to go to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality; dynamic allocation doesn't know about 
> this, and if it lets the executors go it keeps the scheduler from doing what 
> it was designed to do.  For example, the scheduler first tries to schedule 
> node local, and during this time it can skip scheduling on some executors.  
> After a while, though, the scheduler falls back from node local to scheduling 
> on rack local, and then eventually on any node.  So while the scheduler is 
> doing node-local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go and it can't 
> schedule all the tasks it could, which can have a huge negative impact on job 
> run time.
>  
> In both of these cases, when the executors idle timeout we never go back to 
> check whether we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and a deadlock, but generally this shows itself 
> by just going down to very few executors when you could have tens of 
> thousands of tasks to run on them, which causes the job to take far more time 
> (in my case I've seen a job that should take minutes take hours because it 
> was left with only a few executors).  
> We should handle these situations in Spark.  The most straightforward 
> approach would be to not allow executors to idle timeout when there are 
> tasks that could run on them. This would allow the scheduler to do its job 
> with locality scheduling.  In doing this it also fixes number 1 above, 
> because you can never go into a deadlock as it will keep enough executors to 
> run all the tasks on. 
> There are other approaches to fix this, like explicitly preventing it from 
> going to 0 executors; that prevents a deadlock but can still cause the job to 
> slow down greatly.  We could also change it at some point to just re-check 
> whether we should get more executors, but this adds extra logic, we would 
> have to decide when to check, and it's also just overhead in letting them go 
> and then re-acquiring them again, which would cause some slowdown in the job 
> as the executors aren't immediately there for the scheduler to place things 
> on. 
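Until a change like the one proposed here lands, the only levers are ordinary configuration; a minimal sketch of the mitigation available today (a longer idle timeout and/or a shorter locality wait), assuming dynamic allocation with the external shuffle service and a Spark 2.x SparkSession:

{code:scala}
// Sketch only: these settings soften the symptom (executors idling out while the
// scheduler is still waiting on locality); they are not the fix proposed in the PR.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")               // required by dynamic allocation
  .config("spark.dynamicAllocation.executorIdleTimeout", "300s") // default is 60s
  .config("spark.locality.wait", "1s")                           // default is 3s; fall back to rack/any sooner
  .getOrCreate()
{code}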

[jira] [Updated] (SPARK-21709) use sbt 0.13.16 and update sbt plugins

2017-08-11 Thread PJ Fanning (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated SPARK-21709:
---
Description: 
A preliminary step to SPARK-21708.
Quite a lot of sbt plugin changes needed to get to full sbt 1.0.0 support.


  was:
I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released.
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html

Other related issues:
SPARK-14401
https://github.com/sbt/sbt/issues/3424
https://github.com/typesafehub/sbteclipse/issues/343
https://github.com/jrudolph/sbt-dependency-graph/issues/134
https://github.com/AlpineNow/junit_xml_listener/issues/6
https://github.com/spray/sbt-revolver/issues/62
https://github.com/ihji/sbt-antlr4/issues/14



> use sbt 0.13.16 and update sbt plugins
> --
>
> Key: SPARK-21709
> URL: https://issues.apache.org/jira/browse/SPARK-21709
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>
> A preliminary step to SPARK-21708.
> Quite a lot of sbt plugin changes needed to get to full sbt 1.0.0 support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21709) use sbt 0.13.16 and update sbt plugins

2017-08-11 Thread PJ Fanning (JIRA)
PJ Fanning created SPARK-21709:
--

 Summary: use sbt 0.13.16 and update sbt plugins
 Key: SPARK-21709
 URL: https://issues.apache.org/jira/browse/SPARK-21709
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.3.0
Reporter: PJ Fanning


I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released.
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html

Other related issues:
SPARK-14401
https://github.com/sbt/sbt/issues/3424
https://github.com/typesafehub/sbteclipse/issues/343
https://github.com/jrudolph/sbt-dependency-graph/issues/134
https://github.com/AlpineNow/junit_xml_listener/issues/6
https://github.com/spray/sbt-revolver/issues/62
https://github.com/ihji/sbt-antlr4/issues/14




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread PJ Fanning (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated SPARK-21708:
---
Description: 
I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released.
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html

Other related issues:
SPARK-14401
https://github.com/sbt/sbt/issues/3424
https://github.com/typesafehub/sbteclipse/issues/343
https://github.com/jrudolph/sbt-dependency-graph/issues/134
https://github.com/AlpineNow/junit_xml_listener/issues/6
https://github.com/spray/sbt-revolver/issues/62
https://github.com/ihji/sbt-antlr4/issues/14


  was:
I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released.
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html

Other related issues:
SPARK-14401
https://github.com/sbt/sbt/issues/3424
https://github.com/jrudolph/sbt-dependency-graph/issues/134
https://github.com/AlpineNow/junit_xml_listener/issues/6
https://github.com/spray/sbt-revolver/issues/62
https://github.com/ihji/sbt-antlr4/issues/14



> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>
> I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
> released.
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> Other related issues:
> SPARK-14401
> https://github.com/sbt/sbt/issues/3424
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21675) Add a navigation bar at the bottom of the Details for Stage Page

2017-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21675:
-

Assignee: Kent Yao

> Add a navigation bar at the bottom of the Details for Stage Page
> 
>
> Key: SPARK-21675
> URL: https://issues.apache.org/jira/browse/SPARK-21675
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 2.3.0
>
>
> 1. In the Spark Web UI, the Details for Stage page doesn't have a navigation 
> bar at the bottom. When we scroll down to the bottom, it would be better to 
> have a navigation bar right there to go wherever we want.
> 2. Executor ID is not the same as Host; we may separate them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21675) Add a navigation bar at the bottom of the Details for Stage Page

2017-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21675.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18893
[https://github.com/apache/spark/pull/18893]

> Add a navigation bar at the bottom of the Details for Stage Page
> 
>
> Key: SPARK-21675
> URL: https://issues.apache.org/jira/browse/SPARK-21675
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Kent Yao
>Priority: Minor
> Fix For: 2.3.0
>
>
> 1. In the Spark Web UI, the Details for Stage page doesn't have a navigation 
> bar at the bottom. When we scroll down to the bottom, it would be better to 
> have a navigation bar right there to go wherever we want.
> 2. Executor ID is not the same as Host; we may separate them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123373#comment-16123373
 ] 

Sean Owen commented on SPARK-21656:
---

Is this the 'busy driver' scenario that the PR contemplates? If not, then this 
may be true, but it's not the motivation of the PR, right? This is just a case 
where you need a shorter locality timeout, or something. It's also not the 
0-executor scenario that is the motivation of the PR either.

If this is the 'busy driver' scenario, then I also wonder what happens if you 
increase the locality timeout. That was one unfinished thread in the PR 
discussion: why do the other executors only get tasks so eventually?

I want to stay clear on what we're helping with here, and also what the cost 
is: see the flip side of this situation described in the PR, which could get 
worse.

> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation Spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless either an executor dies, is explicitly killed by YARN, or it goes 
> to the next stage.  The dynamic allocation manager has the concept of an idle 
> timeout. Currently this says that if a task hasn't been scheduled on an 
> executor for a configurable amount of time (60 seconds by default), then let 
> that executor go.  Note that when it lets that executor go due to the idle 
> timeout, it never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1. Things can happen in the system that are not expected and that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc., we could idle timeout executors even though there are 
> tasks to run on them; it's just that the scheduler hasn't had time to start 
> those tasks.  Note that in the worst case this allows the number of executors 
> to go to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality; dynamic allocation doesn't know about 
> this, and if it lets the executors go it keeps the scheduler from doing what 
> it was designed to do.  For example, the scheduler first tries to schedule 
> node local, and during this time it can skip scheduling on some executors.  
> After a while, though, the scheduler falls back from node local to scheduling 
> on rack local, and then eventually on any node.  So while the scheduler is 
> doing node-local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go and it can't 
> schedule all the tasks it could, which can have a huge negative impact on job 
> run time.
>  
> In both of these cases, when the executors idle timeout we never go back to 
> check whether we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and a deadlock, but generally this shows itself 
> by just going down to very few executors when you could have tens of 
> thousands of tasks to run on them, which causes the job to take far more time 
> (in my case I've seen a job that should take minutes take hours because it 
> was left with only a few executors).  
> We should handle these situations in Spark.  The most straightforward 
> approach would be to not allow executors to idle timeout when there are 
> tasks that could run on them. This would allow the scheduler to do its job 
> with locality scheduling.  In doing this it also fixes number 1 above, 
> because you can never go into a deadlock as it will keep enough executors to 
> run all the tasks on. 
> There are other approaches to fix this, like explicitly preventing it from 
> going to 0 executors; that prevents a deadlock but can still cause the job to 
> slow down greatly.  We could also change it at some point to just re-check 
> whether we should get more executors, but this adds extra logic, we would 
> have to decide when to check, and it's also just overhead in letting them go 
> and then re-acquiring them again, which would cause some slowdown in the job 
> as the executors aren't immediately there for the scheduler to place things 
> on. 



--
This message was sent by Atlassian JIRA

[jira] [Commented] (SPARK-21686) spark.sql.hive.convertMetastoreOrc is causing NullPointerException while reading ORC tables

2017-08-11 Thread Ernani Pereira de Mattos Junior (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123361#comment-16123361
 ] 

Ernani Pereira de Mattos Junior commented on SPARK-21686:
-

Hello [~viirya],
I cannot confirm nor deny; disabling spark.sql.hive.convertMetastoreOrc is an 
acceptable workaround, and I opened this JIRA to document my experience for 
future reference.

Regards

> spark.sql.hive.convertMetastoreOrc is causing NullPointerException while 
> reading ORC tables
> ---
>
> Key: SPARK-21686
> URL: https://issues.apache.org/jira/browse/SPARK-21686
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
> Environment: spark_2_4_2_0_258-1.6.1.2.4.2.0-258.el6.noarch 
> spark_2_4_2_0_258-python-1.6.1.2.4.2.0-258.el6.noarch 
> spark_2_4_2_0_258-yarn-shuffle-1.6.1.2.4.2.0-258.el6.noarch
> RHEL-7 (64-Bit)
> JDK 1.8
>Reporter: Ernani Pereira de Mattos Junior
>
> The issue is very similar to SPARK-10304; 
> Spark Query throws a NullPointerException. 
> >>> sqlContext.sql('select * from core_next.spark_categorization').show(57) 
> 17/06/19 11:26:54 ERROR Executor: Exception in task 2.0 in stage 21.0 (TID 
> 48) 
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:488)
>  
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:244)
>  
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$6.apply(OrcRelation.scala:275)
>  
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$6.apply(OrcRelation.scala:275)
>  
> Turning off ORC optimizations resolved the issue: 
> sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21656) spark dynamic allocation should not idle timeout executors when there are enough tasks to run on them

2017-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123358#comment-16123358
 ] 

Thomas Graves commented on SPARK-21656:
---

Example of test results with this:

We have a production job running 21,600 tasks.  With default locality the job 
takes 3.1 hours due to this issue. With the fix proposed in the pull request 
the job takes 17 minutes.  The fix does use more resources, but every executor 
eventually has multiple tasks run on it, demonstrating that if we hold on to 
them for a while the scheduler will fall back and use them. 

> spark dynamic allocation should not idle timeout executors when there are 
> enough tasks to run on them
> -
>
> Key: SPARK-21656
> URL: https://issues.apache.org/jira/browse/SPARK-21656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Right now with dynamic allocation Spark starts by getting the number of 
> executors it needs to run all the tasks in parallel (or the configured 
> maximum) for that stage.  After it gets that number it will never reacquire 
> more unless either an executor dies, is explicitly killed by YARN, or it goes 
> to the next stage.  The dynamic allocation manager has the concept of an idle 
> timeout. Currently this says that if a task hasn't been scheduled on an 
> executor for a configurable amount of time (60 seconds by default), then let 
> that executor go.  Note that when it lets that executor go due to the idle 
> timeout, it never goes back to see if it should reacquire more.
> This is a problem for multiple reasons:
> 1. Things can happen in the system that are not expected and that can cause 
> delays. Spark should be resilient to these. If the driver is GC'ing, you have 
> network delays, etc., we could idle timeout executors even though there are 
> tasks to run on them; it's just that the scheduler hasn't had time to start 
> those tasks.  Note that in the worst case this allows the number of executors 
> to go to 0 and we have a deadlock.
> 2. Internal Spark components have opposing requirements. The scheduler has a 
> requirement to try to get locality; dynamic allocation doesn't know about 
> this, and if it lets the executors go it keeps the scheduler from doing what 
> it was designed to do.  For example, the scheduler first tries to schedule 
> node local, and during this time it can skip scheduling on some executors.  
> After a while, though, the scheduler falls back from node local to scheduling 
> on rack local, and then eventually on any node.  So while the scheduler is 
> doing node-local scheduling, the other executors can idle timeout.  This 
> means that when the scheduler does fall back to rack or any locality where it 
> would have used those executors, we have already let them go and it can't 
> schedule all the tasks it could, which can have a huge negative impact on job 
> run time.
>  
> In both of these cases, when the executors idle timeout we never go back to 
> check whether we need more executors (until the next stage starts).  In the 
> worst case you end up with 0 and a deadlock, but generally this shows itself 
> by just going down to very few executors when you could have tens of 
> thousands of tasks to run on them, which causes the job to take far more time 
> (in my case I've seen a job that should take minutes take hours because it 
> was left with only a few executors).  
> We should handle these situations in Spark.  The most straightforward 
> approach would be to not allow executors to idle timeout when there are 
> tasks that could run on them. This would allow the scheduler to do its job 
> with locality scheduling.  In doing this it also fixes number 1 above, 
> because you can never go into a deadlock as it will keep enough executors to 
> run all the tasks on. 
> There are other approaches to fix this, like explicitly preventing it from 
> going to 0 executors; that prevents a deadlock but can still cause the job to 
> slow down greatly.  We could also change it at some point to just re-check 
> whether we should get more executors, but this adds extra logic, we would 
> have to decide when to check, and it's also just overhead in letting them go 
> and then re-acquiring them again, which would cause some slowdown in the job 
> as the executors aren't immediately there for the scheduler to place things 
> on. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11574) Spark should support StatsD sink out of box

2017-08-11 Thread Na Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123344#comment-16123344
 ] 

Na Zhao commented on SPARK-11574:
-

When can I expect this feature to be released?

> Spark should support StatsD sink out of box
> ---
>
> Key: SPARK-11574
> URL: https://issues.apache.org/jira/browse/SPARK-11574
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Xiaofeng Lin
>
> In order to run spark in production, monitoring is essential. StatsD is such 
> a common metric reporting mechanism that it should be supported out of the 
> box.  This will enable publishing metrics to monitoring services like 
> datadog, etc. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread PJ Fanning (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated SPARK-21708:
---
Description: 
I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released.
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html

Other related issues:
SPARK-14401
https://github.com/sbt/sbt/issues/3424
https://github.com/jrudolph/sbt-dependency-graph/issues/134
https://github.com/AlpineNow/junit_xml_listener/issues/6
https://github.com/spray/sbt-revolver/issues/62
https://github.com/ihji/sbt-antlr4/issues/14


  was:
I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released. https://github.com/sbt/sbt/issues/3424
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html


> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>
> I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
> released.
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> Other related issues:
> SPARK-14401
> https://github.com/sbt/sbt/issues/3424
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14401) Switch to stock sbt-pom-reader plugin

2017-08-11 Thread PJ Fanning (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123319#comment-16123319
 ] 

PJ Fanning commented on SPARK-14401:


This would be useful for a general upgrade to sbt 1.0.0

> Switch to stock sbt-pom-reader plugin
> -
>
> Key: SPARK-14401
> URL: https://issues.apache.org/jira/browse/SPARK-14401
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> Spark currently depends on a forked version of {{sbt-pom-reader}} which we 
> build from source. It would be great to port our modifications to the 
> upstream project so that we can migrate to the official version and stop 
> maintaining our fork.
> [~scrapco...@gmail.com], could you edit this ticket to fill in more detail 
> about which custom changes have not been ported yet?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread PJ Fanning (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated SPARK-21708:
---
Description: 
I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released. https://github.com/sbt/sbt/issues/3424
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html

  was:
I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released.
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html


> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>
> I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
> released. https://github.com/sbt/sbt/issues/3424
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21708) use sbt 1.0.0

2017-08-11 Thread PJ Fanning (JIRA)
PJ Fanning created SPARK-21708:
--

 Summary: use sbt 1.0.0
 Key: SPARK-21708
 URL: https://issues.apache.org/jira/browse/SPARK-21708
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.3.0
Reporter: PJ Fanning


I had a quick look and I think we'll need to wait until sbt-launch 1.0 jar is 
released.
Should improve sbt build times.
http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UTF from HDFS

2017-08-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123208#comment-16123208
 ] 

Steve Loughran commented on SPARK-21697:


What would a test to replicate look like?

# Create MiniDFS cluster (better: find a test with one already spun up @ class 
level)
# copy JAR to it (issue: can't rely on the local test suite being in a JAR due 
to SBT & IDEs doing things differently/faster than maven)
# add JAR to CP/create a CP which *only* has the JAR in
# Load something from the CP which triggers download. 

If you can assume that some common library (junit.jar?) is always in a JAR, then 
the JAR could be uploaded by locating its URL, translating it to a local path, 
and then using FileSystem.copyFromLocalFile() to upload.

Or: create/find a UDF JAR, copy it to the MiniDFSCluster, and start Spark SQL 
with the HDFS URL. This would verify the desired codepath and be the best way to 
make sure it's gone away.
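A rough, untested sketch of that idea, assuming hadoop-minicluster is on the test classpath; the object name and paths are illustrative, not taken from the Spark test suite:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hdfs.MiniDFSCluster

object HdfsJarLoadRepro {
  def main(args: Array[String]): Unit = {
    // Spin up a single-datanode MiniDFS cluster for the test.
    val cluster = new MiniDFSCluster.Builder(new Configuration()).numDataNodes(1).build()
    try {
      val fs = cluster.getFileSystem
      // Upload a JAR we know exists locally (here: whatever JAR junit was loaded from).
      val localJar = new Path(classOf[org.junit.Test].getProtectionDomain
        .getCodeSource.getLocation.toURI)
      val remoteJar = new Path("/tmp/repro.jar")
      fs.copyFromLocalFile(localJar, remoteJar)
      // Pointing Spark SQL at the HDFS copy is the step that should trigger the NPE,
      // e.g. spark.sql(s"ADD JAR ${fs.getUri}$remoteJar")
    } finally {
      cluster.shutdown()
    }
  }
}
{code}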

> NPE & ExceptionInInitializerError trying to load UTF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS is triggering an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons logging scan for {{commons-logging.properties}} is 
> happening in the classpath with the HDFS JAR; this is triggering a D/L of the 
> JAR, which needs to force in commons-logging, and, as that's not inited yet, 
> NPEs



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UTF from HDFS

2017-08-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123193#comment-16123193
 ] 

Steve Loughran commented on SPARK-21697:


PS: right now, probably doesn't work at all

> NPE & ExceptionInInitializerError trying to load UTF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS is triggering an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} log is null.
> Hypothesis: the commons logging scan for {{commons-logging.properties}} is 
> happening in the classpath with the HDFS JAR; this is triggering a D/L of the 
> JAR, which needs to force in commons-logging, and, as that's not inited yet, 
> NPEs



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2017-08-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123192#comment-16123192
 ] 

Steve Loughran commented on SPARK-12868:


SPARK-21697: harder than it would initially seem

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>Assignee: Weiqing Yang
> Fix For: 2.2.0
>
>
> When trying to add a jar with a HDFS URI, i.E
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> Via the spark sql JDBC interface it will fail with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.<init>(URL.java:593)
> at java.net.URL.<init>(URL.java:483)
> at java.net.URL.<init>(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
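For background, the "unknown protocol: hdfs" comes from java.net.URL, which only understands schemes it has a stream handler for. One classic way to teach a JVM about Hadoop filesystem URLs in general (independent of how the actual Spark 2.2.0 fix works) is shown below:

{code:scala}
import java.net.URL
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// May only be called once per JVM; afterwards new URL("hdfs://...") resolves.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
{code}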



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21707) Improvement a special case for non-deterministic filters in optimizer

2017-08-11 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21707:
-

 Summary: Improvement a special case for non-deterministic filters 
in optimizer
 Key: SPARK-21707
 URL: https://issues.apache.org/jira/browse/SPARK-21707
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: caoxuewen


Currently, the optimizer does a lot of special handling for non-deterministic 
projects and filters, but it is not good enough. This patch adds a new special 
case for non-deterministic filters, so that we only read the fields the user 
needs when there are non-deterministic filters in the optimizer.
For example, when the filter condition is non-deterministic, e.g. it contains a 
nondeterministic function (the rand function), the HiveTableScans optimizer generates:

```
HiveTableScans plan:Aggregate [k#2L], [k#2L, k#2L, sum(cast(id#1 as bigint)) AS 
sum(id)#395L]
+- Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
   +- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && 
NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
  +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
+- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT 
(cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
   +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) 
<= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
+- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:MetastoreRelation XXX_database, XXX_table

```
So HiveTableScan will read all the fields from the table, but we only need 
'd004' and 'c010'. This affects the performance of the task.
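For illustration, a query of roughly this shape is what produces such plans; the table and column names below are the anonymized placeholders from the plan above, and `spark` is assumed to be an existing SparkSession with Hive support:

{code:scala}
// Hypothetical reproduction of the plans above; the filter contains rand(),
// so it is non-deterministic, yet only d004 and c010 are actually needed.
val df = spark.sql(
  """SELECT d004 AS id, CEIL(c010) AS k
    |FROM XXX_database.XXX_table
    |WHERE d004 IS NOT NULL AND rand() <= 0.5""".stripMargin)
df.groupBy("k").sum("id").explain()
{code}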





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-11 Thread Xu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123096#comment-16123096
 ] 

Xu Zhang edited comment on SPARK-21701 at 8/11/17 9:53 AM:
---

Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and on the 
server side the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If the server side specifies a bigger SO_RCVBUF size while 
the client has no place to set the corresponding SO_SNDBUF param (although we 
can set the value in the OS; if it is not set, the default value is used), then 
the peers may use a relatively small "sliding window" for later communication. 
Thus the param set in SparkConf does not take effect in the transport phase as 
expected. To achieve consistency on both the client and server side, enabling 
the client to get these params from SparkConf would make sense. Moreover, since 
the Spark RPC module is not meant to be a high-throughput, high-performance 
client/server system, this should not be a big problem, so I set the ticket to 
the improvement label.

In a word, my point is that it would be better to keep an entry point for the 
outside world to set these params on the client side, consistent with how the 
server side specifies these params.

Thanks


was (Author: xu.zhang):
Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding param (although we can set the 
value in OS, if no set, use default value), then peers may set a relatively 
small sliding window size for later communication. Thus the param set in 
SparkConf in server side does not take effect in transport phase as expected. 
To achieve consistency in both client and server side, enable client to get 
these params from SparkConf would make sense. Moreover, due to the fact that 
spark RPC module is not for high throughput and performant C/S service system, 
it should not to be a big problem, so I set the ticket to improvement label.

In a word, my point is it would be better to keep an entrance to the outside 
world to set these params in client side and keep consistent with how server 
side specifies these params.

Thanks

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xu Zhang
>Priority: Trivial
>
> For TransportClientFactory class, there are no params derived from SparkConf 
> to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. 
> Increasing the receive buffer size can increase the I/O performance for 
> high-volume transport.
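For concreteness, a hypothetical sketch of what the client-side counterpart could look like in Netty terms; this is not the actual TransportClientFactory code, and the helper name and the "only apply when > 0" convention simply mirror how TransportServer treats its buffer settings:

{code:scala}
import io.netty.bootstrap.Bootstrap
import io.netty.channel.ChannelOption

// Hypothetical helper: apply explicitly configured buffer sizes to the client
// Bootstrap, leaving the OS defaults in place when nothing is configured.
def applyBufferSizes(bootstrap: Bootstrap, sendBuf: Int, recvBuf: Int): Bootstrap = {
  if (sendBuf > 0) bootstrap.option[Integer](ChannelOption.SO_SNDBUF, sendBuf)
  if (recvBuf > 0) bootstrap.option[Integer](ChannelOption.SO_RCVBUF, recvBuf)
  bootstrap
}
{code}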



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-11 Thread Xu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123096#comment-16123096
 ] 

Xu Zhang edited comment on SPARK-21701 at 8/11/17 9:50 AM:
---

Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding param (although we can set the 
value in OS, if no set, use default value), then peers may set a relatively 
small sliding window size for later communication. Thus the param set in 
SparkConf in server side does not take effect in transport phase as expected. 
To achieve consistency in both client and server side, enable client to get 
these params from SparkConf would make sense. Moreover, due to the fact that 
spark RPC module is not for high throughput and performant C/S service system, 
it should not to be a big problem, so I set the ticket to improvement label.

In a word, my point is it would be better to keep an entrance to the outside 
world to set these params in client side and keep consistent with how server 
side specifies these params.

Thanks


was (Author: xu.zhang):
Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding param (although we can set the 
value in OS, if no set, use default value), then peers may set a relatively 
small sliding window size for later communication. Thus the param set in 
SparkConf in server side is useless and does not take effect as expected. To 
achieve consistency in both client and server side, enable client to get these 
params from SparkConf would make sense. Moreover, due to the fact that spark 
RPC module is not for high throughput and performant C/S service system, it 
should not to be a big problem, so I set the ticket to improvement label.

In a word, my point is it would be better to keep an entrance to the outside 
world to set these params in client side and keep consistent with how server 
side specifies these params.

Thanks

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xu Zhang
>Priority: Trivial
>
> For TransportClientFactory class, there are no params derived from SparkConf 
> to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. 
> Increasing the receive buffer size can increase the I/O performance for 
> high-volume transport.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-11 Thread neoremind (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123096#comment-16123096
 ] 

neoremind edited comment on SPARK-21701 at 8/11/17 9:46 AM:


Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding param (although we can set the 
value in OS, if no set, use default value), then peers may set a relatively 
small sliding window size for later communication. Thus the param set in 
SparkConf in server side is useless and does not take effect as expected. To 
achieve consistency in both client and server side, enable client to get these 
params from SparkConf would make sense. Moreover, due to the fact that spark 
RPC module is not for high throughput and performant C/S service system, it 
should not to be a big problem, so I set the ticket to improvement label.

In a word, my point is it would be better to keep an entrance to the outside 
world to set these params in client side and keep consistent with how server 
side specifies these params.

Thanks


was (Author: xu.zhang):
Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding param (although we can set the 
value in OS, if no set, use default value), then peers may set a relatively 
small sliding window size for later communication. Thus the param set in 
SparkConf in server side is useless and does not take effect as expected. To 
achieve consistency in both client and server side, enable client to get these 
params from SparkConf would make sense. Moreover, due to the fact that spark is 
not high throughput and performant C/S service system, it should not to be a 
big problem, so I set the ticket to improvement label and lower priority.

In a word, my point is to keep an entrance to the outside world to set these 
params in client side and keep consistent with how server side specifies these 
params.

Thanks

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: neoremind
>Priority: Trivial
>
> For TransportClientFactory class, there are no params derived from SparkConf 
> to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. 
> Increasing the receive buffer size can increase the I/O performance for 
> high-volume transport.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-11 Thread neoremind (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123096#comment-16123096
 ] 

neoremind edited comment on SPARK-21701 at 8/11/17 9:44 AM:


Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding param (although we can set the 
value in OS, if no set, use default value), then peers may set a relatively 
small sliding window size for later communication. Thus the param set in 
SparkConf in server side is useless and does not take effect as expected. To 
achieve consistency in both client and server side, enable client to get these 
params from SparkConf would make sense. Moreover, due to the fact that spark is 
not high throughput and performant C/S service system, it should not to be a 
big problem, so I set the ticket to improvement label and lower priority.

In a word, my point is to keep an entrance to the outside world to set these 
params in client side and keep consistent with how server side specifies these 
params.

Thanks


was (Author: xu.zhang):
Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding param (although we can set the 
value in OS, if no set, use default value), then peers may set a relatively 
small sliding window size for later communication. Thus the param set in 
SparkConf in server side is useless and does not take effect as expected. To 
achieve consistency in both client and server side, enable client to get these 
params from SparkConf would make sense. Moreover, due to the fact that spark is 
not high throughput and performant C/S service system, it should not to be a 
big problem, so I set the ticket to improvement label and lower priority.

In a word, my point is to keep a entrance to the outside world to set these 
params in client side and keep consistent with how server side specifies these 
params.

Thanks

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: neoremind
>Priority: Trivial
>
> For TransportClientFactory class, there are no params derived from SparkConf 
> to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. 
> Increasing the receive buffer size can increase the I/O performance for 
> high-volume transport.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-11 Thread neoremind (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123096#comment-16123096
 ] 

neoremind commented on SPARK-21701:
---

Hi Sean, 
Thanks for your quick response. SO_RCVBUF and SO_SNDBUF are for TCP, and in 
server side, the two parameters are specified in
{code:java}
org.apache.spark.network.server.TransportServer
{code}
through SparkConf. If server side specify a bigger SO_RCVBUF size, while client 
does not have any place to set the corresponding param (although we can set the 
value in OS, if no set, use default value), then peers may set a relatively 
small sliding window size for later communication. Thus the param set in 
SparkConf in server side is useless and does not take effect as expected. To 
achieve consistency in both client and server side, enable client to get these 
params from SparkConf would make sense. Moreover, due to the fact that spark is 
not high throughput and performant C/S service system, it should not to be a 
big problem, so I set the ticket to improvement label and lower priority.

In a word, my point is to keep a entrance to the outside world to set these 
params in client side and keep consistent with how server side specifies these 
params.

Thanks

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: neoremind
>Priority: Trivial
>
> For TransportClientFactory class, there are no params derived from SparkConf 
> to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. 
> Increasing the receive buffer size can increase the I/O performance for 
> high-volume transport.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21703) Why RPC message are transferred with header and body separately in TCP frame

2017-08-11 Thread neoremind (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123071#comment-16123071
 ] 

neoremind edited comment on SPARK-21703 at 8/11/17 9:18 AM:


Thanks Sean for guiding me to the right place. Here's the link: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-CORE-Why-RPC-message-are-transferred-with-header-and-body-separately-in-TCP-frame-td29054.html


was (Author: xu.zhang):
Thanks Sean to guide me to the right place. Here's the 
[link](http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-CORE-Why-RPC-message-are-transferred-with-header-and-body-separately-in-TCP-frame-td29054.html).

> Why RPC message are transferred with header and body separately in TCP frame
> 
>
> Key: SPARK-21703
> URL: https://issues.apache.org/jira/browse/SPARK-21703
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: neoremind
>Priority: Trivial
>  Labels: RPC
>
> After looking into the details of how Spark leverages Netty, I have one 
> question. Typically an RPC message wire format would have a header+payload 
> structure, and Netty uses a TransportFrameDecoder to determine a complete 
> message from the remote peer. But after using the Wireshark sniffing tool, I 
> found that the messages are sent with the header and the body separately. 
> Although this works fine, for the underlying TCP there would be ACK segments 
> sent back to acknowledge, so there might be a little redundancy, since we 
> could send them together and the header is usually very small. 
> The main reason can be found in the MessageWithHeader class, since the 
> transferTo method writes two times, once for the header and once for the body.
> Could someone help me understand the background story on why it is 
> implemented this way?  Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21703) Why RPC message are transferred with header and body separately in TCP frame

2017-08-11 Thread neoremind (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123071#comment-16123071
 ] 

neoremind commented on SPARK-21703:
---

Thanks Sean for guiding me to the right place. Here's the 
[link](http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-CORE-Why-RPC-message-are-transferred-with-header-and-body-separately-in-TCP-frame-td29054.html).

> Why RPC message are transferred with header and body separately in TCP frame
> 
>
> Key: SPARK-21703
> URL: https://issues.apache.org/jira/browse/SPARK-21703
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: neoremind
>Priority: Trivial
>  Labels: RPC
>
> After looking into the details of how Spark leverages Netty, I have one 
> question. Typically an RPC message wire format would have a header+payload 
> structure, and Netty uses a TransportFrameDecoder to determine a complete 
> message from the remote peer. But after using the Wireshark sniffing tool, I 
> found that the messages are sent with the header and the body separately. 
> Although this works fine, for the underlying TCP there would be ACK segments 
> sent back to acknowledge, so there might be a little redundancy, since we 
> could send them together and the header is usually very small. 
> The main reason can be found in the MessageWithHeader class, since the 
> transferTo method writes two times, once for the header and once for the body.
> Could someone help me understand the background story on why it is 
> implemented this way?  Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UTF from HDFS

2017-08-11 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123063#comment-16123063
 ] 

Steve Loughran commented on SPARK-21697:


# I don't see anything which can be done in HDFS here; the problem is in the 
libraries below it.
# Recent Hadoop releases use SLF4J, which *may* not have this problem. But as 
log4j still looks for log4j.properties, and as dependent libraries may use 
commons-logging, there's no guarantee of that.

What to do?
# Classloader games: bring up the log infrastructure, then add the HDFS JARs 
to the classpath. This probably requires knowing what to force in before 
anything else, e.g. using the new classpath, stat every JAR path, then inject 
them into the classpath. Risky, as nobody really understands classpaths.
# Force-download the remote artifact to the local temp filesystem before 
execution, as YARN itself does. Do it for HDFS, WASB, S3x, ..., all 
filesystems known to Hadoop FS. (Side issue: is there a way to enumerate 
these? Probably not, except by merging the list of service-discovered entries 
and those with an {{fs.SCHEMA.impl}} entry.)

I think #2 is potentially the simplest and so the most viable. It's not quite 
as elegant as saying "this is a supported URL you can directly use in the 
classpath", but it's the one that is going to avoid these problems; a rough 
sketch of that copy step follows below.
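
A rough Scala sketch of that option, assuming the artifact lives on a 
Hadoop-compatible filesystem; the method name and temp-file naming are made up 
for illustration:

{code}
import java.io.File
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Copy a remote JAR down to the local temp filesystem before putting it on
// the classpath, so logging never has to initialise against a remote scheme.
def localize(remoteJar: URI, conf: Configuration): File = {
  val fs = FileSystem.get(remoteJar, conf)
  val local = File.createTempFile("remote-udf-", ".jar")
  fs.copyToLocalFile(new Path(remoteJar), new Path(local.getAbsolutePath))
  local
}
{code}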




> NPE & ExceptionInInitializerError trying to load UTF from HDFS
> --
>
> Key: SPARK-21697
> URL: https://issues.apache.org/jira/browse/SPARK-21697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Spark Client mode, Hadoop 2.6.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Reported on [the 
> PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for 
> SPARK-12868: trying to load a UDF from HDFS triggers an 
> {{ExceptionInInitializerError}}, caused by an NPE which should only happen if 
> the commons-logging {{LOG}} logger is null.
> Hypothesis: the commons-logging scan for {{commons-logging.properties}} runs 
> against a classpath that includes the HDFS JAR; this triggers a download of 
> the JAR, which needs commons-logging to be forced in, and, as that is not 
> initialised yet, NPEs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects in optimizer

2017-08-11 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Description: 
Currently the optimizer does a lot of special handling for non-deterministic 
projects and filters, but it is not good enough. This patch adds a new special 
case for non-deterministic projects, so that the optimizer only reads the 
fields the user actually needs for such projects.
For example, when the project fields contain a nondeterministic function (the 
rand function), the generated executedPlan is:

*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table

HiveTableScan will read all the fields of the table, but only ‘d004’ is 
needed, which hurts task performance.


  was:
Currently the optimizer does a lot of special handling for non-deterministic 
projects and filters, but it is not good enough. This patch adds a new special 
case for non-deterministic projects and filters, so that the optimizer only 
reads the fields the user actually needs for such projects and filters.
For example, when the project fields contain a nondeterministic function (the 
rand function), the generated executedPlan is:

*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table

HiveTableScan will read all the fields of the table, but only ‘d004’ is 
needed, which hurts task performance.



> Improvement a special case for non-deterministic projects in optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently the optimizer does a lot of special handling for non-deterministic 
> projects and filters, but it is not good enough. This patch adds a new 
> special case for non-deterministic projects, so that the optimizer only 
> reads the fields the user actually needs for such projects.
> For example, when the project fields contain a nondeterministic function 
> (the rand function), the generated executedPlan is:
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> HiveTableScan will read all the fields of the table, but only ‘d004’ is 
> needed, which hurts task performance.
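
To make the pattern above concrete, a small hypothetical DataFrame equivalent 
(table and column names are illustrative); only d004 is referenced, yet the 
scan may still list every column when the project is non-deterministic:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, floor, rand, sum}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Only d004 is referenced, but the non-deterministic rand() in the project
// can prevent the optimizer from pruning the other columns of the scan.
val df = spark.table("XXX_database.XXX_table")
  .select(col("d004").as("id"), floor(rand() * 1.0).as("k"))
  .groupBy("k")
  .agg(sum(col("id").cast("bigint")))

df.explain() // check whether HiveTableScan lists all columns or only d004
{code}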



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects in optimizer

2017-08-11 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Summary: Improvement a special case for non-deterministic projects in 
optimizer  (was: Improvement a special case for non-deterministic projects and 
filters in optimizer)

> Improvement a special case for non-deterministic projects in optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently the optimizer does a lot of special handling for non-deterministic 
> projects and filters, but it is not good enough. This patch adds a new 
> special case for non-deterministic projects and filters, so that the 
> optimizer only reads the fields the user actually needs for such projects 
> and filters.
> For example, when the project fields contain a nondeterministic function 
> (the rand function), the generated executedPlan is:
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> HiveTableScan will read all the fields of the table, but only ‘d004’ is 
> needed, which hurts task performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-11 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123050#comment-16123050
 ] 

Liang-Chi Hsieh edited comment on SPARK-21657 at 8/11/17 8:55 AM:
--

Maybe not very related to this issue, but I'm exploring Generate-related code 
to get a hint about it. I'm curious why we still don't enable codegen for 
Generate. [~hvanhovell] Maybe you know why it is disabled? Thanks.


was (Author: viirya):
Maybe not very related to this issue, but I'm exploring Generate-related code. 
I'm curious why we still don't enable codegen for Generate. [~hvanhovell] 
Maybe you know why it is disabled? Thanks.

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up how many nested elements are in 
> each record, but scales down the number of records in the table by the same 
> factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At scaling 50,000 it took 7 hours to explode the nested collections (\!) of 
> 8k records.
> After 1000 elements in a nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-11 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123050#comment-16123050
 ] 

Liang-Chi Hsieh commented on SPARK-21657:
-

Maybe not very related to this issue, but I'm exploring Generate-related code. 
I'm curious why we still don't enable codegen for Generate. [~hvanhovell] 
Maybe you know why it is disabled? Thanks.

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See the attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables with the same total number of 
> records across all nested collections (see the `scaling` variable in the 
> loops). The `scaling` variable scales up how many nested elements are in 
> each record, but scales down the number of records in the table by the same 
> factor, so the total number of records stays the same.
> Time grows exponentially (notice the log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At scaling 50,000 it took 7 hours to explode the nested collections (\!) of 
> 8k records.
> After 1000 elements in a nested collection, time grows exponentially.
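
For reference, a hypothetical Scala re-creation of the benchmark shape the 
description outlines (column names and sizes are illustrative; this is not the 
attached script):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hold the total number of nested elements constant while varying how many
// elements each record carries, then time the explode at each scaling factor.
for (scaling <- Seq(10, 100, 1000, 10000)) {
  val records = 1000000 / scaling
  val df = (0 until records)
    .map(i => (i, Seq.fill(scaling)(i.toLong)))
    .toDF("individ", "amft")
  val t0 = System.nanoTime()
  df.select(col("individ"), explode(col("amft"))).cache().count()
  println(s"scaling=$scaling took ${(System.nanoTime() - t0) / 1e9}s")
}
{code}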



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-08-11 Thread srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123029#comment-16123029
 ] 

srinivasan edited comment on SPARK-19372 at 8/11/17 8:29 AM:
-

Hi [~kiszk], the fix does not work in 2.2.0 for 

select * from temp where not( Field1 = '' and  Field2 = '' and  Field3 = '' and 
 Field4 = '' and  Field5 = '' and  BLANK_5 = '' and  Field7 = '' and  Field8 = 
'' and  Field9 = '' and  Field10 = '' and  Field11 = '' and  Field12 = '' and  
Field13 = '' and  Field14 = '' and  Field15 = '' and  Field16 = '' and  Field17 
= '' and  Field18 = '' and  Field19 = '' and  Field20 = '' and  Field21 = '' 
and  Field22 = '' and  Field23 = '' and  Field24 = '' and  Field25 = '' and  
Field26 = '' and  Field27 = '' and  Field28 = '' and  Field29 = '' and  Field30 
= '' and  Field31 = '' and  Field32 = '' and  Field33 = '' and  Field34 = '' 
and  Field35 = '' and  Field36 = '' and  Field37 = '' and  Field38 = '' and  
Field39 = '' and  Field40 = '' and  Field41 = '' and  Field42 = '' and  Field43 
= '' and  Field44 = '' and  Field45 = '' and  Field46 = '' and  Field47 = '' 
and  Field48 = '' and  Field49 = '' and  Field50 = '' and  Field51 = '' and  
Field52 = '' and  Field53 = '' and  Field54 = '' and  Field55 = '' and  Field56 
= '' and  Field57 = '' and  Field58 = '' and  Field59 = '' and  Field60 = '' 
and  Field61 = '' and  Field62 = '' and  Field63 = '' and  Field64 = '' and  
Field65 = '' and  Field66 = '' and  Field67 = '' and  Field68 = '' and  Field69 
= '' and  Field70 = '' and  Field71 = '' and  Field72 = '' and  Field73 = '' 
and  Field74 = '' and  Field75 = '' and  Field76 = '' and  Field77 = '' and  
Field78 = '' and  Field79 = '' and  Field80 = '' and  Field81 = '' and  Field82 
= '' and  Field83 = '' and  Field84 = '' and  Field85 = '' and  Field86 = '' 
and  Field87 = '' and  Field88 = '' and  Field89 = '' and  Field90 = '' and  
Field91 = '' and  Field92 = '' and  Field93 = '' and  Field94 = '' and  Field95 
= '' and  Field96 = '' and  Field97 = '' and  Field98 = '' and  Field99 = '' 
and  Field100 = '' and  Field101 = '' and  Field102 = '' and  Field103 = '' and 
 Field104 = '' and  Field105 = '' and  Field106 = '' and  Field107 = '' and  
Field108 = '' and  Field109 = '' and  Field110 = '' and  Field111 = '' and  
Field112 = '' and  Field113 = '' and  Field114 = '' and  Field115 = '' and  
Field116 = '' and  Field117 = '' and  Field118 = '' and  Field119 = '' and  
Field120 = '' and  Field121 = '' and  Field122 = '' and  Field123 = '' and  
Field124 = '' and  Field125 = '' and  Field126 = '' and  Field127 = '' and  
Field128 = '' and  Field129 = '' and  Field130 = '' and  Field131 = '' and  
Field132 = '' and  Field133 = '' and  Field134 = '' and  Field135 = '' and  
Field136 = '' and  Field137 = '' and  Field138 = '' and  Field139 = '' and  
Field140 = '' and  Field141 = '' and  Field142 = '' and  Field143 = '' and  
Field144 = '' and  Field145 = '' and  Field146 = '' and  Field147 = '' and  
Field148 = '' and  Field149 = '' and  Field150 = '' and  Field151 = '' and  
Field152 = '' and  Field153 = '' and  Field154 = '' and  Field155 = '' and  
Field156 = '' and  Field157 = '' and  Field158 = '' and  Field159 = '' and  
Field160 = '' and  Field161 = '' and  Field162 = '' and  Field163 = '' and  
Field164 = '' and  Field165 = '' and  Field166 = '' and  Field167 = '' and  
Field168 = '' and  Field169 = '' and  Field170 = '' and  Field171 = '' and  
Field172 = '' and  Field173 = '' and  Field174 = '' and  Field175 = '' and  
Field176 = '' and  Field177 = '' and  Field178 = '' and  Field179 = '' and  
Field180 = '' and  Field181 = '' and  Field182 = '' and  Field183 = '' and  
Field184 = '' and  Field185 = '' and  Field186 = '' and  Field187 = '' and  
Field188 = '' and  Field189 = '' and  Field190 = '' and  Field191 = '' and  
Field192 = '' and  Field193 = '' and  Field194 = '' and  Field195 = '' and  
Field196 = '' and  Field197 = '' and  Field198 = '' and  Field199 = '' and  
Field200 = '' and  Field201 = '' and  Field202 = '' and  Field203 = '' and  
Field204 = '' and  Field205 = '' and  Field206 = '' and  Field207 = '' and  
Field208 = '' and  Field209 = '' and  Field210 = '' and  Field211 = '' and  
Field212 = '' and  Field213 = '' and  Field214 = '' and  Field215 = '' and  
Field216 = '' and  Field217 = '' and  Field218 = '' and  Field219 = '' and  
Field220 = '' and  Field221 = '' and  Field222 = '' and  Field223 = '' and  
Field224 = '' and  Field225 = '' and  Field226 = '' and  Field227 = '' and  
Field228 = '' and  Field229 = '' and  Field230 = '' and  Field231 = '' and  
Field232 = '' and  Field233 = '' and  Field234 = '' and  Field235 = '' and  
Field236 = '' and  Field237 = '' and  Field238 = '' and  Field239 = '' and  
Field240 = '' and  Field241 = '' and  Field242 = '' and  Field243 = '' and  
Field244 = '' and  Field245 = '' and  Field246 = '' and  Field247 

[jira] [Comment Edited] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-08-11 Thread srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123029#comment-16123029
 ] 

srinivasan edited comment on SPARK-19372 at 8/11/17 8:28 AM:
-

Hi [~kiszk], the fix does not work in 2.2.0 for 

select * from temp where not( Field1 = '' and  Field2 = '' and  Field3 = '' and 
 Field4 = '' and  Field5 = '' and  BLANK_5 = '' and  Field7 = '' and  Field8 = 
'' and  Field9 = '' and  Field10 = '' and  Field11 = '' and  Field12 = '' and  
Field13 = '' and  Field14 = '' and  Field15 = '' and  Field16 = '' and  Field17 
= '' and  Field18 = '' and  Field19 = '' and  Field20 = '' and  Field21 = '' 
and  Field22 = '' and  Field23 = '' and  Field24 = '' and  Field25 = '' and  
Field26 = '' and  Field27 = '' and  Field28 = '' and  Field29 = '' and  Field30 
= '' and  Field31 = '' and  Field32 = '' and  Field33 = '' and  Field34 = '' 
and  Field35 = '' and  Field36 = '' and  Field37 = '' and  Field38 = '' and  
Field39 = '' and  Field40 = '' and  Field41 = '' and  Field42 = '' and  Field43 
= '' and  Field44 = '' and  Field45 = '' and  Field46 = '' and  Field47 = '' 
and  Field48 = '' and  Field49 = '' and  Field50 = '' and  Field51 = '' and  
Field52 = '' and  Field53 = '' and  Field54 = '' and  Field55 = '' and  Field56 
= '' and  Field57 = '' and  Field58 = '' and  Field59 = '' and  Field60 = '' 
and  Field61 = '' and  Field62 = '' and  Field63 = '' and  Field64 = '' and  
Field65 = '' and  Field66 = '' and  Field67 = '' and  Field68 = '' and  Field69 
= '' and  Field70 = '' and  Field71 = '' and  Field72 = '' and  Field73 = '' 
and  Field74 = '' and  Field75 = '' and  Field76 = '' and  Field77 = '' and  
Field78 = '' and  Field79 = '' and  Field80 = '' and  Field81 = '' and  Field82 
= '' and  Field83 = '' and  Field84 = '' and  Field85 = '' and  Field86 = '' 
and  Field87 = '' and  Field88 = '' and  Field89 = '' and  Field90 = '' and  
Field91 = '' and  Field92 = '' and  Field93 = '' and  Field94 = '' and  Field95 
= '' and  Field96 = '' and  Field97 = '' and  Field98 = '' and  Field99 = '' 
and  Field100 = '' and  Field101 = '' and  Field102 = '' and  Field103 = '' and 
 Field104 = '' and  Field105 = '' and  Field106 = '' and  Field107 = '' and  
Field108 = '' and  Field109 = '' and  Field110 = '' and  Field111 = '' and  
Field112 = '' and  Field113 = '' and  Field114 = '' and  Field115 = '' and  
Field116 = '' and  Field117 = '' and  Field118 = '' and  Field119 = '' and  
Field120 = '' and  Field121 = '' and  Field122 = '' and  Field123 = '' and  
Field124 = '' and  Field125 = '' and  Field126 = '' and  Field127 = '' and  
Field128 = '' and  Field129 = '' and  Field130 = '' and  Field131 = '' and  
Field132 = '' and  Field133 = '' and  Field134 = '' and  Field135 = '' and  
Field136 = '' and  Field137 = '' and  Field138 = '' and  Field139 = '' and  
Field140 = '' and  Field141 = '' and  Field142 = '' and  Field143 = '' and  
Field144 = '' and  Field145 = '' and  Field146 = '' and  Field147 = '' and  
Field148 = '' and  Field149 = '' and  Field150 = '' and  Field151 = '' and  
Field152 = '' and  Field153 = '' and  Field154 = '' and  Field155 = '' and  
Field156 = '' and  Field157 = '' and  Field158 = '' and  Field159 = '' and  
Field160 = '' and  Field161 = '' and  Field162 = '' and  Field163 = '' and  
Field164 = '' and  Field165 = '' and  Field166 = '' and  Field167 = '' and  
Field168 = '' and  Field169 = '' and  Field170 = '' and  Field171 = '' and  
Field172 = '' and  Field173 = '' and  Field174 = '' and  Field175 = '' and  
Field176 = '' and  Field177 = '' and  Field178 = '' and  Field179 = '' and  
Field180 = '' and  Field181 = '' and  Field182 = '' and  Field183 = '' and  
Field184 = '' and  Field185 = '' and  Field186 = '' and  Field187 = '' and  
Field188 = '' and  Field189 = '' and  Field190 = '' and  Field191 = '' and  
Field192 = '' and  Field193 = '' and  Field194 = '' and  Field195 = '' and  
Field196 = '' and  Field197 = '' and  Field198 = '' and  Field199 = '' and  
Field200 = '' and  Field201 = '' and  Field202 = '' and  Field203 = '' and  
Field204 = '' and  Field205 = '' and  Field206 = '' and  Field207 = '' and  
Field208 = '' and  Field209 = '' and  Field210 = '' and  Field211 = '' and  
Field212 = '' and  Field213 = '' and  Field214 = '' and  Field215 = '' and  
Field216 = '' and  Field217 = '' and  Field218 = '' and  Field219 = '' and  
Field220 = '' and  Field221 = '' and  Field222 = '' and  Field223 = '' and  
Field224 = '' and  Field225 = '' and  Field226 = '' and  Field227 = '' and  
Field228 = '' and  Field229 = '' and  Field230 = '' and  Field231 = '' and  
Field232 = '' and  Field233 = '' and  Field234 = '' and  Field235 = '' and  
Field236 = '' and  Field237 = '' and  Field238 = '' and  Field239 = '' and  
Field240 = '' and  Field241 = '' and  Field242 = '' and  Field243 = '' and  
Field244 = '' and  Field245 = '' and  Field246 = '' and  Field247 

[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-08-11 Thread srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123029#comment-16123029
 ] 

srinivasan commented on SPARK-19372:


Hi [~kiszk], the fix does not work for 
select * from temp where not( Field1 = '' and  Field2 = '' and  Field3 = '' and 
 Field4 = '' and  Field5 = '' and  BLANK_5 = '' and  Field7 = '' and  Field8 = 
'' and  Field9 = '' and  Field10 = '' and  Field11 = '' and  Field12 = '' and  
Field13 = '' and  Field14 = '' and  Field15 = '' and  Field16 = '' and  Field17 
= '' and  Field18 = '' and  Field19 = '' and  Field20 = '' and  Field21 = '' 
and  Field22 = '' and  Field23 = '' and  Field24 = '' and  Field25 = '' and  
Field26 = '' and  Field27 = '' and  Field28 = '' and  Field29 = '' and  Field30 
= '' and  Field31 = '' and  Field32 = '' and  Field33 = '' and  Field34 = '' 
and  Field35 = '' and  Field36 = '' and  Field37 = '' and  Field38 = '' and  
Field39 = '' and  Field40 = '' and  Field41 = '' and  Field42 = '' and  Field43 
= '' and  Field44 = '' and  Field45 = '' and  Field46 = '' and  Field47 = '' 
and  Field48 = '' and  Field49 = '' and  Field50 = '' and  Field51 = '' and  
Field52 = '' and  Field53 = '' and  Field54 = '' and  Field55 = '' and  Field56 
= '' and  Field57 = '' and  Field58 = '' and  Field59 = '' and  Field60 = '' 
and  Field61 = '' and  Field62 = '' and  Field63 = '' and  Field64 = '' and  
Field65 = '' and  Field66 = '' and  Field67 = '' and  Field68 = '' and  Field69 
= '' and  Field70 = '' and  Field71 = '' and  Field72 = '' and  Field73 = '' 
and  Field74 = '' and  Field75 = '' and  Field76 = '' and  Field77 = '' and  
Field78 = '' and  Field79 = '' and  Field80 = '' and  Field81 = '' and  Field82 
= '' and  Field83 = '' and  Field84 = '' and  Field85 = '' and  Field86 = '' 
and  Field87 = '' and  Field88 = '' and  Field89 = '' and  Field90 = '' and  
Field91 = '' and  Field92 = '' and  Field93 = '' and  Field94 = '' and  Field95 
= '' and  Field96 = '' and  Field97 = '' and  Field98 = '' and  Field99 = '' 
and  Field100 = '' and  Field101 = '' and  Field102 = '' and  Field103 = '' and 
 Field104 = '' and  Field105 = '' and  Field106 = '' and  Field107 = '' and  
Field108 = '' and  Field109 = '' and  Field110 = '' and  Field111 = '' and  
Field112 = '' and  Field113 = '' and  Field114 = '' and  Field115 = '' and  
Field116 = '' and  Field117 = '' and  Field118 = '' and  Field119 = '' and  
Field120 = '' and  Field121 = '' and  Field122 = '' and  Field123 = '' and  
Field124 = '' and  Field125 = '' and  Field126 = '' and  Field127 = '' and  
Field128 = '' and  Field129 = '' and  Field130 = '' and  Field131 = '' and  
Field132 = '' and  Field133 = '' and  Field134 = '' and  Field135 = '' and  
Field136 = '' and  Field137 = '' and  Field138 = '' and  Field139 = '' and  
Field140 = '' and  Field141 = '' and  Field142 = '' and  Field143 = '' and  
Field144 = '' and  Field145 = '' and  Field146 = '' and  Field147 = '' and  
Field148 = '' and  Field149 = '' and  Field150 = '' and  Field151 = '' and  
Field152 = '' and  Field153 = '' and  Field154 = '' and  Field155 = '' and  
Field156 = '' and  Field157 = '' and  Field158 = '' and  Field159 = '' and  
Field160 = '' and  Field161 = '' and  Field162 = '' and  Field163 = '' and  
Field164 = '' and  Field165 = '' and  Field166 = '' and  Field167 = '' and  
Field168 = '' and  Field169 = '' and  Field170 = '' and  Field171 = '' and  
Field172 = '' and  Field173 = '' and  Field174 = '' and  Field175 = '' and  
Field176 = '' and  Field177 = '' and  Field178 = '' and  Field179 = '' and  
Field180 = '' and  Field181 = '' and  Field182 = '' and  Field183 = '' and  
Field184 = '' and  Field185 = '' and  Field186 = '' and  Field187 = '' and  
Field188 = '' and  Field189 = '' and  Field190 = '' and  Field191 = '' and  
Field192 = '' and  Field193 = '' and  Field194 = '' and  Field195 = '' and  
Field196 = '' and  Field197 = '' and  Field198 = '' and  Field199 = '' and  
Field200 = '' and  Field201 = '' and  Field202 = '' and  Field203 = '' and  
Field204 = '' and  Field205 = '' and  Field206 = '' and  Field207 = '' and  
Field208 = '' and  Field209 = '' and  Field210 = '' and  Field211 = '' and  
Field212 = '' and  Field213 = '' and  Field214 = '' and  Field215 = '' and  
Field216 = '' and  Field217 = '' and  Field218 = '' and  Field219 = '' and  
Field220 = '' and  Field221 = '' and  Field222 = '' and  Field223 = '' and  
Field224 = '' and  Field225 = '' and  Field226 = '' and  Field227 = '' and  
Field228 = '' and  Field229 = '' and  Field230 = '' and  Field231 = '' and  
Field232 = '' and  Field233 = '' and  Field234 = '' and  Field235 = '' and  
Field236 = '' and  Field237 = '' and  Field238 = '' and  Field239 = '' and  
Field240 = '' and  Field241 = '' and  Field242 = '' and  Field243 = '' and  
Field244 = '' and  Field245 = '' and  Field246 = '' and  Field247 = '' and  
Field248 = '' and  Field249 = '' and  Field250 = 

[jira] [Resolved] (SPARK-21600) The description of "this requires spark.shuffle.service.enabled to be set" for the spark.dynamicAllocation.enabled configuration item is not clear

2017-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21600.
---
Resolution: Won't Fix

This is not worth the review overhead.

> The description of "this requires spark.shuffle.service.enabled to be set" 
> for the spark.dynamicAllocation.enabled configuration item is not clear
> --
>
> Key: SPARK-21600
> URL: https://issues.apache.org/jira/browse/SPARK-21600
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Trivial
>
> The description of "this requires spark.shuffle.service.enabled to be set" 
> for the spark.dynamicAllocation.enabled configuration item is not clear. It 
> does not say whether spark.shuffle.service.enabled should be set to true or 
> false, so the user has to guess. The change here stresses that 
> spark.shuffle.service.enabled must be set to true.
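
For illustration, a minimal sketch of the two settings being enabled together, 
which is the point the description wants spelled out:

{code}
import org.apache.spark.SparkConf

// Dynamic allocation only works with the external shuffle service enabled,
// i.e. spark.shuffle.service.enabled must be explicitly set to true.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
{code}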



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21700) How can I get the MetricsSystem information

2017-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21700.
---
Resolution: Invalid

> How can I get the MetricsSystem information
> ---
>
> Key: SPARK-21700
> URL: https://issues.apache.org/jira/browse/SPARK-21700
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: LiuXiangyu
>
> I want to get the information that is shown on the Spark Web UI, but I don't 
> want to write a scraper to pull it from the website. I know this information 
> comes from the MetricsSystem; is there any way I can use the MetricsSystem 
> in my program to get those metrics?
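
One way to get much of this without scraping HTML is the monitoring REST API 
that backs the Web UI; a minimal sketch, assuming a driver UI on the default 
host and port:

{code}
import scala.io.Source

// Fetch the application list as JSON from the driver UI's REST endpoint;
// job, stage and executor details hang off the same /api/v1 path.
val json = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(json)
{code}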



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123004#comment-16123004
 ] 

Sean Owen commented on SPARK-21701:
---

I think you'd have to show evidence that it's worth the extra buffer size and 
how it impacts performance.

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: neoremind
>Priority: Trivial
>
> In the TransportClientFactory class there are no parameters derived from 
> SparkConf to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for 
> Netty. Increasing the receive buffer size can improve I/O performance for 
> high-volume transport.
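
A minimal sketch of what the proposal amounts to on a Netty client bootstrap 
(illustrative only, not Spark's actual TransportClientFactory code; the buffer 
sizes would presumably come from SparkConf):

{code}
import io.netty.bootstrap.Bootstrap
import io.netty.channel.ChannelOption

// Apply configurable socket buffer sizes to the client channel options.
def applyBufferSizes(bootstrap: Bootstrap, sendBuf: Int, recvBuf: Int): Bootstrap =
  bootstrap
    .option[Integer](ChannelOption.SO_SNDBUF, Int.box(sendBuf))
    .option[Integer](ChannelOption.SO_RCVBUF, Int.box(recvBuf))

// e.g. applyBufferSizes(new Bootstrap(), 256 * 1024, 256 * 1024)
{code}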



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21703) Why RPC message are transferred with header and body separately in TCP frame

2017-08-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21703.
---
Resolution: Invalid

This belongs on the mailing list not JIRA

> Why RPC message are transferred with header and body separately in TCP frame
> 
>
> Key: SPARK-21703
> URL: https://issues.apache.org/jira/browse/SPARK-21703
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: neoremind
>Priority: Trivial
>  Labels: RPC
>
> After looking into the details of how Spark leverages Netty, I have one 
> question. Typically an RPC message wire format has a header+payload 
> structure, and Netty uses a TransportFrameDecoder to determine where a 
> complete message from the remote peer ends. But after sniffing the traffic 
> with Wireshark, I found that each message is sent as a header followed 
> separately by the body. Although this works fine, the underlying TCP 
> connection sends ACK segments back for each write, so there is a little 
> redundancy, since the two could be sent together and the header is usually 
> very small. 
> The reason can be found in the MessageWithHeader class, whose transferTo 
> method writes the header and the body in two separate writes.
> Could someone help me understand the background on why it is implemented 
> this way? Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


