[jira] [Comment Edited] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException

2019-03-03 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783074#comment-16783074
 ] 

Gabor Somogyi edited comment on SPARK-26727 at 3/4/19 7:55 AM:
---

[~Udbhav Agrawal] I was dealing with "failOnDataLoss=false should not return 
duplicated records: v1" and was doing some debugging on the latest master 
branch.
The test creates a "DontFailOnDataLoss" table which should be dropped at the 
end, but after a while the test constantly failed with an exception stating 
that the table already exists.
I tried to overcome this issue by setting an overwrite flag, but then further, 
different exceptions appeared. At that point I felt this wasn't leading to my 
goal, so I just removed the directory.
Hope this info helps.


was (Author: gsomogyi):
I was dealing with "failOnDataLoss=false should not return duplicated records: 
v1" and was doing some debugging on the latest master branch.
The test creates "DontFailOnDataLoss" table which should be dropped at the end 
but after a while the test failed constantly with an exception which stated 
that the table already exists.
I've tried to overcome this issue by setting an overwrite flag but futher 
different exceptions arrived. Here I've felt that doesn't lead to my goal so 
just removed the directory.
Hope this info helps.

> CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
> ---
>
> Key: SPARK-26727
> URL: https://issues.apache.org/jira/browse/SPARK-26727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Srinivas Yarra
>Priority: Major
>
> We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT ... FROM ..." fails with the following exception:
> {code:java}
> // code placeholder
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view '' already exists in database 'default'; at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> org.apache.spark.sql.Dataset.<init>(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at 
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided
> {code}
> {code}
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW 

[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException

2019-03-03 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783074#comment-16783074
 ] 

Gabor Somogyi commented on SPARK-26727:
---

I was dealing with "failOnDataLoss=false should not return duplicated records: 
v1" and was doing some debugging on the latest master branch.
The test creates "DontFailOnDataLoss" table which should be dropped at the end 
but after a while the test failed constantly with an exception which stated 
that the table already exists.
I've tried to overcome this issue by setting an overwrite flag but futher 
different exceptions arrived. Here I've felt that doesn't lead to my goal so 
just removed the directory.
Hope this info helps.

> CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
> ---
>
> Key: SPARK-26727
> URL: https://issues.apache.org/jira/browse/SPARK-26727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Srinivas Yarra
>Priority: Major
>
> We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT ... FROM ..." fails with the following exception:
> {code:java}
> // code placeholder
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view '' already exists in database 'default'; at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> org.apache.spark.sql.Dataset.<init>(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at 
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided
> {code}
> {code}
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") 
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> 

[jira] [Assigned] (SPARK-27038) Rack resolving takes a long time when initializing TaskSetManager

2019-03-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27038:


Assignee: Apache Spark

> Rack resolving takes a long time when initializing TaskSetManager
> -
>
> Key: SPARK-27038
> URL: https://issues.apache.org/jira/browse/SPARK-27038
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> If a stage is submitted with a large number of tasks, rack resolution takes a 
> long time when initializing the TaskSetManager, because the rack-resolving 
> script is executed in a large number of loop iterations.
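For illustration only (the class and method names below are hypothetical and not 
taken from Spark or Hadoop), a minimal sketch of one way to keep an external 
rack-resolving script from being invoked on every loop iteration: memoize the 
host-to-rack lookup so the expensive call runs at most once per distinct host.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache around an arbitrary host -> rack lookup
// (for example, a call that shells out to a topology script).
final class RackCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> resolver;

    RackCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    // Runs the expensive resolver only for hosts that have not been seen before.
    String rackOf(String host) {
        return cache.computeIfAbsent(host, resolver);
    }
}
{code}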






[jira] [Assigned] (SPARK-27038) Rack resolving takes a long time when initializing TaskSetManager

2019-03-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27038:


Assignee: (was: Apache Spark)

> Rack resolving takes a long time when initializing TaskSetManager
> -
>
> Key: SPARK-27038
> URL: https://issues.apache.org/jira/browse/SPARK-27038
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Lantao Jin
>Priority: Major
>
> If a stage is submitted with a large number of tasks, rack resolution takes a 
> long time when initializing the TaskSetManager, because the rack-resolving 
> script is executed in a large number of loop iterations.






[jira] [Issue Comment Deleted] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-03 Thread Ajith S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-26961:

Comment: was deleted

(was: The problem is that org.apache.spark.util.MutableURLClassLoader (the entire 
classloader) is getting locked. Checking java.lang.ClassLoader:
{code:java}
protected Object getClassLoadingLock(String className) {
    Object lock = this;
    if (parallelLockMap != null) {
        Object newLock = new Object();
        lock = parallelLockMap.putIfAbsent(className, newLock);
        if (lock == null) {
            lock = newLock;
        }
    }
    return lock;
}{code}
Here we see that a new lock object is created for every class being loaded, so 
the classloader itself is not locked as a whole.

The only place I see this happening is via 
{{ClassLoader.registerAsParallelCapable();}}, i.e. in the static block of 
java.net.URLClassLoader:
{code:java}
static {
    sun.misc.SharedSecrets.setJavaNetAccess(
        new sun.misc.JavaNetAccess() {
            public URLClassPath getURLClassPath(URLClassLoader u) {
                return u.ucp;
            }

            public String getOriginalHostName(InetAddress ia) {
                return ia.holder.getOriginalHostName();
            }
        }
    );
    ClassLoader.registerAsParallelCapable();
}{code}
So all subclasses of URLClassLoader will lock the entire classloader for class 
loading, causing this lock.)

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that none of the worker containers could connect to 
> the driver after start and the driver was hanging; using jstack, we found a 
> Java-level deadlock.
>  
> *The jstack output for the deadlock is shown below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.<init>(URL.java:599)
>  at java.net.URL.<init>(URL.java:490)
>  at java.net.URL.<init>(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> 

[jira] [Comment Edited] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-03 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783031#comment-16783031
 ] 

Ajith S edited comment on SPARK-26961 at 3/4/19 6:53 AM:
-

The problem is that org.apache.spark.util.MutableURLClassLoader (the entire 
classloader) is getting locked. Checking java.lang.ClassLoader:
{code:java}
protected Object getClassLoadingLock(String className) {
    Object lock = this;
    if (parallelLockMap != null) {
        Object newLock = new Object();
        lock = parallelLockMap.putIfAbsent(className, newLock);
        if (lock == null) {
            lock = newLock;
        }
    }
    return lock;
}{code}
Here we see that a new lock object is created for every class being loaded, so 
the classloader itself is not locked as a whole.

The only place I see this happening is via 
{{ClassLoader.registerAsParallelCapable();}}, i.e. in the static block of 
java.net.URLClassLoader:
{code:java}
static {
    sun.misc.SharedSecrets.setJavaNetAccess(
        new sun.misc.JavaNetAccess() {
            public URLClassPath getURLClassPath(URLClassLoader u) {
                return u.ucp;
            }

            public String getOriginalHostName(InetAddress ia) {
                return ia.holder.getOriginalHostName();
            }
        }
    );
    ClassLoader.registerAsParallelCapable();
}{code}
So all subclasses of URLClassLoader will lock the entire classloader for class 
loading, causing this lock.
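For illustration (hypothetical class, not Spark code and not part of the JDK 
sources quoted above): {{registerAsParallelCapable()}} only takes effect for a 
loader class that calls it from its own static initializer, and without that 
registration {{getClassLoadingLock()}} falls back to returning the loader 
instance itself as the lock.
{code:java}
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical loader that opts into parallel-capable class loading.
// Without the static registration below, ClassLoader.getClassLoadingLock
// would hand back the loader instance itself as the lock for this subclass.
public class ParallelCapableLoader extends URLClassLoader {
    static {
        // Must be invoked from the subclass's own static initializer.
        ClassLoader.registerAsParallelCapable();
    }

    public ParallelCapableLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }
}
{code}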


was (Author: ajithshetty):
The problem is here org.apache.spark.util.MutableURLClassLoader (entire 
classloader) is getting locked. Checking java.lang.ClassLoader

protected Object getClassLoadingLock(String className) {
 Object lock = this;
 if (parallelLockMap != null) {
 Object newLock = new Object();
 lock = parallelLockMap.putIfAbsent(className, newLock);
 if (lock == null) {
 lock = newLock;
 }
 }
 return lock;
}

Here we see for every loading, a new object is created so it doesn't lock 
entire classloader itself.

The only thing i see this happening is via 

{{ClassLoader.registerAsParallelCapable(); i.e @ static block of 
java.net.URLClassLoader}}

static {
 sun.misc.SharedSecrets.setJavaNetAccess (
 new sun.misc.JavaNetAccess() {
 public URLClassPath getURLClassPath (URLClassLoader u) {
 return u.ucp;
 }

 public String getOriginalHostName(InetAddress ia) {
 return ia.holder.getOriginalHostName();
 }
 }
 );
 *{color:#FF}ClassLoader.registerAsParallelCapable();{color}*
}

So all subclasses of URLClassLoader will lock entire classloader for 
classloading and cause this lock

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that none of the worker containers could connect to 
> the driver after start and the driver was hanging; using jstack, we found a 
> Java-level deadlock.
>  
> *The jstack output for the deadlock is shown below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.<init>(URL.java:599)
>  at java.net.URL.<init>(URL.java:490)
>  at java.net.URL.<init>(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>  at 

[jira] [Comment Edited] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-03 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783031#comment-16783031
 ] 

Ajith S edited comment on SPARK-26961 at 3/4/19 6:54 AM:
-

The problem is that org.apache.spark.util.MutableURLClassLoader (the entire 
classloader) is getting locked. Checking java.lang.ClassLoader:
{code:java}
protected Object getClassLoadingLock(String className) {
    Object lock = this;
    if (parallelLockMap != null) {
        Object newLock = new Object();
        lock = parallelLockMap.putIfAbsent(className, newLock);
        if (lock == null) {
            lock = newLock;
        }
    }
    return lock;
}{code}
Here we see that a new lock object is created for every class being loaded, so 
the classloader itself is not locked as a whole.

The only place I see this happening is via 
{{ClassLoader.registerAsParallelCapable();}}, i.e. in the static block of 
java.net.URLClassLoader:
{code:java}
static {
    sun.misc.SharedSecrets.setJavaNetAccess(
        new sun.misc.JavaNetAccess() {
            public URLClassPath getURLClassPath(URLClassLoader u) {
                return u.ucp;
            }

            public String getOriginalHostName(InetAddress ia) {
                return ia.holder.getOriginalHostName();
            }
        }
    );
    ClassLoader.registerAsParallelCapable();
}{code}
So all subclasses of URLClassLoader will lock the entire classloader for class 
loading, causing this lock.


was (Author: ajithshetty):
The problem is here org.apache.spark.util.MutableURLClassLoader (entire 
classloader) is getting locked. Checking java.lang.ClassLoader
{code:java}
protected Object getClassLoadingLock(String className) {
Object lock = this;
if (parallelLockMap != null) {
Object newLock = new Object();
lock = parallelLockMap.putIfAbsent(className, newLock);
if (lock == null)

{ lock = newLock; }

}
return lock;
}{code}
 

Here we see for every loading, a new object is created so it doesn't lock 
entire classloader itself.

The only thing i see this happening is via 

{{ClassLoader.registerAsParallelCapable(); i.e @ static block of 
java.net.URLClassLoader}}

 

 
{code:java}
static {
sun.misc.SharedSecrets.setJavaNetAccess (
new sun.misc.JavaNetAccess() {
public URLClassPath getURLClassPath (URLClassLoader u)

{ return u.ucp; }

public String getOriginalHostName(InetAddress ia)

{ return ia.holder.getOriginalHostName(); }

}
);
ClassLoader.registerAsParallelCapable();
}{code}
So all subclasses of URLClassLoader will lock entire classloader for 
classloading and cause this lock

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that none of the worker containers could connect to 
> the driver after start and the driver was hanging; using jstack, we found a 
> Java-level deadlock.
>  
> *The jstack output for the deadlock is shown below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.<init>(URL.java:599)
>  at java.net.URL.<init>(URL.java:490)
>  at java.net.URL.<init>(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>  at 

[jira] [Comment Edited] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-03 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783031#comment-16783031
 ] 

Ajith S edited comment on SPARK-26961 at 3/4/19 6:52 AM:
-

The problem is that org.apache.spark.util.MutableURLClassLoader (the entire 
classloader) is getting locked. Checking java.lang.ClassLoader:
{code:java}
protected Object getClassLoadingLock(String className) {
    Object lock = this;
    if (parallelLockMap != null) {
        Object newLock = new Object();
        lock = parallelLockMap.putIfAbsent(className, newLock);
        if (lock == null) {
            lock = newLock;
        }
    }
    return lock;
}{code}
Here we see that a new lock object is created for every class being loaded, so 
the classloader itself is not locked as a whole.

The only place I see this happening is via 
{{ClassLoader.registerAsParallelCapable();}}, i.e. in the static block of 
java.net.URLClassLoader:
{code:java}
static {
    sun.misc.SharedSecrets.setJavaNetAccess(
        new sun.misc.JavaNetAccess() {
            public URLClassPath getURLClassPath(URLClassLoader u) {
                return u.ucp;
            }

            public String getOriginalHostName(InetAddress ia) {
                return ia.holder.getOriginalHostName();
            }
        }
    );
    ClassLoader.registerAsParallelCapable();
}{code}
So all subclasses of URLClassLoader will lock the entire classloader for class 
loading, causing this lock.


was (Author: ajithshetty):
The problem is here org.apache.spark.util.MutableURLClassLoader (entire 
classloader) is getting locked. Checking java.lang.ClassLoader

{{protected Object getClassLoadingLock(String className) {}}
{{ Object lock = this;}}
{{ if (parallelLockMap != null) {}}
{{ Object newLock = new Object();}}
{{ lock = parallelLockMap.putIfAbsent(className, newLock);}}
{{ if (lock == null) {}}
{{ lock = newLock;}}
{{ }}}
{{ }}}
{{ return lock;}}
{{}}}

Here we see for every loading, a new object is created so it doesn't lock 
entire classloader itself.

The only thing i see this happening is via 

{{ClassLoader.registerAsParallelCapable(); i.e @ static block of 
java.net.URLClassLoader}}

{{static {}}
{{ sun.misc.SharedSecrets.setJavaNetAccess (}}
{{ new sun.misc.JavaNetAccess() {}}
{{ public URLClassPath getURLClassPath (URLClassLoader u) {}}
{{ return u.ucp;}}
{{ }}}

{{ public String getOriginalHostName(InetAddress ia) {}}
{{ return ia.holder.getOriginalHostName();}}
{{ }}}
{{ }}}
{{ );}}
{{ *{color:#FF}ClassLoader.registerAsParallelCapable();{color}*}}
{{}}}

So all subclasses of URLClassLoader will lock entire classloader for 
classloading and cause this lock

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that none of the worker containers could connect to 
> the driver after start and the driver was hanging; using jstack, we found a 
> Java-level deadlock.
>  
> *The jstack output for the deadlock is shown below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.<init>(URL.java:599)
>  at java.net.URL.<init>(URL.java:490)
>  at java.net.URL.<init>(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at 

[jira] [Comment Edited] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-03 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783031#comment-16783031
 ] 

Ajith S edited comment on SPARK-26961 at 3/4/19 6:51 AM:
-

The problem is that org.apache.spark.util.MutableURLClassLoader (the entire 
classloader) is getting locked. Checking java.lang.ClassLoader:
{code:java}
protected Object getClassLoadingLock(String className) {
    Object lock = this;
    if (parallelLockMap != null) {
        Object newLock = new Object();
        lock = parallelLockMap.putIfAbsent(className, newLock);
        if (lock == null) {
            lock = newLock;
        }
    }
    return lock;
}{code}
Here we see that a new lock object is created for every class being loaded, so 
the classloader itself is not locked as a whole.

The only place I see this happening is via 
{{ClassLoader.registerAsParallelCapable();}}, i.e. in the static block of 
java.net.URLClassLoader:
{code:java}
static {
    sun.misc.SharedSecrets.setJavaNetAccess(
        new sun.misc.JavaNetAccess() {
            public URLClassPath getURLClassPath(URLClassLoader u) {
                return u.ucp;
            }

            public String getOriginalHostName(InetAddress ia) {
                return ia.holder.getOriginalHostName();
            }
        }
    );
    ClassLoader.registerAsParallelCapable();
}{code}
So all subclasses of URLClassLoader will lock the entire classloader for class 
loading, causing this lock.


was (Author: ajithshetty):
The problem is here org.apache.spark.util.MutableURLClassLoader (entire 
classloader) is getting locked. Checking java.lang.ClassLoader

{{protected Object getClassLoadingLock(String className) {}}
{{ Object lock = this;}}
{{ if (parallelLockMap != null) {}}
{{ Object newLock = new Object();}}
{{ lock = parallelLockMap.putIfAbsent(className, newLock);}}
{{ if (lock == null)}}{{{ lock = newLock; }}}{{}}}
{{ return lock;}}
{{ }}}

Here we see for every loading, a new object is created so it doesn't lock 
entire classloader itself.

The only thing i see this happening is via 

{{ClassLoader.registerAsParallelCapable(); i.e @ static block of 
java.net.URLClassLoader}}

{{static {}}
{{ sun.misc.SharedSecrets.setJavaNetAccess (}}
{{ new sun.misc.JavaNetAccess() {}}
{{ public URLClassPath getURLClassPath (URLClassLoader u) {}}
{{ return u.ucp;}}
{{ }}}

{{ public String getOriginalHostName(InetAddress ia) {}}
{{ return ia.holder.getOriginalHostName();}}
{{ }}}
{{ }}}
{{ );}}
{{ *{color:#FF}ClassLoader.registerAsParallelCapable();{color}*}}
{{}}}

So all subclasses of URLClassLoader will lock entire classloader for 
classloading and cause this lock

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that none of the worker containers could connect to 
> the driver after start and the driver was hanging; using jstack, we found a 
> Java-level deadlock.
>  
> *The jstack output for the deadlock is shown below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.<init>(URL.java:599)
>  at java.net.URL.<init>(URL.java:490)
>  at java.net.URL.<init>(URL.java:439)
>  at 

[jira] [Commented] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-03-03 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783031#comment-16783031
 ] 

Ajith S commented on SPARK-26961:
-

The problem is that org.apache.spark.util.MutableURLClassLoader (the entire 
classloader) is getting locked. Checking java.lang.ClassLoader:
{code:java}
protected Object getClassLoadingLock(String className) {
    Object lock = this;
    if (parallelLockMap != null) {
        Object newLock = new Object();
        lock = parallelLockMap.putIfAbsent(className, newLock);
        if (lock == null) {
            lock = newLock;
        }
    }
    return lock;
}{code}
Here we see that a new lock object is created for every class being loaded, so 
the classloader itself is not locked as a whole.

The only place I see this happening is via 
{{ClassLoader.registerAsParallelCapable();}}, i.e. in the static block of 
java.net.URLClassLoader:
{code:java}
static {
    sun.misc.SharedSecrets.setJavaNetAccess(
        new sun.misc.JavaNetAccess() {
            public URLClassPath getURLClassPath(URLClassLoader u) {
                return u.ucp;
            }

            public String getOriginalHostName(InetAddress ia) {
                return ia.holder.getOriginalHostName();
            }
        }
    );
    ClassLoader.registerAsParallelCapable();
}{code}
So all subclasses of URLClassLoader will lock the entire classloader for class 
loading, causing this lock.

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that none of the worker containers could connect to 
> the driver after start and the driver was hanging; using jstack, we found a 
> Java-level deadlock.
>  
> *The jstack output for the deadlock is shown below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.<init>(URL.java:599)
>  at java.net.URL.<init>(URL.java:490)
>  at java.net.URL.<init>(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> 

[jira] [Commented] (SPARK-27020) Unable to insert data with partial dynamic partition with Spark & Hive 3

2019-03-03 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783002#comment-16783002
 ] 

sandeep katta commented on SPARK-27020:
---

Hi, can you please provide some details:

1. What is the schema of both tables?

2. How was the table t1 created (the CREATE TABLE command)?

 

 

> Unable to insert data with partial dynamic partition with Spark & Hive 3
> 
>
> Key: SPARK-27020
> URL: https://issues.apache.org/jira/browse/SPARK-27020
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Hortonwork HDP 3.1.0
> Spark 2.3.2
> Hive 3
>Reporter: Truong Duc Kien
>Priority: Major
>
> When inserting data with dynamic partitions, the operation fails if not all of 
> the partitions are dynamic. For example, the query
> {code:sql}
> insert overwrite table t1 (part_a='a', part_b) select * from t2
> {code}
> fails with the following errors:
> {code:xml}
> Cannot create partition spec from hdfs:/// ; missing keys [part_a]
> Ignoring invalid DP directory 
> {code}
> On the other hand, if I remove the static value of part_a to make the insert 
> fully dynamic, the following query succeeds:
> {code:sql}
> insert overwrite table t1 (part_a, part_b) select * from t2
> {code}






[jira] [Resolved] (SPARK-26956) remove streaming output mode from data source v2 APIs

2019-03-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26956.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> remove streaming output mode from data source v2 APIs
> -
>
> Key: SPARK-26956
> URL: https://issues.apache.org/jira/browse/SPARK-26956
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Issue Comment Deleted] (SPARK-27020) Unable to insert data with partial dynamic partition with Spark & Hive 3

2019-03-03 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta updated SPARK-27020:
--
Comment: was deleted

(was: I would like to take up this jira, will be working on this jira)

> Unable to insert data with partial dynamic partition with Spark & Hive 3
> 
>
> Key: SPARK-27020
> URL: https://issues.apache.org/jira/browse/SPARK-27020
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Hortonwork HDP 3.1.0
> Spark 2.3.2
> Hive 3
>Reporter: Truong Duc Kien
>Priority: Major
>
> When inserting data with dynamic partitions, the operation fails if not all of 
> the partitions are dynamic. For example, the query
> {code:sql}
> insert overwrite table t1 (part_a='a', part_b) select * from t2
> {code}
> fails with the following errors:
> {code:xml}
> Cannot create partition spec from hdfs:/// ; missing keys [part_a]
> Ignoring invalid DP directory 
> {code}
> On the other hand, if I remove the static value of part_a to make the insert 
> fully dynamic, the following query succeeds:
> {code:sql}
> insert overwrite table t1 (part_a, part_b) select * from t2
> {code}






[jira] [Updated] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher

2019-03-03 Thread Martin Loncaric (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-27015:

Affects Version/s: (was: 2.5.0)
   (was: 3.0.0)
   2.3.3
   2.4.0

> spark-submit does not properly escape arguments sent to Mesos dispatcher
> 
>
> Key: SPARK-27015
> URL: https://issues.apache.org/jira/browse/SPARK-27015
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> Arguments sent to the dispatcher must be escaped; for instance,
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a 
> b$c"{noformat}
> fails, and instead must be submitted as
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ 
> b\\$c"{noformat}






[jira] [Updated] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher

2019-03-03 Thread Martin Loncaric (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-27015:

Fix Version/s: 3.0.0
   2.5.0

> spark-submit does not properly escape arguments sent to Mesos dispatcher
> 
>
> Key: SPARK-27015
> URL: https://issues.apache.org/jira/browse/SPARK-27015
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Fix For: 2.5.0, 3.0.0
>
>
> Arguments sent to the dispatcher must be escaped; for instance,
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a 
> b$c"{noformat}
> fails, and instead must be submitted as
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ 
> b\\$c"{noformat}






[jira] [Updated] (SPARK-27015) spark-submit does not properly escape arguments sent to Mesos dispatcher

2019-03-03 Thread Martin Loncaric (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-27015:

Affects Version/s: (was: 2.3.3)
   (was: 2.4.0)
   3.0.0
   2.5.0

> spark-submit does not properly escape arguments sent to Mesos dispatcher
> 
>
> Key: SPARK-27015
> URL: https://issues.apache.org/jira/browse/SPARK-27015
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.5.0, 3.0.0
>Reporter: Martin Loncaric
>Priority: Major
>
> Arguments sent to the dispatcher must be escaped; for instance,
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a 
> b$c"{noformat}
> fails, and instead must be submitted as
> {noformat}spark-submit --master mesos://url:port my.jar --arg1 "a\\ 
> b\\$c"{noformat}






[jira] [Assigned] (SPARK-26893) Allow partition pruning with subquery filters on file source

2019-03-03 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26893:
---

Assignee: Peter Toth

> Allow partition pruning with subquery filters on file source
> 
>
> Key: SPARK-26893
> URL: https://issues.apache.org/jira/browse/SPARK-26893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Minor
>
> File source doesn't use subquery filters for partition pruning. But it could 
> use those filters with a minor improvement.
> This query is an example:
> {noformat}
> CREATE TABLE a (id INT, p INT) USING PARQUET PARTITIONED BY (p)
> CREATE TABLE b (id INT) USING PARQUET
> SELECT * FROM a WHERE p <= (SELECT MIN(id) FROM b){noformat}
> Where the executed plan of the SELECT currently is:
> {noformat}
> *(1) Filter (p#252L <= Subquery subquery250)
> : +- Subquery subquery250
> : +- *(2) HashAggregate(keys=[], functions=[min(id#253L)], 
> output=[min(id)#255L])
> : +- Exchange SinglePartition
> : +- *(1) HashAggregate(keys=[], functions=[partial_min(id#253L)], 
> output=[min#259L])
> : +- *(1) FileScan parquet default.b[id#253L] Batched: true, DataFilters: [], 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/b],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> +- *(1) FileScan parquet default.a[id#251L,p#252L] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/a/p=0,
>  file:..., PartitionCount: 2, PartitionFilters: [isnotnull(p#252L)], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
> But it could be: 
> {noformat}
> *(1) FileScan parquet default.a[id#251L,p#252L] Batched: true, DataFilters: 
> [], Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/a/p=0,
>  file:..., PartitionFilters: [isnotnull(p#252L), (p#252L <= Subquery 
> subquery250)], PushedFilters: [], ReadSchema: struct
> +- Subquery subquery250
> +- *(2) HashAggregate(keys=[], functions=[min(id#253L)], 
> output=[min(id)#255L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_min(id#253L)], 
> output=[min#259L])
> +- *(1) FileScan parquet default.b[id#253L] Batched: true, DataFilters: [], 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/b],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}
> and so partition pruning could work in {{FileSourceScanExec}}.
>  Please note that {{PartitionCount}} metadata can't be computed before 
> execution so in this case it is no longer part of the plan.






[jira] [Commented] (SPARK-27020) Unable to insert data with partial dynamic partition with Spark & Hive 3

2019-03-03 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782980#comment-16782980
 ] 

sandeep katta commented on SPARK-27020:
---

I would like to take up this jira, will be working on this jira

> Unable to insert data with partial dynamic partition with Spark & Hive 3
> 
>
> Key: SPARK-27020
> URL: https://issues.apache.org/jira/browse/SPARK-27020
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Hortonwork HDP 3.1.0
> Spark 2.3.2
> Hive 3
>Reporter: Truong Duc Kien
>Priority: Major
>
> When inserting data with dynamic partitions, the operation fails if not all of 
> the partitions are dynamic. For example, the query
> {code:sql}
> insert overwrite table t1 (part_a='a', part_b) select * from t2
> {code}
> fails with the following errors:
> {code:xml}
> Cannot create partition spec from hdfs:/// ; missing keys [part_a]
> Ignoring invalid DP directory 
> {code}
> On the other hand, if I remove the static value of part_a to make the insert 
> fully dynamic, the following query succeeds:
> {code:sql}
> insert overwrite table t1 (part_a, part_b) select * from t2
> {code}






[jira] [Resolved] (SPARK-26893) Allow partition pruning with subquery filters on file source

2019-03-03 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26893.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23802
[https://github.com/apache/spark/pull/23802]

> Allow partition pruning with subquery filters on file source
> 
>
> Key: SPARK-26893
> URL: https://issues.apache.org/jira/browse/SPARK-26893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Minor
> Fix For: 3.0.0
>
>
> File source doesn't use subquery filters for partition pruning. But it could 
> use those filters with a minor improvement.
> This query is an example:
> {noformat}
> CREATE TABLE a (id INT, p INT) USING PARQUET PARTITIONED BY (p)
> CREATE TABLE b (id INT) USING PARQUET
> SELECT * FROM a WHERE p <= (SELECT MIN(id) FROM b){noformat}
> Where the executed plan of the SELECT currently is:
> {noformat}
> *(1) Filter (p#252L <= Subquery subquery250)
> : +- Subquery subquery250
> : +- *(2) HashAggregate(keys=[], functions=[min(id#253L)], 
> output=[min(id)#255L])
> : +- Exchange SinglePartition
> : +- *(1) HashAggregate(keys=[], functions=[partial_min(id#253L)], 
> output=[min#259L])
> : +- *(1) FileScan parquet default.b[id#253L] Batched: true, DataFilters: [], 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/b],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> +- *(1) FileScan parquet default.a[id#251L,p#252L] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/a/p=0,
>  file:..., PartitionCount: 2, PartitionFilters: [isnotnull(p#252L)], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
> But it could be: 
> {noformat}
> *(1) FileScan parquet default.a[id#251L,p#252L] Batched: true, DataFilters: 
> [], Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/a/p=0,
>  file:..., PartitionFilters: [isnotnull(p#252L), (p#252L <= Subquery 
> subquery250)], PushedFilters: [], ReadSchema: struct
> +- Subquery subquery250
> +- *(2) HashAggregate(keys=[], functions=[min(id#253L)], 
> output=[min(id)#255L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_min(id#253L)], 
> output=[min#259L])
> +- *(1) FileScan parquet default.b[id#253L] Batched: true, DataFilters: [], 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/b],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}
> and so partition pruning could work in {{FileSourceScanExec}}.
>  Please note that {{PartitionCount}} metadata can't be computed before 
> execution so in this case it is no longer part of the plan.






[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException

2019-03-03 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782975#comment-16782975
 ] 

Udbhav Agrawal commented on SPARK-26727:


[~gsomogyi] Can you tell me the details of the test?

 

> CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
> ---
>
> Key: SPARK-26727
> URL: https://issues.apache.org/jira/browse/SPARK-26727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Srinivas Yarra
>Priority: Major
>
> We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT ... FROM ..." fails with the following exception:
> {code:java}
> // code placeholder
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view '' already exists in database 'default'; at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> org.apache.spark.sql.Dataset.(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at 
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided
> {code}
> {code}
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") 
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> 

[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-03-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782961#comment-16782961
 ] 

Hyukjin Kwon commented on SPARK-26964:
--

I resolved it as Later mainly due to no feedback. I think it's fine to reopen. 
You can try to open a PR and fix it if the change is small. Otherwise, I doubt 
it is worth it.

> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. these newer specs differ to the original [RFC4627| 
> https://tools.ietf.org/html/rfc4627] (which is now obsolete) that 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.
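As a quick illustration of the current limitation (a hedged sketch against Spark 2.4 in spark-shell; exact error messages may differ):

{code:scala}
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._

// Supported today: struct (and array/map) inputs.
Seq((1, "a")).toDF("id", "s").select(to_json(struct($"id", $"s"))).show()

// Not supported today: a bare scalar column fails analysis, even though
// "1" is a valid JSON document under RFC 8259.
Seq(1).toDF("n").select(to_json($"n")).show()
{code}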



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-03-03 Thread Huon Wilson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782958#comment-16782958
 ] 

Huon Wilson edited comment on SPARK-26964 at 3/4/19 4:45 AM:
-

I see. Could you say why you're resolving it as Later? I'm not quite sure I 
understand how the error handling for corrupt records differs between this and 
the existing functionality in {{from_json}}, e.g. the corrupt record handling 
for decoding {{"x"}} as {{int}} seems to already exist (in the form of 
{{JacksonParser.parse}} converting an exception into a {{BadRecordException}}, 
and {{FailureSafeParser}} catching it) because the same error occurs when 
decoding {{\{"value":"x"\}}} as {{struct}}.

Along those lines, we're now using the following code to map arbitrary values 
to their JSON strings, and back. It involves wrapping the values in a struct, 
and using string manipulation to pull out the true JSON string.

{code:scala}
import java.util.regex.Pattern
// ...

object JsonHacks {
  // FIXME: massive hack working-around (a) the requirement to make an
  // explicit map for storage (would be nicer to just dump
  // columns in directly, and (b) to_json/from_json not supporting scalars
  // (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // remove the prefix only when it is at the start of the string, and the
  // suffix only at the end
  private val StripRegexp =
s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

 def valueToJson(col: Column): Column = {
// Nest the column in a struct so that to_json can work ...
val structJson = to_json(struct(col as TempName))
// ... but, because of this nesting, to_json(...) gives "{}" (not
// null) if col is null, while this function needs to preserve that
// null-ness.
val nullOrStruct = when(col.isNull, null).otherwise(structJson)

// Strip off the struct wrapping to pull out the JSON-ified `col`
regexp_replace(nullOrStruct, StripRegexp, "")
  }
 def valueFromJson(
col: Column,
dataType: DataType,
nullable: Boolean
  ): Column = {
// from_json only works with a struct, so that's what we're going to be
// parsing.
val json_schema = new StructType().add(TempName, dataType, nullable)

// To be able to parse into a struct, the JSON column needs to be wrapped
// in what was stripped off above.
val structJson = concat(lit(Prefix), col, lit(Suffix))
// Now we're finally ready to parse
val parsedStruct = from_json(structJson, json_schema)
// ... and extract the field to get the actual parsed column.
parsedStruct(TempName)
  }
}
{code}
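A hypothetical usage sketch of the helpers above (assumes a spark-shell session with {{import spark.implicits._}} and the imports elided by the {{// ...}} in the snippet; the column name {{n}} is made up):

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val df = Seq(1, 2, 3).toDF("n")

// Scalar column -> JSON text: the value 1 becomes the string "1".
val asJson = df.select(JsonHacks.valueToJson(col("n")) as "json")

// JSON text -> typed scalar column again.
val roundTripped = asJson.select(
  JsonHacks.valueFromJson(col("json"), IntegerType, nullable = false) as "n")
roundTripped.show()
{code}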


was (Author: huonw):
I see. Could you say why you're resolving it as Later? I'm not quite sure I 
understand how the error handling for corrupt records differs between this and 
the existing functionality in {{from_json}}, e.g. the corrupt record handling 
for decoding {{"x"}} as {{int}} seems to already exist (in the form of 
{{JacksonParser.parse}} converting exceptions into {{BadRecordException}}s, and 
{{FailureSafeParser}} catching them) because the same error occurs when 
decoding {{\{"value":"x"\}}} as {{struct}}.

Along those lines, we're now using the following code to map arbitrary values 
to their JSON strings, and back. It involves wrapping the values in a struct, 
and using string manipulation to pull out the true JSON string.

{code:scala}
import java.util.regex.Pattern
// ...

object JsonHacks {
  // FIXME: massive hack working-around (a) the requirement to make an
  // explicit map for storage (would be nicer to just dump
  // columns in directly, and (b) to_json/from_json not supporting scalars
  // (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // remove the prefix only when it is at the start of the string, and the
  // suffix only at the end
  private val StripRegexp =
s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

 def valueToJson(col: Column): Column = {
// Nest the column in a struct so that to_json can work ...
val structJson = to_json(struct(col as TempName))
// ... but, because of this nesting, to_json(...) gives "{}" (not
// null) if col is null, while this function needs to preserve that
// null-ness.
val nullOrStruct = when(col.isNull, null).otherwise(structJson)

// Strip off the struct wrapping to pull out the JSON-ified `col`
regexp_replace(nullOrStruct, StripRegexp, "")
  }
 def valueFromJson(
col: Column,
dataType: DataType,
nullable: Boolean
  ): Column = {
// from_json only works with a struct, so that's what we're going to be
// parsing.
val json_schema = new StructType().add(TempName, dataType, nullable)

// To be able to parse 

[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-03-03 Thread Huon Wilson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782958#comment-16782958
 ] 

Huon Wilson commented on SPARK-26964:
-

I see. Could you say why you're resolving it as Later? I'm not quite sure I 
understand how the error handling for corrupt records differs between this and 
the existing functionality in {{from_json}}, e.g. the corrupt record handling 
for decoding {{"x"}} as {{int}} seems to already exist (in the form of 
{{JacksonParser.parse}} converting exceptions into {{BadRecordException}}s, and 
{{FailureSafeParser}} catching them) because the same error occurs when 
decoding {{\{"value":"x"\}}} as {{struct}}.

Along those lines, we're now using the following code to map arbitrary values 
to their JSON strings, and back. It involves wrapping the values in a struct, 
and using string manipulation to pull out the true JSON string.

{code:scala}
import java.util.regex.Pattern
// ...

object JsonHacks {
  // FIXME: massive hack working-around (a) the requirement to make an
  // explicit map for storage (would be nicer to just dump
  // columns in directly, and (b) to_json/from_json not supporting scalars
  // (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // remove the prefix only when it is at the start of the string, and the
  // suffix only at the end
  private val StripRegexp =
s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

 def valueToJson(col: Column): Column = {
// Nest the column in a struct so that to_json can work ...
val structJson = to_json(struct(col as TempName))
// ... but, because of this nesting, to_json(...) gives "{}" (not
// null) if col is null, while this function needs to preserve that
// null-ness.
val nullOrStruct = when(col.isNull, null).otherwise(structJson)

// Strip off the struct wrapping to pull out the JSON-ified `col`
regexp_replace(nullOrStruct, StripRegexp, "")
  }
 def valueFromJson(
col: Column,
dataType: DataType,
nullable: Boolean
  ): Column = {
// from_json only works with a struct, so that's what we're going to be
// parsing.
val json_schema = new StructType().add(TempName, dataType, nullable)

// To be able to parse into a struct, the JSON column needs to be wrapped
// in what was stripped off above.
val structJson = concat(lit(Prefix), col, lit(Suffix))
// Now we're finally ready to parse
val parsedStruct = from_json(structJson, json_schema)
// ... and extract the field to get the actual parsed column.
parsedStruct(TempName)
  }
}
{code}

> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. these newer specs differ to the original [RFC4627| 
> https://tools.ietf.org/html/rfc4627] (which is now obsolete) that 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27032) Flaky test: org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision

2019-03-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27032.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23937
[https://github.com/apache/spark/pull/23937]

> Flaky test: 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog:
>  metadata directory collision
> ---
>
> Key: SPARK-27032
> URL: https://issues.apache.org/jira/browse/SPARK-27032
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Critical
> Fix For: 3.0.0
>
>
> Locally and on Jenkins, I've frequently seen this test fail:
> {code}
> Error Message
> The await method on Waiter timed out.
> Stacktrace
>   org.scalatest.exceptions.TestFailedException: The await method on 
> Waiter timed out.
>   at org.scalatest.concurrent.Waiters$Waiter.awaitImpl(Waiters.scala:406)
>   at org.scalatest.concurrent.Waiters$Waiter.await(Waiters.scala:540)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19(HDFSMetadataLogSuite.scala:158)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19$adapted(HDFSMetadataLogSuite.scala:133)
> ...
> {code}
> See for example 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6057/testReport/
> There aren't obvious errors or problems with the test. Because it passes 
> sometimes, my guess is that the timeout is simply too short or the test too 
> long. I'd like to try reducing the number of threads/batches in the test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27027:
-
Description: 
{{from_avro}} function produces wrong output of a struct field.  See the output 
at the bottom of the description

{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions._


spark.version

val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
50)).toDF("id", "name", "age")

val dfStruct = df.withColumn("value", struct("name","age"))

dfStruct.show
dfStruct.printSchema

val dfKV = dfStruct.select(to_avro('id).as("key"), to_avro('value).as("value"))

val expectedSchema = StructType(Seq(StructField("name", StringType, 
true),StructField("age", IntegerType, false)))

val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString

val avroTypeStr = s"""
 |{
 | "type": "int",
 | "name": "key"
 |}
 """.stripMargin


dfKV.select(from_avro('key, avroTypeStr)).show
dfKV.select(from_avro('value, avroTypeStruct)).show

// output for the last statement and that is not correct
+-+
|from_avro(value, struct)|
+-+
| [Josh Duke, 50]|
| [Josh Duke, 50]|
| [Josh Duke, 50]|
+-+
{code}

  was:
from_avro function produces wrong output of a struct field.  See the output at 
the bottom of the description

=

import org.apache.spark.sql.types._
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions._


spark.version

val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
50)).toDF("id", "name", "age")

val dfStruct = df.withColumn("value", struct("name","age"))

dfStruct.show
dfStruct.printSchema

val dfKV = dfStruct.select(to_avro('id).as("key"), to_avro('value).as("value"))

val expectedSchema = StructType(Seq(StructField("name", StringType, 
true),StructField("age", IntegerType, false)))

val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString

val avroTypeStr = s"""
 |{
 | "type": "int",
 | "name": "key"
 |}
 """.stripMargin


dfKV.select(from_avro('key, avroTypeStr)).show
dfKV.select(from_avro('value, avroTypeStruct)).show

// output for the last statement and that is not correct
+-+
|from_avro(value, struct)|
+-+
| [Josh Duke, 50]|
| [Josh Duke, 50]|
| [Josh Duke, 50]|
+-+


> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hien Luu
>Priority: Minor
>
> {{from_avro}} function produces wrong output of a struct field.  See the 
> output at the bottom of the description
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +-+
> |from_avro(value, struct)|
> +-+
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27030) DataFrameWriter.insertInto fails when writing in parallel to a hive table

2019-03-03 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782943#comment-16782943
 ] 

Shivu Sondur commented on SPARK-27030:
--

I am checking this issue

> DataFrameWriter.insertInto fails when writing in parallel to a hive table
> -
>
> Key: SPARK-27030
> URL: https://issues.apache.org/jira/browse/SPARK-27030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Lev Katzav
>Priority: Major
>
> When writing to a hive table, the following temp directory is used:
> {code:java}
> /path/to/table/_temporary/0/{code}
> (the 0 at the end comes from the config
> {code:java}
> "mapreduce.job.application.attempt.id"{code}
> since that config is missing, it falls back to 0)
> when there are 2 processes that write to the same table, there could be the 
> following race condition:
>  # p1 creates temp folder and uses it
>  # p2 uses temp folder
>  # p1 finishes and deletes temp folder
>  # p2 fails since temp folder is missing
>  
> It is possible to recreate this error locally with the following code:
> (the code runs locally, but I experienced the same error when running on a 
> cluster
> with 2 jobs writing to the same table)
> {code:java}
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.functions._
> import scala.collection.parallel.ForkJoinTaskSupport
> import scala.concurrent.forkjoin.ForkJoinPool
>
> val df = spark
>   .range(1000)
>   .toDF("a")
>   .withColumn("partition", lit(0))
>   .cache()
>
> // create db
> sqlContext.sql("CREATE DATABASE IF NOT EXISTS db").count()
>
> // create table
> df
>   .write
>   .partitionBy("partition")
>   .saveAsTable("db.table")
>
> val x = (1 to 100).par
> x.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))
>
> // insert into different partitions in parallel
> x.foreach { p =>
>   val df2 = df.withColumn("partition", lit(p))
>   df2
>     .write
>     .mode(SaveMode.Overwrite)
>     .insertInto("db.table")
> }
> {code}
>  
>  the error would be:
> {code:java}
> java.io.FileNotFoundException: File 
> file:/path/to/warehouse/db.db/table/_temporary/0 does not exist
>  at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:406)
>  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497)
>  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:669)
>  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497)
>  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
>  at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:283)
>  at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:325)
>  at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>  at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:185)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>  

[jira] [Created] (SPARK-27038) Rack resolving takes a long time when initializing TaskSetManager

2019-03-03 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-27038:
--

 Summary: Rack resolving takes a long time when initializing 
TaskSetManager
 Key: SPARK-27038
 URL: https://issues.apache.org/jira/browse/SPARK-27038
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.4.0
Reporter: Lantao Jin


If a stage is submitted with a large number of tasks, rack resolving takes a long 
time when initializing the TaskSetManager, caused by a large number of loop 
iterations that each execute the rack-resolving script.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27028) PySpark read .dat file. Multiline issue

2019-03-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782941#comment-16782941
 ] 

Hyukjin Kwon commented on SPARK-27028:
--

If a newline is included in the data, the value should be quoted so that it is 
recognised as part of a single, valid row.
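
For example, a minimal Scala sketch of the reader options involved (the equivalent {{multiLine}}, {{quote}} and {{sep}} options exist on the PySpark reader; the file path and quote character here are assumptions):

{code:scala}
// A value containing "\n" must be wrapped in the configured quote character in the
// file itself; otherwise each physical line is parsed as a separate record.
val df = spark.read
  .option("sep", "\u0001")    // the '\x01' delimiter mentioned in the report
  .option("quote", "\"")      // embedded newlines must appear inside these quotes
  .option("multiLine", true)  // allow a quoted value to span physical lines
  .csv("/path/to/file.dat")
{code}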

> PySpark read .dat file. Multiline issue
> ---
>
> Key: SPARK-27028
> URL: https://issues.apache.org/jira/browse/SPARK-27028
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Pyspark(2.4) in AWS EMR
>Reporter: alokchowdary
>Priority: Critical
>
> * I am trying to read a .dat file using the PySpark CSV reader, and the data 
> contains newline characters ("\n") as part of the values. Spark is unable to 
> read such a value as a single column and instead treats the text after the 
> newline as a new row. I tried using the "multiLine" option while reading, but 
> it still does not work.
>  * {{spark.read.csv(file_path, schema=schema, sep=delimiter, multiLine=True)}}
>  * The data looks like the lines below; every line is treated as a row in the 
> dataframe. Here '\x01' is the actual delimiter (',' is shown instead for ease 
> of reading).
> {{1. name,test,12345,}}
> {{2. x, }}
> {{3. desc }}
> {{4. name2,test2,12345 }}
> {{5. ,y}}
> {{6. ,desc2}}
>  * So PySpark treats "x" and "desc" as new rows in the dataframe, with nulls 
> for the other columns.
> How can such data be read in PySpark?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27037) Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value

2019-03-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27037.
--
Resolution: Not A Problem

> Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value
> -
>
> Key: SPARK-27037
> URL: https://issues.apache.org/jira/browse/SPARK-27037
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Tanjin Panna
>Priority: Major
>
> When we have a tuple as the key or value in a {{MapType}} and call the 
> .{{asDict()}}, we still have a {{Row}} in the output for the key and value:
> {code:java}
> >>> from pyspark.sql import Row 
>  df = spark.createDataFrame(
>   [ 
> Row(tuple_map={('hello', True): (1234, 111)}), 
> Row(tuple_map={('there', False): (5678, 343)}) 
>   ] 
> ) 
> >>> df.schema
> StructType(List(StructField(tuple_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))
> >>> df.show(truncate=False)
> +---+
> |tuple_map                  |
> +---+
> |[[hello, true] -> [1234, 111]] |
> |[[there, false] -> [5678, 343]]|
> +---+
> >>> df.head().asDict()
> {'tuple_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27037) Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value

2019-03-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782940#comment-16782940
 ] 

Hyukjin Kwon commented on SPARK-27037:
--

Use {{asDict(recursive=True)}}

{code}
>>> df.head().asDict(recursive=True)
{'tuple_map': {Row(_1=u'hello', _2=True): {'_2': 111, '_1': 1234}}}
{code}

> Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value
> -
>
> Key: SPARK-27037
> URL: https://issues.apache.org/jira/browse/SPARK-27037
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Tanjin Panna
>Priority: Major
>
> When we have a tuple as the key or value in a {{MapType}} and call the 
> .{{asDict()}}, we still have a {{Row}} in the output for the key and value:
> {code:java}
> >>> from pyspark.sql import Row 
>  df = spark.createDataFrame(
>   [ 
> Row(tuple_map={('hello', True): (1234, 111)}), 
> Row(tuple_map={('there', False): (5678, 343)}) 
>   ] 
> ) 
> >>> df.schema
> StructType(List(StructField(tuple_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))
> >>> df.show(truncate=False)
> +---+
> |tuple_map                  |
> +---+
> |[[hello, true] -> [1234, 111]] |
> |[[there, false] -> [5678, 343]]|
> +---+
> >>> df.head().asDict()
> {'tuple_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-03-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26964.
--
Resolution: Later

Let me leave this resolved as Later for now.

> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. these newer specs differ to the original [RFC4627| 
> https://tools.ietf.org/html/rfc4627] (which is now obsolete) that 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27001) Refactor "serializerFor" method between ScalaReflection and JavaTypeInference

2019-03-03 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27001.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23908
[https://github.com/apache/spark/pull/23908]

> Refactor "serializerFor" method between ScalaReflection and JavaTypeInference
> -
>
> Key: SPARK-27001
> URL: https://issues.apache.org/jira/browse/SPARK-27001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.0.0
>
>
> While addressing SPARK-22000 we refactored the "deserializer" method between 
> ScalaReflection and JavaTypeInference, since a lot of the code was duplicated 
> and the two implementations had even diverged, leaving the Java side without 
> an upcast.
> Now that we have taken that first step, I would propose refactoring 
> "serializerFor" as well, since similar duplication is observed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27001) Refactor "serializerFor" method between ScalaReflection and JavaTypeInference

2019-03-03 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27001:
---

Assignee: Jungtaek Lim

> Refactor "serializerFor" method between ScalaReflection and JavaTypeInference
> -
>
> Key: SPARK-27001
> URL: https://issues.apache.org/jira/browse/SPARK-27001
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
>
> While addressing SPARK-22000 we refactored the "deserializer" method between 
> ScalaReflection and JavaTypeInference, since a lot of the code was duplicated 
> and the two implementations had even diverged, leaving the Java side without 
> an upcast.
> Now that we have taken that first step, I would propose refactoring 
> "serializerFor" as well, since similar duplication is observed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27037) Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value

2019-03-03 Thread Tanjin Panna (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanjin Panna updated SPARK-27037:
-
Description: 
When we have a tuple as the key or value in a {{MapType}} and call the 
.{{asDict()}}, we still have a {{Row}} in the output for the key and value:
{code:java}
>>> from pyspark.sql import Row 
 df = spark.createDataFrame(
  [ 
Row(tuple_map={('hello', True): (1234, 111)}), 
Row(tuple_map={('there', False): (5678, 343)}) 
  ] 
) 

>>> df.schema
StructType(List(StructField(tuple_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))

>>> df.show(truncate=False)
+---+
|tuple_map                  |
+---+
|[[hello, true] -> [1234, 111]] |
|[[there, false] -> [5678, 343]]|
+---+

>>> df.head().asDict()
{'tuple_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
{code}
 

  was:
When we have a tuple as the key or value in a {{MapType}} and call the 
.{{asDict()}}, we still have a {{Row}} in the output for the key and value:
{code:java}
>>> from pyspark.sql import Row 
 df = spark.createDataFrame(
  [ 
Row(tuple_key_map={('hello', True): (1234, 111)}), 
Row(tuple_key_map={('there', False): (5678, 343)}) 
  ] 
) 

>>> df.schema
StructType(List(StructField(tuple_key_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))

>>> df.show(truncate=False)
+---+
|tuple_key_map                  |
+---+
|[[hello, true] -> [1234, 111]] |
|[[there, false] -> [5678, 343]]|
+---+

>>> df.head().asDict()
{'tuple_key_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
{code}
 


> Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value
> -
>
> Key: SPARK-27037
> URL: https://issues.apache.org/jira/browse/SPARK-27037
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Tanjin Panna
>Priority: Major
>
> When we have a tuple as the key or value in a {{MapType}} and call the 
> .{{asDict()}}, we still have a {{Row}} in the output for the key and value:
> {code:java}
> >>> from pyspark.sql import Row 
>  df = spark.createDataFrame(
>   [ 
> Row(tuple_map={('hello', True): (1234, 111)}), 
> Row(tuple_map={('there', False): (5678, 343)}) 
>   ] 
> ) 
> >>> df.schema
> StructType(List(StructField(tuple_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))
> >>> df.show(truncate=False)
> +---+
> |tuple_map                  |
> +---+
> |[[hello, true] -> [1234, 111]] |
> |[[there, false] -> [5678, 343]]|
> +---+
> >>> df.head().asDict()
> {'tuple_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27037) Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value

2019-03-03 Thread Tanjin Panna (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanjin Panna updated SPARK-27037:
-
Description: 
When we have a tuple as the key or value in a {{MapType}} and call the 
.{{asDict()}}, we still have a {{Row}} in the output for the key and value:
{code:java}
>>> from pyspark.sql import Row 
 df = spark.createDataFrame(
  [ 
Row(tuple_key_map={('hello', True): (1234, 111)}), 
Row(tuple_key_map={('there', False): (5678, 343)}) 
  ] 
) 

>>> df.schema
StructType(List(StructField(tuple_key_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))

>>> df.show(truncate=False)
+---+
|tuple_key_map                  |
+---+
|[[hello, true] -> [1234, 111]] |
|[[there, false] -> [5678, 343]]|
+---+

>>> df.head().asDict()
{'tuple_key_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
{code}
 

  was:
When we have a tuple as the key or value in a `MapType` and call the 
`asDict()`, we still have a `Row` in the output for the key and value:
{code}
>>> from pyspark.sql import Row 
 df = spark.createDataFrame(
  [ 
Row(tuple_key_map={('hello', True): (1234, 111)}), 
Row(tuple_key_map={('there', False): (5678, 343)}) 
  ] 
) 

>>> df.schema
StructType(List(StructField(tuple_key_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))

>>> df.show(truncate=False)
+---+
|tuple_key_map                  |
+---+
|[[hello, true] -> [1234, 111]] |
|[[there, false] -> [5678, 343]]|
+---+

>>> df.head().asDict()
{'tuple_key_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
{code}
 


> Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value
> -
>
> Key: SPARK-27037
> URL: https://issues.apache.org/jira/browse/SPARK-27037
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Tanjin Panna
>Priority: Major
>
> When we have a tuple as the key or value in a {{MapType}} and call the 
> .{{asDict()}}, we still have a {{Row}} in the output for the key and value:
> {code:java}
> >>> from pyspark.sql import Row 
>  df = spark.createDataFrame(
>   [ 
> Row(tuple_key_map={('hello', True): (1234, 111)}), 
> Row(tuple_key_map={('there', False): (5678, 343)}) 
>   ] 
> ) 
> >>> df.schema
> StructType(List(StructField(tuple_key_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))
> >>> df.show(truncate=False)
> +---+
> |tuple_key_map                  |
> +---+
> |[[hello, true] -> [1234, 111]] |
> |[[there, false] -> [5678, 343]]|
> +---+
> >>> df.head().asDict()
> {'tuple_key_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27037) Pyspark Row .asDict() cannot handle MapType with a Struct as the key or value

2019-03-03 Thread Tanjin Panna (JIRA)
Tanjin Panna created SPARK-27037:


 Summary: Pyspark Row .asDict() cannot handle MapType with a Struct 
as the key or value
 Key: SPARK-27037
 URL: https://issues.apache.org/jira/browse/SPARK-27037
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Tanjin Panna


When we have a tuple as the key or value in a `MapType` and call the 
`asDict()`, we still have a `Row` in the output for the key and value:
{code}
>>> from pyspark.sql import Row 
 df = spark.createDataFrame(
  [ 
Row(tuple_key_map={('hello', True): (1234, 111)}), 
Row(tuple_key_map={('there', False): (5678, 343)}) 
  ] 
) 

>>> df.schema
StructType(List(StructField(tuple_key_map,MapType(StructType(List(StructField(_1,StringType,true),StructField(_2,BooleanType,true))),StructType(List(StructField(_1,LongType,true),StructField(_2,LongType,true))),true),true)))

>>> df.show(truncate=False)
+---+
|tuple_key_map                  |
+---+
|[[hello, true] -> [1234, 111]] |
|[[there, false] -> [5678, 343]]|
+---+

>>> df.head().asDict()
{'tuple_key_map': {Row(_1=u'hello', _2=True): Row(_1=1234, _2=111)}}
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:

2019-03-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25863:


Assignee: (was: Apache Spark)

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> 

[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala

2019-03-03 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782903#comment-16782903
 ] 

Takeshi Yamamuro commented on SPARK-25863:
--

Yea, returning 0 sounds reasonable to me, too.

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> 

[jira] [Commented] (SPARK-21871) Check actual bytecode size when compiling generated code

2019-03-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782904#comment-16782904
 ] 

Apache Spark commented on SPARK-21871:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/23947

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Critical
> Fix For: 2.3.0
>
>
> In SPARK-21603, we added code to give up code compilation and use interpreted 
> execution in SparkPlan if the number of lines of a generated function goes over 
> maxLinesPerFunction. But we already have code to collect metrics for the 
> compiled bytecode size in the `CodeGenerator` object, so I think we could 
> easily reuse that code for this purpose.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21871) Check actual bytecode size when compiling generated code

2019-03-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782905#comment-16782905
 ] 

Apache Spark commented on SPARK-21871:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/23947

> Check actual bytecode size when compiling generated code
> 
>
> Key: SPARK-21871
> URL: https://issues.apache.org/jira/browse/SPARK-21871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Critical
> Fix For: 2.3.0
>
>
> In SPARK-21603, we added code to give up code compilation and use interpreted 
> execution in SparkPlan if the number of lines of a generated function goes over 
> maxLinesPerFunction. But we already have code to collect metrics for the 
> compiled bytecode size in the `CodeGenerator` object, so I think we could 
> easily reuse that code for this purpose.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:

2019-03-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25863:


Assignee: Apache Spark

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Assignee: Apache Spark
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> 
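The failing call is the .max in the trace above. As a self-contained illustration of the Scala pitfall (not Spark's code), together with a defensive variant:

{code:scala}
// Calling .max on an empty collection throws
// java.lang.UnsupportedOperationException: empty.max
val methodSizes: Seq[Int] = Seq.empty

// val worst = methodSizes.max  // would throw: empty.max

// Guarded alternatives return a sentinel when there is nothing to compare.
val worst1 = if (methodSizes.isEmpty) 0 else methodSizes.max
val worst2 = methodSizes.reduceOption(_ max _).getOrElse(0)
{code}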

[jira] [Commented] (SPARK-26918) All .md should have ASF license header

2019-03-03 Thread Mani M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782900#comment-16782900
 ] 

Mani M commented on SPARK-26918:


Hi [~felixcheung]

Just to check: how do I remove the RAT filter for .md files?

In "Updating the Spark website", the below steps are given:
 # {{Build the latest docs}}
 # {{$ git checkout v1.1.1}}
 # {{$ cd docs}}

Should the checkout be done from master, and should the release number be 2.3.3?

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Per policy, all .md files should have the ASF license header, as in e.g.
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this file does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782876#comment-16782876
 ] 

Sean Owen commented on SPARK-26247:
---

There are two issues here -- load time of the model, and scoring outside Spark? 
What's the issue with load time? Surely that happens once before serving. Is it 
that you want to read a PipelineModel directly, without Spark? OK, that's more 
interesting. The thing can't be scored without Spark without some form of 
transformation, and that's mostly what MLeap does. I'm trying to understand why 
this is different enough that it needs to be in Spark as the 'blessed' 
solution. Putting the maintenance onto this project is more bug than feature. 

There are unfortunately several incomplete attempts to do something like this: 
mllib-local, PMML export. I hesitate to add another. 

If the idea is supporting single-instance scoring of models, that partly exists 
in some models already and in mllib-local.

> SPIP - ML Model Extension for no-Spark MLLib Online Serving
> ---
>
> Key: SPARK-26247
> URL: https://issues.apache.org/jira/browse/SPARK-26247
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Anne Holler
>Priority: Major
>  Labels: SPIP
> Attachments: SPIPMlModelExtensionForOnlineServing.pdf
>
>
> This ticket tracks an SPIP to improve model load time and model serving 
> interfaces for online serving of Spark MLlib models.  The SPIP is here
> [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]
>  
> The improvement opportunity exists in all versions of spark.  We developed 
> our set of changes wrt version 2.1.0 and can port them forward to other 
> versions (e.g., we have ported them forward to 2.3.2).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26918) All .md should have ASF license header

2019-03-03 Thread Mani M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782874#comment-16782874
 ] 

Mani M commented on SPARK-26918:


OK, I will add the headers and raise a PR.

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Per policy, all .md files should have the ASF license header, as in e.g.
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this file does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving

2019-03-03 Thread Anne Holler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782863#comment-16782863
 ] 

Anne Holler commented on SPARK-26247:
-

Hi, [~skonto] and [~srowen],

Thank you for your comments!  We hear your points: 1) that the idea that the 
Spark pipeline model representation can serve as the representation that 'rules 
them all' (as is the aspiration of, e.g., PMML and PFA) can be viewed as a 
Spark lock-in, and 2) that for many folks MLeap adequately solves the problem 
of online serving Spark pipeline models.

That said, our opinion is that there is general value to the Spark community in 
our small patch to the Spark codebase to reduce Spark pipeline model load time 
and to support low-latency scoring, so that online serving can be performed 
directly from the Spark model representation, if desired.  We believe that 
updating Spark while retaining its on-disk format (rather than depending on an 
external codebase with an alternative on-disk format, as is the case with 
MLeap) simplifies keeping the online and offline serving code paths consistent 
and lessens the risk of model serving mismatch.

> SPIP - ML Model Extension for no-Spark MLLib Online Serving
> ---
>
> Key: SPARK-26247
> URL: https://issues.apache.org/jira/browse/SPARK-26247
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Anne Holler
>Priority: Major
>  Labels: SPIP
> Attachments: SPIPMlModelExtensionForOnlineServing.pdf
>
>
> This ticket tracks an SPIP to improve model load time and model serving 
> interfaces for online serving of Spark MLlib models.  The SPIP is here
> [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]
>  
> The improvement opportunity exists in all versions of spark.  We developed 
> our set of changes wrt version 2.1.0 and can port them forward to other 
> versions (e.g., we have ported them forward to 2.3.2).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25130) [Python] Wrong timestamp returned by toPandas

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782846#comment-16782846
 ] 

Sean Owen commented on SPARK-25130:
---

[~maxgekk] is this likely fixed by your overhaul of time parsing?

> [Python] Wrong timestamp returned by toPandas
> -
>
> Key: SPARK-25130
> URL: https://issues.apache.org/jira/browse/SPARK-25130
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0, 2.3.1
> Environment: Tested with version 2.3.1 on OSX and 2.3.0 on Linux.
>Reporter: Anton Daitche
>Priority: Major
>
> The code snippet
> {code:python}
> import datetime
> df = spark.createDataFrame([(datetime.datetime(1901, 1, 1, 0, 0, 0),)], 
> ["ts"])
> print("collect:", df.collect()[0][0])
> print("toPandas:", df.toPandas().iloc[0, 0])
>  {code}
> prints
> {code:java}
> collect: 1901-01-01 00:00:00
> toPandas: 1900-12-31 23:53:00{code}
> Hence the toPandas method seems to convert the timestamp wrongly.
> The problem disappears for "1902-01-01 00:00:00" and later dates (I didn't do 
> an exhaustive test though).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25201) Synchronization performed on AtomicReference in LevelDB class

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25201.
---
Resolution: Invalid

I don't see a problem statement here

> Synchronization performed on AtomicReference in LevelDB class
> -
>
> Key: SPARK-25201
> URL: https://issues.apache.org/jira/browse/SPARK-25201
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Ted Yu
>Priority: Minor
>
> Here is related code:
> {code}
>   void closeIterator(LevelDBIterator it) throws IOException {
> synchronized (this._db) {
> {code}
> There is more than one occurrence of such usage.
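As a minimal sketch of why static analysis flags this pattern (illustrative Scala, not the actual LevelDB class): synchronizing on an AtomicReference locks the wrapper object rather than the value it holds, so a dedicated lock object usually states the intent more clearly.

{code:scala}
import java.util.concurrent.atomic.AtomicReference

class Store {
  private val db = new AtomicReference[AnyRef](new Object)

  // Flagged style: the lock is the AtomicReference wrapper itself; swapping
  // the referenced value does not change which object is locked.
  def closeFlagged(): Unit = db.synchronized {
    // release resources associated with db.get()
  }

  // Dedicated lock object: same locking behaviour, clearer intent.
  private val lock = new Object
  def closeWithLock(): Unit = lock.synchronized {
    // release resources associated with db.get()
  }
}
{code}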



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25350) Spark Serving

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782844#comment-16782844
 ] 

Sean Owen commented on SPARK-25350:
---

I think this kind of thing is great, but belongs outside the Spark core.

> Spark Serving
> -
>
> Key: SPARK-25350
> URL: https://issues.apache.org/jira/browse/SPARK-25350
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mark Hamilton
>Priority: Major
>  Labels: features
>
> Microsoft has created a new system to turn Structured Streaming jobs into 
> RESTful web services. We would like to commit this work back to the 
> community. 
> More information can be found at the [MMLSpark website|http://www.aka.ms/spark]
> and the [Spark Serving Documentation 
> page|https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md]
>  
> The code can be found in the MMLSpark repo and a PR will be made soon:
> [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala]
>  
> Thanks for your help and feedback!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25405) Saving RDD with new Hadoop API file as a Sequence File too restrictive

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25405.
---
Resolution: Not A Problem

Looks like you are using the old MapReduce OutputFormat classes with the 'new' 
API methods in Spark. This isn't a bug. Use the newer OutputFormat 
implementations under org.apache.hadoop.mapreduce.
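For reference, a minimal sketch of that change against the reporter's snippet below (subsetRDD and hc are reused from it; only the SequenceFileOutputFormat import moves to the new API package):

{code:scala}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
// The old org.apache.hadoop.mapred.SequenceFileOutputFormat is rejected by
// saveAsNewAPIHadoopFile; the new-API class below extends
// org.apache.hadoop.mapreduce.OutputFormat and is accepted.
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

subsetRDD.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc)
{code}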

> Saving RDD with new Hadoop API file as a Sequence File too restrictive
> --
>
> Key: SPARK-25405
> URL: https://issues.apache.org/jira/browse/SPARK-25405
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Marcin Gasior
>Priority: Major
>
> I tried to transform Hbase export (sequence file) using spark job, and face a 
> compilation issue:
>  
> {code:java}
> val hc = sc.hadoopConfiguration
> val serializers = List(
>   classOf[WritableSerialization].getName,
>   classOf[ResultSerialization].getName
> ).mkString(",")
> hc.set("io.serializations", serializers)
> val c = new Configuration(sc.hadoopConfiguration)
> c.set("mapred.input.dir", sourcePath)
> val subsetRDD = sc.newAPIHadoopRDD(
>   c,
>   classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]],
>   classOf[ImmutableBytesWritable],
>   classOf[Result])
> subsetRDD.saveAsNewAPIHadoopFile(
>   "output/sequence",
>   classOf[ImmutableBytesWritable],
>   classOf[Result],
>   classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
>   hc
> )
> {code}
>  
>  
> During compilation I received:
> {code:java}
> Error: type mismatch
> Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
>  
> required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]] 
> classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],{code}
>  
> By using Hadoop low-level api I could workaround the issue with following:
> {code:java}
> val writer = SequenceFile.createWriter(hc, Writer.file(new Path(“sample")),
>   Writer.keyClass(classOf[ImmutableBytesWritable]),
>   Writer.valueClass(classOf[Result]),
>   Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size",4096)),
>   Writer.replication(fs.getDefaultReplication()),
>   Writer.blockSize(1073741824),
>   Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
>   Writer.progressable(null),
>   Writer.metadata(new Metadata()))
> subset.foreach(p => writer.append(p._1, p._2))
> IOUtils.closeStream(writer)
> {code}
>  
> I think that the interface is too restrictive, and does not allow to pass 
> external (de)serializers
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25441) calculate term frequency in CountVectorizer()

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25441.
---
Resolution: Won't Fix

What you have there is already term frequency. If you want to normalize it to 
some kind of term fraction, you can just make that transformation yourself.
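For anyone looking for that transformation, here is a minimal Scala sketch. It assumes a fitted CountVectorizerModel named model and an input DataFrame df, with a "vectors" output column as in the PySpark example quoted below; those names are taken from the example, not from any Spark API for this.

{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Divide each count by the row's total so every output vector sums to 1.
val toTermFraction = udf { v: Vector =>
  val total = v.toArray.sum
  if (total == 0.0) v else Vectors.dense(v.toArray.map(_ / total)).compressed
}

val withTf = model.transform(df).withColumn("tf", toTermFraction(col("vectors")))
{code}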

> calculate term frequency in CountVectorizer()
> -
>
> Key: SPARK-25441
> URL: https://issues.apache.org/jira/browse/SPARK-25441
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Xinyong Tian
>Priority: Major
>
> currently CountVectorizer() can not output TF (term frequency). I hope there 
> will be such option.
> TF defined as https://en.m.wikipedia.org/wiki/Tf–idf
>  
> example,
> >>> df = spark.createDataFrame( ...  [(0, ["a", "b", "c"]), (1, ["a", "b", 
> >>> "b", "c", "a"])], ...  ["label", "raw"])
> >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> >>> model = cv.fit(df)
> >>> model.transform(df).limit(1).show(truncate=False)
> label        raw           vectors 
> 0            [a, b, c]       (3,[0,1,2],[1.0,1.0,1.0])
>  
> instead I want 
> 0            [a, b, c]       (3,[0,1,2],[0.33,0.33,0.33]) # i.e., each vector 
> divided by its sum (here 3), so the sum of the new vector will be 1 for every 
> row (document)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25552) Upgrade from Spark 1.6.3 to 2.3.0 seems to make jobs use about 50% more memory

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25552.
---
Resolution: Invalid

This is too broad. Literally thousands of things changed from 1.6 to 2.3. You'd 
have to at least narrow this down to what's taking so much memory. 

> Upgrade from Spark 1.6.3 to 2.3.0 seems to make jobs use about 50% more memory
> --
>
> Key: SPARK-25552
> URL: https://issues.apache.org/jira/browse/SPARK-25552
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.3.0
> Environment: Originally found in an AWS Kubernetes environment with 
> Spark Embedded.
> Also happens in a small scale with Spark Embedded both in Linux and MacOS.
>Reporter: Nuno Azevedo
>Priority: Major
> Attachments: Spark1.6-50GB.png, Spark2.3-50GB.png, Spark2.3-70GB.png
>
>
> After upgrading from Spark 1.6.3 to 2.3.0 our jobs started to need about 50% 
> more memory to run. The Spark properties used were the defaults in both 
> versions.
>  
> For instance, before we were running a job with Spark 1.6.3 and it was 
> running fine with 50 GB of memory.
> !Spark1.6-50GB.png|width=800,height=456!
>  
> After upgrading to Spark 2.3.0, when running the same job again with the same 
> 50 GB of memory it failed due to out of memory.
> !Spark2.3-50GB.png|width=800,height=366!
>  
> Then, we started incrementing the memory until we were able to run the job, 
> which was with 70 GB.
> !Spark2.3-70GB.png|width=800,height=366!
>  
> The Spark upgrade was the only change in our environment. After taking a look 
> at what seems to be causing this we noticed that Kryo Serializer is the main 
> culprit for the raise in memory consumption.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25466) Documentation does not specify how to set Kafka consumer cache capacity for SS

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25466:
--
 Labels:   (was: doc ss)
   Priority: Minor  (was: Major)
Component/s: Documentation
 Issue Type: Improvement  (was: Bug)

> Documentation does not specify how to set Kafka consumer cache capacity for SS
> --
>
> Key: SPARK-25466
> URL: https://issues.apache.org/jira/browse/SPARK-25466
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Patrick McGloin
>Priority: Minor
>
> When hitting this warning with SS:
> 19-09-2018 12:05:27 WARN  CachedKafkaConsumer:66 - KafkaConsumer cache 
> hitting max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30)
> If you Google you get to this page:
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> Which is for Spark Streaming and says to use this config item to adjust the 
> capacity: "spark.streaming.kafka.consumer.cache.maxCapacity".
> This is a bit confusing as SS uses a different config item: 
> "spark.sql.kafkaConsumerCache.capacity"
> Perhaps the SS Kafka documentation should talk about the consumer cache 
> capacity?  Perhaps here?
> https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
> Or perhaps the warning message should reference the config item.  E.g
> 19-09-2018 12:05:27 WARN  CachedKafkaConsumer:66 - KafkaConsumer cache 
> hitting max capacity of 64, removing consumer for 
> CacheKey(spark-kafka-source-e06c9676-32c6-49c4-80a9-2d0ac4590609--694285871-executor,MyKafkaTopic-30).
>   *The cache size can be adjusted with the setting 
> "spark.sql.kafkaConsumerCache.capacity".*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782841#comment-16782841
 ] 

Sean Owen commented on SPARK-25544:
---

I think this is a reasonable change -- you can test it in a PR if you're up for 
it

> Slow/failed convergence in Spark ML models due to internal predictor scaling
> 
>
> Key: SPARK-25544
> URL: https://issues.apache.org/jira/browse/SPARK-25544
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.2
> Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
>Reporter: Andrew Crosby
>Priority: Major
>
> The LinearRegression and LogisticRegression estimators in Spark ML can take a 
> large number of iterations to converge, or fail to converge altogether, when 
> trained using the l-bfgs method with standardization turned off.
> *Details:*
> LinearRegression and LogisticRegression standardize their input features by 
> default. In SPARK-8522 the option to disable standardization was added. This 
> is implemented internally by changing the effective strength of 
> regularization rather than disabling the feature scaling. Mathematically, 
> both changing the effective regularizaiton strength, and disabling feature 
> scaling should give the same solution, but they can have very different 
> convergence properties.
> The normal justification given for scaling features is that it ensures that all 
> covariances are O(1) and should improve numerical convergence, but this 
> argument does not account for the regularization term. This doesn't cause any 
> issues if standardization is set to true, since all features will have an 
> O(1) regularization strength. But it does cause issues when standardization 
> is set to false, since the effective regularization strength of feature i is 
> now O(1/ sigma_i^2) where sigma_i is the standard deviation of the feature. 
> This means that predictors with small standard deviations (which can occur 
> legitimately e.g. via one hot encoding) will have very large effective 
> regularization strengths and consequently lead to very large gradients and 
> thus poor convergence in the solver.
> *Example code to recreate:*
> To demonstrate just how bad these convergence issues can be, here is a very 
> simple test case which builds a linear regression model with a categorical 
> feature, a numerical feature and their interaction. When fed the specified 
> training data, this model will fail to converge before it hits the maximum 
> iteration limit. In this case, it is the interaction between category "2" and 
> the numeric feature that leads to a feature with a small standard deviation.
> Training data:
> ||category||numericFeature||label||
> |1|1.0|0.5|
> |1|0.5|1.0|
> |2|0.01|2.0|
>  
> {code:java}
> val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
> 2.0)).toDF("category", "numericFeature", "label")
> val indexer = new StringIndexer().setInputCol("category") 
> .setOutputCol("categoryIndex")
> val encoder = new 
> OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
> val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
> "numericFeature")).setOutputCol("interaction")
> val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
> "interaction")).setOutputCol("features")
> val model = new 
> LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
> val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
> assembler, model))
> val pipelineModel  = pipeline.fit(df)
> val numIterations = 
> pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
>  *Possible fix:*
> These convergence issues can be fixed by turning off feature scaling when 
> standardization is set to false rather than using an effective regularization 
> strength. This can be hacked into LinearRegression.scala by simply replacing 
> line 423
> {code:java}
> val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
> {code}
> with
> {code:java}
> val featuresStd = if ($(standardization)) 
> featuresSummarizer.variance.toArray.map(math.sqrt) else 
> featuresSummarizer.variance.toArray.map(x => 1.0)
> {code}
> Rerunning the above test code with that hack in place, will lead to 
> convergence after just 4 iterations instead of hitting the max iterations 
> limit!
> *Impact:*
> I can't speak for other people, but I've personally encountered these 
> convergence issues several times when building production scale Spark ML 
> models, and have resorted to writing my own implementation of 
> LinearRegression with the above hack 

[jira] [Commented] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.

2019-03-03 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782840#comment-16782840
 ] 

Sujith commented on SPARK-27036:


The problem area seems to be BroadcastExchangeExec in the driver, where a job is 
fired as part of a Future and the collected data is then broadcast.

The main problem is that the system submits the job and its stages/tasks through 
the DAGScheduler, where the scheduler thread schedules the respective events. In 
BroadcastExchangeExec, when the future times out, the corresponding exception is 
thrown, but the jobs/tasks already scheduled by the DAGScheduler as part of the 
action called inside the future are not cancelled. I think we should cancel the 
respective job so it does not keep running in the background after the future 
timeout; this would terminate the job promptly when the TimeoutException happens 
and also save the resources that would otherwise be used after the driver has 
already thrown the timeout exception. A rough sketch of the idea is below.

I want to attempt to handle this issue; any comments or suggestions are welcome.

cc [~sro...@scient.com] [~b...@cloudera.com] [~hvanhovell]
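A rough, illustrative sketch of the proposal (not the actual BroadcastExchangeExec code; the job, group id, and timeout handling are stand-ins): tag the work submitted inside the future with a job group, and cancel that group when the await times out.

{code:scala}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

def collectWithCancellation(spark: SparkSession, timeout: Duration): Array[java.lang.Long] = {
  val groupId = "broadcast-relation-demo"  // illustrative group id
  val job = Future {
    // Everything submitted from this thread is tagged with the group id.
    spark.sparkContext.setJobGroup(groupId, "collect for broadcast", interruptOnCancel = true)
    spark.range(500000000L).collect()
  }
  try Await.result(job, timeout)
  catch {
    case e: TimeoutException =>
      // The await gave up, but the job keeps running; cancel it explicitly.
      spark.sparkContext.cancelJobGroup(groupId)
      throw e
  }
}
{code}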

> Even Broadcast thread is timed out, BroadCast Job is not aborted.
> -
>
> Key: SPARK-27036
> URL: https://issues.apache.org/jira/browse/SPARK-27036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2019-03-04-00-38-52-401.png, 
> image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png
>
>
> While a broadcast table job is executing, if a broadcast timeout 
> (spark.sql.broadcastTimeout) happens, the broadcast job still continues to 
> completion, whereas it should abort on the broadcast timeout.
> An exception is thrown in the console, but the Spark job still continues.
>  
> !image-2019-03-04-00-39-38-779.png!
> !image-2019-03-04-00-39-12-210.png!
>  
>  wait for some time
> !image-2019-03-04-00-38-52-401.png!
> !image-2019-03-04-00-34-47-884.png!
>  
> How to Reproduce Issue
> Option1 using SQL:- 
>  create Table t1(Big Table,1M Records)
>  val rdd1=spark.sparkContext.parallelize(1 to 100,100).map(x=> 
> ("name_"+x,x%3,x))
>  val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
>  df.write.csv("D:/data/par1/t4");
>  spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')");
> create Table t2(Small Table,100K records)
>  val rdd1=spark.sparkContext.parallelize(1 to 10,100).map(x=> 
> ("name_"+x,x%3,x))
>  val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
>  df.write.csv("D:/data/par1/t4");
>  spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')");
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
>  spark.sql("set spark.sql.broadcastTimeout=2").show(false)
>  Run Below Query 
>  spark.sql("create table s using parquet as select t1.* from csv_2 as 
> t1,csv_1 as t2 where t1._c3=t2._c3")
> Option 2:- Use External DataSource and Add Delay in the #buildScan. and use 
> datasource for query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25550) [Spark Job History] Environment Page of Spark Job History UI showing wrong value for spark.ui.retainedJobs

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25550.
---
Resolution: Won't Fix

> [Spark Job History] Environment Page of Spark Job History UI  showing wrong 
> value for spark.ui.retainedJobs
> ---
>
> Key: SPARK-25550
> URL: https://issues.apache.org/jira/browse/SPARK-25550
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Screenshot from 2018-09-27 12-17-44.png, Screenshot from 
> 2018-09-27 12-19-05.png
>
>
> 1. Set spark.ui.retainedJobs=200 (in spark-default.conf of Job History)
> 2. Launch spark-shell --master yarn --conf spark.ui.retainedJobs=100
> 3. Execute the below commands to create 1000 jobs:
> val rdd = sc.parallelize(1 to 5, 5)
> for(i <- 1 to 1000){
>  rdd.count
> }
> 4. Launch the Job History page
> 5. Click the corresponding spark-shell Application ID link in Job History to 
> open the Job page for the application
> 6. Go to the Environment page and check the value of spark.ui.retainedJobs. It 
> shows spark.ui.retainedJobs = 100
> 7. Check the total number of jobs in the summary on the Job page. It shows 
> only 200 completed, which is the value configured in the spark-default.conf 
> file of Job History
> Actual Result: 
> The Environment page shows the spark.ui.retainedJobs value set from the 
> command prompt, but the Job page displays the number of jobs based on the 
> value set in spark-default.conf of Job History
> Expected Result:
> The value of spark.ui.retainedJobs on the Environment page should be the value 
> configured in spark-default.conf of Job History



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25733.
---
Resolution: Cannot Reproduce

I can't reproduce this; with a simple local test (and the Spark unit tests) 
toLocalIterator() works. You show that you end up with some other error, and it's 
not clear what the cause is, but from this it appears to be related to your code 
or data source.
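For reference, the kind of simple local check meant above (names are illustrative):

{code:scala}
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("toLocalIterator-check")
  .getOrCreate()

// toLocalIterator() returns a java.util.Iterator over the rows,
// fetching one partition at a time.
val it = spark.range(10).toDF("id").toLocalIterator().asScala
it.take(3).foreach(println)  // prints [0], [1], [2]

spark.stop()
{code}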

> The method toLocalIterator() with dataframe doesn't work
> 
>
> Key: SPARK-25733
> URL: https://issues.apache.org/jira/browse/SPARK-25733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Spark in standalone mode, and 48 cores are available.
> spark-defaults.conf as blew:
> spark.pyshark.python /usr/bin/python3.6
> spark.driver.memory 4g
> spark.executor.memory 8g
>  
> other configurations are at default.
>Reporter: Bihui Jin
>Priority: Major
> Attachments: report_dataset.zip.001, report_dataset.zip.002
>
>
> {color:#FF}The dataset which I used attached.{color}
>  
> First I loaded a dataframe from local disk:
> df = spark.read.load('report_dataset')
> there are about 200 partitions stored in s3, and the max size of partitions 
> is 28.37MB.
>  
> after the data loaded, I executed "df.take(1)" to test the dataframe, and the 
> expected output was printed: 
> "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
>  sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 
> 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 
> 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 
> 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 
> 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 
> 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], 
> next_word=575, line_num=12)]" 
>  
> Then I tried to convert the dataframe to a local iterator and print one row 
> of the dataframe for testing, using the below code:
> for row in df.toLocalIterator():
>     print(row)
>     break
> {color:#ff}*But there is no output printed after that code 
> executed.*{color}
>  
> Then I executed "df.take(1)" and the below error was reported:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 985, in send_command
> response = connection.send_command(command)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:37735)
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 2963, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "", line 1, in 
> df.take(1)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 504, in take
> return self.limit(num).collect()
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 493, in limit
> jdf = self._jdf.limit(num)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
> get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o29.limit
> During handling of the above exception, another 

[jira] [Resolved] (SPARK-25562) The Spark add audit log

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25562.
---
Resolution: Invalid

> The Spark add audit log
> ---
>
> Key: SPARK-25562
> URL: https://issues.apache.org/jira/browse/SPARK-25562
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Affects Versions: 2.1.0
>Reporter: yinghua_zh
>Priority: Minor
>
>  
> At present, Spark does not record audit logs; audit logging could be added 
> for security reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25633) Performance Improvement for Drools Spark Jobs.

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25633.
---
Resolution: Invalid

I can't make out a specific issue here. JIRA isn't for tech support questions; 
maybe the mailing lists. But asking someone to optimize your code isn't likely 
to get much response.

> Performance Improvement for Drools Spark Jobs.
> --
>
> Key: SPARK-25633
> URL: https://issues.apache.org/jira/browse/SPARK-25633
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Koushik
>Priority: Major
> Attachments: RTTA Performance Issue.pptx
>
>
> We have the below region-wise compute instances in the performance 
> environment. When we reduce the compute instances, we face a performance 
> issue. We have already done code optimization.
>  
> |Region|Compute Instances on performance environment|
> |MWSW|6|
> |SE|6|
> |W|6|
> |Total|*18*|
>  
> For the above combination, 98% of the data is processed within 30 seconds, 
> but when we reduce the instances, performance degrades.
>  
> We can provide all additional details to the respective support team on request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.

2019-03-03 Thread Babulal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Babulal updated SPARK-27036:

Attachment: image-2019-03-04-00-39-38-779.png

> Even Broadcast thread is timed out, BroadCast Job is not aborted.
> -
>
> Key: SPARK-27036
> URL: https://issues.apache.org/jira/browse/SPARK-27036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2019-03-04-00-38-52-401.png, 
> image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png
>
>
> While a broadcast table job is executing, if a broadcast timeout 
> (spark.sql.broadcastTimeout) happens, the broadcast job still continues to 
> completion, whereas it should abort on the broadcast timeout.
> An exception is thrown in the console, but the Spark job still continues.
> !image-2019-03-04-00-31-34-364.png!
>  Spark UI !image-2019-03-04-00-32-22-663.png!
>  wait for some time
> !image-2019-03-04-00-34-47-884.png!
>  
> How to Reproduce Issue
> Option1 using SQL:- 
> create Table t1(Big Table,1M Records)
> val rdd1=spark.sparkContext.parallelize(1 to 100,100).map(x=> 
> ("name_"+x,x%3,x))
> val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
> df.write.csv("D:/data/par1/t4");
> spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')");
> create Table t2(Small Table,100K records)
> val rdd1=spark.sparkContext.parallelize(1 to 10,100).map(x=> 
> ("name_"+x,x%3,x))
> val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
> df.write.csv("D:/data/par1/t4");
> spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')");
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
> spark.sql("set spark.sql.broadcastTimeout=2").show(false)
> Run Below Query 
> spark.sql("create table s using parquet as select t1.* from csv_2 as t1,csv_1 
> as t2 where t1._c3=t2._c3")
> Option 2:- Use External DataSource and Add Delay in the #buildScan. and use 
> datasource for query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-03 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782821#comment-16782821
 ] 

Martin Loncaric commented on SPARK-26555:
-

Yes - when I take away any randomness and use the same dataset every time (say, 
with Some(something) for each optional value), I still get this issue.

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.

2019-03-03 Thread Babulal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Babulal updated SPARK-27036:

Description: 
While a broadcast table job is executing, if a broadcast timeout 
(spark.sql.broadcastTimeout) happens, the broadcast job still continues to 
completion, whereas it should abort on the broadcast timeout.

An exception is thrown in the console, but the Spark job still continues.

 

!image-2019-03-04-00-39-38-779.png!

!image-2019-03-04-00-39-12-210.png!

 

 wait for some time

!image-2019-03-04-00-38-52-401.png!

!image-2019-03-04-00-34-47-884.png!

 

How to Reproduce Issue

Option1 using SQL:- 
 create Table t1(Big Table,1M Records)
 val rdd1=spark.sparkContext.parallelize(1 to 100,100).map(x=> 
("name_"+x,x%3,x))
 val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as c14","_1 
as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as c20","_1 as 
c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as c26","_1 as 
c27","_1 as c28","_1 as c29","_1 as c30")
 df.write.csv("D:/data/par1/t4");
 spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')");

create Table t2(Small Table,100K records)
 val rdd1=spark.sparkContext.parallelize(1 to 10,100).map(x=> 
("name_"+x,x%3,x))
 val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as c14","_1 
as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as c20","_1 as 
c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as c26","_1 as 
c27","_1 as c28","_1 as c29","_1 as c30")
 df.write.csv("D:/data/par1/t4");
 spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')");

spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
 spark.sql("set spark.sql.broadcastTimeout=2").show(false)
 Run Below Query 
 spark.sql("create table s using parquet as select t1.* from csv_2 as t1,csv_1 
as t2 where t1._c3=t2._c3")

Option 2:- Use External DataSource and Add Delay in the #buildScan. and use 
datasource for query.

  was:
During broadcast table job is execution if broadcast timeout 
(spark.sql.broadcastTimeout) happens ,broadcast Job still continue till 
completion whereas it should abort on broadcast timeout.

Exception is thrown in console  but Spark Job is still continue.

!image-2019-03-04-00-31-34-364.png!

 Spark UI !image-2019-03-04-00-32-22-663.png!

 wait for some time

!image-2019-03-04-00-34-47-884.png!

 

How to Reproduce Issue

Option1 using SQL:- 
create Table t1(Big Table,1M Records)
val rdd1=spark.sparkContext.parallelize(1 to 100,100).map(x=> 
("name_"+x,x%3,x))
val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as c1","_1 
as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as c8","_1 as 
c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as c14","_1 as c15","_1 
as c16","_1 as c17","_1 as c18","_1 as c19","_1 as c20","_1 as c21","_1 as 
c22","_1 as c23","_1 as c24","_1 as c25","_1 as c26","_1 as c27","_1 as 
c28","_1 as c29","_1 as c30")
df.write.csv("D:/data/par1/t4");
spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')");

create Table t2(Small Table,100K records)
val rdd1=spark.sparkContext.parallelize(1 to 10,100).map(x=> 
("name_"+x,x%3,x))
val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as c1","_1 
as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as c8","_1 as 
c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as c14","_1 as c15","_1 
as c16","_1 as c17","_1 as c18","_1 as c19","_1 as c20","_1 as c21","_1 as 
c22","_1 as c23","_1 as c24","_1 as c25","_1 as c26","_1 as c27","_1 as 
c28","_1 as c29","_1 as c30")
df.write.csv("D:/data/par1/t4");
spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')");

spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
spark.sql("set spark.sql.broadcastTimeout=2").show(false)
Run Below Query 
spark.sql("create table s using parquet as select t1.* from csv_2 as t1,csv_1 
as t2 where t1._c3=t2._c3")

Option 2:- Use External DataSource and Add Delay in the #buildScan. and use 
datasource for query.


> Even Broadcast thread is timed out, BroadCast Job is not aborted.
> -
>
> Key: SPARK-27036
> URL: https://issues.apache.org/jira/browse/SPARK-27036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2019-03-04-00-38-52-401.png, 
> 

[jira] [Updated] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.

2019-03-03 Thread Babulal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Babulal updated SPARK-27036:

Attachment: image-2019-03-04-00-39-12-210.png

> Even Broadcast thread is timed out, BroadCast Job is not aborted.
> -
>
> Key: SPARK-27036
> URL: https://issues.apache.org/jira/browse/SPARK-27036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2019-03-04-00-38-52-401.png, 
> image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png
>
>
> While a broadcast table job is executing, if a broadcast timeout 
> (spark.sql.broadcastTimeout) happens, the broadcast job still continues to 
> completion, whereas it should abort on the broadcast timeout.
> An exception is thrown in the console, but the Spark job still continues.
> !image-2019-03-04-00-31-34-364.png!
>  Spark UI !image-2019-03-04-00-32-22-663.png!
>  wait for some time
> !image-2019-03-04-00-34-47-884.png!
>  
> How to reproduce the issue
> Option 1, using SQL:
> Create table t1 (big table, 1M records):
> val rdd1 = spark.sparkContext.parallelize(1 to 100, 100).map(x => ("name_" + x, x % 3, x))
> val df = rdd1.toDF.selectExpr(
>   Seq("_1 as name", "_2 as age", "_3 as sal") ++ (1 to 30).map(i => s"_1 as c$i"): _*)
> df.write.csv("D:/data/par1/t4")
> spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')")
> Create table t2 (small table, 100K records):
> val rdd1 = spark.sparkContext.parallelize(1 to 10, 100).map(x => ("name_" + x, x % 3, x))
> val df = rdd1.toDF.selectExpr(
>   Seq("_1 as name", "_2 as age", "_3 as sal") ++ (1 to 30).map(i => s"_1 as c$i"): _*)
> df.write.csv("D:/data/par1/t5")
> spark.sql("create table csv_1 using csv options('path'='D:/data/par1/t5')")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
> spark.sql("set spark.sql.broadcastTimeout=2").show(false)
> Run the query below:
> spark.sql("create table s using parquet as select t1.* from csv_2 as t1, csv_1 as t2 where t1._c3=t2._c3")
> Option 2: use an external DataSource, add a delay in #buildScan, and query through that datasource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.

2019-03-03 Thread Babulal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Babulal updated SPARK-27036:

Attachment: image-2019-03-04-00-38-52-401.png

> Even Broadcast thread is timed out, BroadCast Job is not aborted.
> -
>
> Key: SPARK-27036
> URL: https://issues.apache.org/jira/browse/SPARK-27036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2019-03-04-00-38-52-401.png
>
>
> If the broadcast timeout (spark.sql.broadcastTimeout) fires while the broadcast
> table is being built, the broadcast job still runs to completion, whereas it
> should be aborted on timeout.
> The exception is shown in the console, but the Spark job keeps running.
> !image-2019-03-04-00-31-34-364.png!
>  Spark UI !image-2019-03-04-00-32-22-663.png!
>  wait for some time
> !image-2019-03-04-00-34-47-884.png!
>  
> How to reproduce the issue
> Option 1, using SQL:
> Create table t1 (big table, 1M records):
> val rdd1 = spark.sparkContext.parallelize(1 to 100, 100).map(x => ("name_" + x, x % 3, x))
> val df = rdd1.toDF.selectExpr(
>   Seq("_1 as name", "_2 as age", "_3 as sal") ++ (1 to 30).map(i => s"_1 as c$i"): _*)
> df.write.csv("D:/data/par1/t4")
> spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')")
> Create table t2 (small table, 100K records):
> val rdd1 = spark.sparkContext.parallelize(1 to 10, 100).map(x => ("name_" + x, x % 3, x))
> val df = rdd1.toDF.selectExpr(
>   Seq("_1 as name", "_2 as age", "_3 as sal") ++ (1 to 30).map(i => s"_1 as c$i"): _*)
> df.write.csv("D:/data/par1/t5")
> spark.sql("create table csv_1 using csv options('path'='D:/data/par1/t5')")
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
> spark.sql("set spark.sql.broadcastTimeout=2").show(false)
> Run the query below:
> spark.sql("create table s using parquet as select t1.* from csv_2 as t1, csv_1 as t2 where t1._c3=t2._c3")
> Option 2: use an external DataSource, add a delay in #buildScan, and query through that datasource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.

2019-03-03 Thread Babulal (JIRA)
Babulal created SPARK-27036:
---

 Summary: Even Broadcast thread is timed out, BroadCast Job is not 
aborted.
 Key: SPARK-27036
 URL: https://issues.apache.org/jira/browse/SPARK-27036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: Babulal
 Attachments: image-2019-03-04-00-38-52-401.png

If the broadcast timeout (spark.sql.broadcastTimeout) fires while the broadcast
table is being built, the broadcast job still runs to completion, whereas it
should be aborted on timeout.

The exception is shown in the console, but the Spark job keeps running.

!image-2019-03-04-00-31-34-364.png!

 Spark UI !image-2019-03-04-00-32-22-663.png!

 wait for some time

!image-2019-03-04-00-34-47-884.png!

 

How to reproduce the issue

Option 1, using SQL:
Create table t1 (big table, 1M records):
val rdd1 = spark.sparkContext.parallelize(1 to 100, 100).map(x => ("name_" + x, x % 3, x))
val df = rdd1.toDF.selectExpr(
  Seq("_1 as name", "_2 as age", "_3 as sal") ++ (1 to 30).map(i => s"_1 as c$i"): _*)
df.write.csv("D:/data/par1/t4")
spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')")

Create table t2 (small table, 100K records):
val rdd1 = spark.sparkContext.parallelize(1 to 10, 100).map(x => ("name_" + x, x % 3, x))
val df = rdd1.toDF.selectExpr(
  Seq("_1 as name", "_2 as age", "_3 as sal") ++ (1 to 30).map(i => s"_1 as c$i"): _*)
df.write.csv("D:/data/par1/t5")
spark.sql("create table csv_1 using csv options('path'='D:/data/par1/t5')")

spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
spark.sql("set spark.sql.broadcastTimeout=2").show(false)
Run the query below:
spark.sql("create table s using parquet as select t1.* from csv_2 as t1, csv_1 as t2 where t1._c3=t2._c3")

Option 2: use an external DataSource, add a delay in #buildScan, and query through that datasource (see the sketch below).
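
For Option 2, a minimal sketch of what such a slow external data source could look like (class and package names are hypothetical, and the fixed Thread.sleep is only there so the scan outlives spark.sql.broadcastTimeout=2):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical data source whose scan is slow enough to exceed the broadcast timeout.
class SlowRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("_c3", IntegerType)))

  override def buildScan(): RDD[Row] = {
    Thread.sleep(10000) // well beyond spark.sql.broadcastTimeout=2 (seconds)
    sqlContext.sparkContext.parallelize(1 to 100).map(i => Row(i))
  }
}

class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new SlowRelation(sqlContext)
}
{code}

Joining against a small table backed by this relation (e.g. spark.read.format("com.example.slowsource").load(), name assumed) should then hit the broadcast timeout while the underlying scan job keeps running.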



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25853) Parts of spark components (DAG Visualization and executors page) not available in Internet Explorer

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25853.
---
Resolution: Won't Fix

It looks like recent Microsoft browsers, i.e. Edge, do support this.
Realistically I don't think we'd change the JS to work around this for old IE 
versions. It looks like Edge has supported this for 3.5 years: 
https://wpdev.uservoice.com/forums/257854-microsoft-edge-developer/suggestions/7314050-support-document-baseuri-property

> Parts of spark components (DAG Visualization and executors page) not available 
> in Internet Explorer
> --
>
> Key: SPARK-25853
> URL: https://issues.apache.org/jira/browse/SPARK-25853
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0, 2.3.2
>Reporter: aastha
>Priority: Major
> Attachments: dag_error_ie.png, dag_not_rendered_ie.png, 
> dag_on_chrome.png, execuotrs_not_rendered_ie.png, executors_error_ie.png, 
> executors_on_chrome.png
>
>
> Spark UI has some limitations when working with Internet Explorer. The DAG 
> component as well as the Executors page does not render; both work on Firefox 
> and Chrome. I have tested on recent Internet Explorer 11.483.15063.0. Since it 
> works on Chrome and Firefox, their versions should not matter.
> For the Executors page, the root cause is that the document.baseURI property is 
> undefined in Internet Explorer. When I debug by providing the property 
> myself, it shows up fine.
> For the DAG component, developer tools haven't helped. 
> Attaching screenshots of the Chrome and IE UI and the debug console messages. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-03 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782821#comment-16782821
 ] 

Martin Loncaric edited comment on SPARK-26555 at 3/3/19 6:56 PM:
-

Yes - when I take away any randomness and use the same dataset every time (say, 
with Some(something) for each optional value), I still get this issue.

I've run this code in a couple different environments and obtained the same 
result, so you should be able to verify this as well.


was (Author: mwlon):
Yes - when I take away any randomness and use the same dataset every time (say, 
with Some(something) for each optional value), I still get this issue.

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26984) Incompatibility between Spark releases - Some(null)

2019-03-03 Thread Gerard Alexander (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782815#comment-16782815
 ] 

Gerard Alexander commented on SPARK-26984:
--

Sean Owen: you are right, of course it should be _None_, but I suspect there is 
code out there that assumes Some(null) is possible.
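
For reference, a minimal sketch of the usual migration, assuming the intent is just "nullable value": Option(...) lifts null to None, so the 2.4 encoder never sees Some(null).

{code:scala}
// spark-shell, with spark.implicits._ in scope
val df = Seq(
  (1, Option("a"), Option(1)),
  (2, Option(null: String), Option(2)),   // Option(null) == None, so no NPE in the encoder
  (3, Option("c"), Option(3)),
  (4, None: Option[String], None: Option[Int])
).toDF("c1", "c2", "c3")
{code}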

> Incompatibility between Spark releases - Some(null) 
> 
>
> Key: SPARK-26984
> URL: https://issues.apache.org/jira/browse/SPARK-26984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Linux CentOS, Databricks.
>Reporter: Gerard Alexander
>Priority: Minor
>  Labels: newbie
>
> Please refer to 
> [https://stackoverflow.com/questions/54851205/why-does-somenull-throw-nullpointerexception-in-spark-2-4-but-worked-in-2-2/54861152#54861152.]
> NB: Not sure of priority being correct - no doubt one will evaluate.
> It is noted that the following:
> {code}
> val df = Seq(
>   (1, Some("a"), Some(1)),
>   (2, Some(null), Some(2)),
>   (3, Some("c"), Some(3)),
>   (4, None, None)).toDF("c1", "c2", "c3")}
> {code}
> In Spark 2.2.1 (on mapr) the {{Some(null)}} works fine, in Spark 2.4.0 on 
> Databricks an error ensues.
> {code}
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._1 AS _1#6 staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> unwrapoption(ObjectType(class java.lang.String), 
> assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) 
> AS _2#7 unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._3) AS _3#8 at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
>  at 
> org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233) at 
> scala.collection.immutable.List.foreach(List.scala:388) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:233) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:226) at 
> scala.collection.immutable.List.map(List.scala:294) at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472) at 
> org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
>  ... 57 elided Caused by: java.lang.NullPointerException at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
>  ... 66 more
> {code}
> You can argue it is solvable otherwise, but there may well be an existing 
> code base that could be affected.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23134) WebUI is showing the cache table details even after cache idle timeout

2019-03-03 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid resolved SPARK-23134.

Resolution: Duplicate

> WebUI is showing the cache table details even after cache idle timeout
> --
>
> Key: SPARK-23134
> URL: https://issues.apache.org/jira/browse/SPARK-23134
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.2.1
> Environment:  Run Cache command with below configuration to cache the 
> RDD blocks
>   spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
>   spark.dynamicAllocation.executorIdleTimeout=60s
>   spark.dynamicAllocation.enabled=true
>   spark.dynamicAllocation.minExecutors=0
>   spark.dynamicAllocation.maxExecutors=8
>  
>  
>  
>Reporter: shahid
>Priority: Major
>
> After cachedExecutorIdleTimeout, WebUI shows the cached partition details in 
> the storage tab. It should be the same scenario as in the case of uncache 
> table, where the storage tab of the web UI shows "RDD not found".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-23134) WebUI is showing the cache table details even after cache idle timeout

2019-03-03 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-23134:
---
Comment: was deleted

(was: Will resolve once the Jira SPARK-27012 merged)

> WebUI is showing the cache table details even after cache idle timeout
> --
>
> Key: SPARK-23134
> URL: https://issues.apache.org/jira/browse/SPARK-23134
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.2.1
> Environment:  Run Cache command with below configuration to cache the 
> RDD blocks
>   spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
>   spark.dynamicAllocation.executorIdleTimeout=60s
>   spark.dynamicAllocation.enabled=true
>   spark.dynamicAllocation.minExecutors=0
>   spark.dynamicAllocation.maxExecutors=8
>  
>  
>  
>Reporter: shahid
>Priority: Major
>
> After cachedExecutorIdleTimeout, WebUI shows the cached partition details in 
> the storage tab. It should be the same scenario as in the case of uncache 
> table, where the storage tab of the web UI shows "RDD not found".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782810#comment-16782810
 ] 

Sean Owen commented on SPARK-25863:
---

Returning 0 seems like the correct thing to do, locally. [~maropu] does that 
sound right? This particular line was added in 
https://github.com/apache/spark/commit/4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6

It seems like it could happen if you had errors computing stats. Do you see 
warnings like "Error calculating stats of compiled class"?

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cache, catalyst, code-generation
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at 

[jira] [Reopened] (SPARK-23134) WebUI is showing the cache table details even after cache idle timeout

2019-03-03 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid reopened SPARK-23134:


> WebUI is showing the cache table details even after cache idle timeout
> --
>
> Key: SPARK-23134
> URL: https://issues.apache.org/jira/browse/SPARK-23134
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.2.1
> Environment:  Run Cache command with below configuration to cache the 
> RDD blocks
>   spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
>   spark.dynamicAllocation.executorIdleTimeout=60s
>   spark.dynamicAllocation.enabled=true
>   spark.dynamicAllocation.minExecutors=0
>   spark.dynamicAllocation.maxExecutors=8
>  
>  
>  
>Reporter: shahid
>Priority: Major
>
> After cachedExecutorIdleTimeout, WebUI shows the cached partition details in 
> the storage tab. It should be the same scenario as in the case of uncache 
> table, where the storage tab of the web UI shows "RDD not found".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23134) WebUI is showing the cache table details even after cache idle timeout

2019-03-03 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid resolved SPARK-23134.

Resolution: Duplicate

> WebUI is showing the cache table details even after cache idle timeout
> --
>
> Key: SPARK-23134
> URL: https://issues.apache.org/jira/browse/SPARK-23134
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.2.1
> Environment:  Run Cache command with below configuration to cache the 
> RDD blocks
>   spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
>   spark.dynamicAllocation.executorIdleTimeout=60s
>   spark.dynamicAllocation.enabled=true
>   spark.dynamicAllocation.minExecutors=0
>   spark.dynamicAllocation.maxExecutors=8
>  
>  
>  
>Reporter: shahid
>Priority: Major
>
> After cachedExecutorIdleTimeout, WebUI shows the cached partition details in 
> the storage tab. It should be the same scenario as in the case of uncache 
> table, where the storage tab of the web UI shows "RDD not found".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25982) Dataframe write is non blocking in fair scheduling mode

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782808#comment-16782808
 ] 

Sean Owen commented on SPARK-25982:
---

Can you clarify with a more complete example? What is running in parallel, and 
which next stage starts executing?
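
For context, a minimal sketch of the setup the description below seems to refer to (paths are illustrative; run in spark-shell with spark.scheduler.mode=FAIR). Await.result blocks only the calling thread, so jobs submitted from other threads can still start while the write runs:

{code:scala}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext = ExecutionContext.global

// The write, wrapped in a Future and awaited by the calling thread.
val write = Future { spark.range(0, 10000000).write.mode("overwrite").parquet("/tmp/fair-write") }

// A job submitted from another thread while the write is still in progress.
val other = Future { spark.range(0, 10000000).count() }

Await.result(write, Duration.Inf)
Await.result(other, Duration.Inf)
{code}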

> Dataframe write is non blocking in fair scheduling mode
> ---
>
> Key: SPARK-25982
> URL: https://issues.apache.org/jira/browse/SPARK-25982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Ramandeep Singh
>Priority: Major
>
> Hi,
> I have noticed that the expected blocking behavior of the dataframe write 
> operation is not honored in fair scheduling mode.
> Ideally, while a dataframe write is occurring and a future is blocking on 
> AwaitResult, no other job should be started, but this is not the case: other 
> jobs are started while the partitions are being written.  
>  
> Regards,
> Ramandeep Singh
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26555) Thread safety issue causes createDataset to fail with misleading errors

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782806#comment-16782806
 ] 

Sean Owen commented on SPARK-26555:
---

To be clear, is there a data set that works only when not run in this 
concurrent loop? What I'm reading here is simply that you generate datasets 
sometimes that (correctly, I believe) can't have their schema inferred.

> Thread safety issue causes createDataset to fail with misleading errors
> ---
>
> Key: SPARK-26555
> URL: https://issues.apache.org/jira/browse/SPARK-26555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> This can be replicated (~2% of the time) with
> {code:scala}
> import java.sql.Timestamp
> import java.util.concurrent.{Executors, Future}
> import org.apache.spark.sql.SparkSession
> import scala.collection.mutable.ListBuffer
> import scala.concurrent.ExecutionContext
> import scala.util.Random
> object Main {
>   def main(args: Array[String]): Unit = {
> val sparkSession = SparkSession.builder
>   .getOrCreate()
> import sparkSession.implicits._
> val executor = Executors.newFixedThreadPool(1)
> try {
>   implicit val xc: ExecutionContext = 
> ExecutionContext.fromExecutorService(executor)
>   val futures = new ListBuffer[Future[_]]()
>   for (i <- 1 to 3) {
> futures += executor.submit(new Runnable {
>   override def run(): Unit = {
> val d = if (Random.nextInt(2) == 0) Some("d value") else None
> val e = if (Random.nextInt(2) == 0) Some(5.0) else None
> val f = if (Random.nextInt(2) == 0) Some(6.0) else None
> println("DEBUG", d, e, f)
> sparkSession.createDataset(Seq(
>   MyClass(new Timestamp(1L), "b", "c", d, e, f)
> ))
>   }
> })
>   }
>   futures.foreach(_.get())
> } finally {
>   println("SHUTDOWN")
>   executor.shutdown()
>   sparkSession.stop()
> }
>   }
>   case class MyClass(
> a: Timestamp,
> b: String,
> c: String,
> d: Option[String],
> e: Option[Double],
> f: Option[Double]
>   )
> }
> {code}
> So it will usually come up during
> {code:bash}
> for i in $(seq 1 200); do
>   echo $i
>   spark-submit --master local[4] target/scala-2.11/spark-test_2.11-0.1.jar
> done
> {code}
> causing a variety of possible errors, such as
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: scala.MatchError: scala.Option[String] (of class 
> scala.reflect.internal.Types$ClassArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:210){code}
> or
> {code}Exception in thread "main" java.util.concurrent.ExecutionException: 
> java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> Caused by: java.lang.UnsupportedOperationException: Schema for type 
> scala.Option[scala.Double] is not supported
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23134) WebUI is showing the cache table details even after cache idle timeout

2019-03-03 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid reopened SPARK-23134:


Will resolve once the Jira SPARK-27012 merged

> WebUI is showing the cache table details even after cache idle timeout
> --
>
> Key: SPARK-23134
> URL: https://issues.apache.org/jira/browse/SPARK-23134
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.2.1
> Environment:  Run Cache command with below configuration to cache the 
> RDD blocks
>   spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
>   spark.dynamicAllocation.executorIdleTimeout=60s
>   spark.dynamicAllocation.enabled=true
>   spark.dynamicAllocation.minExecutors=0
>   spark.dynamicAllocation.maxExecutors=8
>  
>  
>  
>Reporter: shahid
>Priority: Major
>
> After cachedExecutorIdleTimeout, WebUI shows the cached partition details in 
> the storage tab. It should be the same scenario as in the case of uncache 
> table, where the storage tab of the web UI shows "RDD not found".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23134) WebUI is showing the cache table details even after cache idle timeout

2019-03-03 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid resolved SPARK-23134.

Resolution: Duplicate

> WebUI is showing the cache table details even after cache idle timeout
> --
>
> Key: SPARK-23134
> URL: https://issues.apache.org/jira/browse/SPARK-23134
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.2.1
> Environment:  Run Cache command with below configuration to cache the 
> RDD blocks
>   spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
>   spark.dynamicAllocation.executorIdleTimeout=60s
>   spark.dynamicAllocation.enabled=true
>   spark.dynamicAllocation.minExecutors=0
>   spark.dynamicAllocation.maxExecutors=8
>  
>  
>  
>Reporter: shahid
>Priority: Major
>
> After cachedExecutorIdleTimeout, WebUI shows the cached partition details in 
> the storage tab. It should be the same scenario as in the case of uncache 
> table, where the storage tab of the web UI shows "RDD not found".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782803#comment-16782803
 ] 

Sean Owen commented on SPARK-26881:
---

[~gagafunctor] would you like to open a pull request? I think the heuristic is 
valuable.
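
For reference, a hedged sketch of the depth heuristic described in the issue below, solved for the tree depth (names are illustrative and the eventual fix may differ):

{code:scala}
// Smallest depth d (at least the current default of 2) such that
// numPartitions^(1/d) * vectorSizeBytes <= maxDriverResultBytes.
def aggregationDepth(numPartitions: Int, vectorSizeBytes: Long, maxDriverResultBytes: Long): Int = {
  val ratio = maxDriverResultBytes.toDouble / vectorSizeBytes
  require(ratio > 1.0, "a single dense vector already exceeds spark.driver.maxResultSize")
  math.max(2, math.ceil(math.log(numPartitions) / math.log(ratio)).toInt)
}
{code}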

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Priority: Minor
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k 
> columns).
> Computing the Gramian of a big RowMatrix reproduces the issue.
>  
> The problem arises in the treeAggregate phase of the gramian matrix 
> computation: results sent to driver are enormous.
> A potential solution to this could be to replace the hard coded depth (2) of 
> the tree aggregation by a heuristic computed based on the number of 
> partitions, driver max result size, and memory size of the dense vectors that 
> are being aggregated, cf below for more detail:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community opinion about such a fix to know if it's worth investing 
> my time into a clean pull request.
>  
> Note that I only faced this issue with Spark 2.2, but I suspect it affects 
> later versions as well. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26906) Pyspark RDD Replication Potentially Not Working

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26906.
---
Resolution: Cannot Reproduce

> Pyspark RDD Replication Potentially Not Working
> ---
>
> Key: SPARK-26906
> URL: https://issues.apache.org/jira/browse/SPARK-26906
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 2.3.2
> Environment: I am using Google Cloud's Dataproc version [1.3.19-deb9 
> 2018/12/14|https://cloud.google.com/dataproc/docs/release-notes#december_14_2018]
>  (version 2.3.2 Spark and version 2.9.0 Hadoop) with version Debian 9, with 
> python version 3.7. PySpark shell is activated using pyspark --num-executors 
> = 100
>Reporter: Han Altae-Tran
>Priority: Minor
> Attachments: spark_ui.png
>
>
> Pyspark RDD replication doesn't seem to be functioning properly. Even with a 
> simple example, the UI reports only 1x replication, despite using the flag 
> for 2x replication
> {code:java}
> rdd = sc.range(10**9)
> mapped = rdd.map(lambda x: x)
> mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at 
> PythonRDD.scala:52
> mapped.count(){code}
>  
> Interestingly, if you catch the UI page at just the right time, you see that 
> it starts off 2x replicated, but ends up 1x replicated afterward. Perhaps the 
> RDD is replicated, but it is just the UI that is unable to register this.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26980) Kryo deserialization not working with KryoSerializable class

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26980.
---
Resolution: Not A Problem

Yes, my guess is it's because you're using Spark's Kryo and config, and it has 
class registration enabled. It tracks the class of a serialized object with a 
number, IIRC. If you're using Kryo directly within Spark, you'd have to take 
account of Spark's Kryo environment. 

Can you just let Spark serialize these things with Kryo, or let Kryo deal with 
its own encoding? It sounds like you're using class registration but then 
telling it the class of the thing you are deserializing.

I'm going to close this. To reopen, I'd say at least you need to demonstrate 
that your code works correctly outside Spark with Kryo class registration.
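
A hedged sketch of what reusing Spark's Kryo environment on the read side could look like (this assumes the byte[] in the Row was written by Spark's KryoSerializer under the same settings; row, the column index and StringSet are taken from the report):

{code:scala}
import java.nio.ByteBuffer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Build a serializer from the same Kryo settings Spark uses, instead of a bare
// new Kryo(), so registrations and framing match what Spark wrote.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[StringSet]))

val ser = new KryoSerializer(conf).newInstance()
val bytes = row.getAs[Array[Byte]](2)
val set = ser.deserialize[StringSet](ByteBuffer.wrap(bytes))
{code}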

> Kryo deserialization not working with KryoSerializable class
> 
>
> Key: SPARK-26980
> URL: https://issues.apache.org/jira/browse/SPARK-26980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local Spark v2.4.0
> Kotlin v1.3.21
>Reporter: Alexis Sarda-Espinosa
>Priority: Minor
>  Labels: kryo, serialization
>
> I'm trying to create an {{Aggregator}} that uses a custom container that 
> should be serialized with {{Kryo:}} 
> {code:java}
> class StringSet(other: Collection<String>) : HashSet<String>(other), 
> KryoSerializable {
> companion object {
> @JvmStatic
> private val serialVersionUID = 1L
> }
> constructor() : this(Collections.emptyList())
> override fun write(kryo: Kryo, output: Output) {
> output.writeInt(this.size)
> for (string in this) {
> output.writeString(string)
> }
> }
> override fun read(kryo: Kryo, input: Input) {
> val size = input.readInt()
> repeat(size) { this.add(input.readString()) }
> }
> }
> {code}
> However, if I look at the corresponding value in the {{Row}} after 
> aggregation (for example by using {{collectAsList()}}), I see a {{byte[]}}. 
> Interestingly, the first byte in that array seems to be some sort of noise, 
> and I can deserialize by doing something like this: 
> {code:java}
> val b = row.getAs<ByteArray>(2)
> val input = Input(b.copyOfRange(1, b.size)) // extra byte?
> val set = Kryo().readObject(input, StringSet::class.java)
> {code}
> Used configuration: 
> {code:java}
> SparkConf()
> .setAppName("Hello Spark with Kotlin")
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> .set("spark.kryo.registrationRequired", "true")
> .registerKryoClasses(arrayOf(StringSet::class.java))
> {code}
> [Sample repo with all the 
> code|https://github.com/asardaes/hello-spark-kotlin/tree/8e8a54fd81f0412507318149841c69bb17d8572c].
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26991) Investigate difference of `returnNullable` between ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782796#comment-16782796
 ] 

Sean Owen commented on SPARK-26991:
---

This is a philosophical question about what JIRA is for. JIRA was built for 
short-lived issues that have a resolution, because it's built around a workflow 
from open to closed. It means JIRAs ought to be actionable and specific, so 
that we know what closed means and can imagine getting it there.

Of course, there is a role for open-ended ideas and long-term goals. JIRA isn't 
great for that, but it's the only tool we have. JIRA has some tools for this 
type of thing, but we aren't able to control the workflow settings enough to 
use them. In theory that means, get another tool, not discourage open-ended 
ideas.

In practice, we have an overwhelming number of JIRAs that don't fit what it's 
good for: underspecified issues, sketches of major changes, todos, etc -- 
things that either aren't actionable, are 'actionable' but just not a good 
idea, placeholders that will never be followed up on.

This is why I don't think "let's investigate" JIRAs make much sense in theory 
or practice. Investigate, sure, then track the action that falls out in a JIRA. 
I don't object to this one given your clarification. My experience is that 
these don't get followed up on, they get left open, or they feel like a 
substitute for doing the investigation -- i.e. let me write this down for 
someone else to think about, which virtually never ever happens

> Investigate difference of `returnNullable` between 
> ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor
> 
>
> Key: SPARK-26991
> URL: https://issues.apache.org/jira/browse/SPARK-26991
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort on investigation on difference between 
> ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor, 
> especially the reason why Java side uses `returnNullable = true` whereas 
> `returnNullable = false`.
> The origin discussion is linked here:
> https://github.com/apache/spark/pull/23854#discussion_r260117702



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782789#comment-16782789
 ] 

Sean Owen commented on SPARK-27025:
---

It's an interesting question; let's break it down.

Calling toLocalIterator on an RDD of N partitions actually runs N jobs to 
compute the partitions individually. That's fine except you wait for the next 
partition job to complete after consuming the last one's iterator.

You could cache the RDD (essential here anyway) and materialize it all first 
with count() or something, then run toLocalIterator. That more or less 
eliminates this delay and ensures you only have one partition of data on the 
driver at a time. Yes it means you persist the RDD. That's actually vital for 
an RDD from a wide transformation; you absolutely don't want to recompute the 
whole thing N times. For a narrow transform, OK, per-partition computation is 
in theory no more work than computing it once in one go, even without caching.

Of course, this also means you don't start iterating at all until all 
partitions are done. In some cases you can't do better anyway (e.g. a wide 
transformation where all partitions have to be computed at once anyway). But 
then again, even for narrow transforms, the wall-clock time to compute 1 
partition is about the same for all partitions. You'd wait as long for 1 to 
finish as for N, assuming they're fairly equally sized tasks.

toLocalIterator could also compute the partitions in parallel on the driver. 
But this more or less reduces to collect(), as all the results might arrive on 
the driver before they're consumed.

It could, say, compute partitions in parallel in a way that partition N+1 is 
started as soon as the job for N finishes. That's not too hard even, but now we 
have up to 2 partitions' worth of data on the driver instead of 1. There's a 
tradeoff there, in complexity and extra driver memory, but it's coherent.

This is even implementable now in your code if you want to try it; just call 
sc.runJob directly like toLocalIterator does and add the fetch-ahead logic. Do 
you even care about consuming the results in order, or just iterating over the 
partitions' results as soon as each is available? if doing it in order isn't 
required, this is even better than a parallel toLocalIterator. You run back 
into the issue that all the data might arrive on the driver at one time; if 
that's an issue here this probably won't fly. If it's not an issue, this 
probably doesn't add a lot over just collect()-ing but it's possible.

I'm not against trying the 2-partition implementation of toLocalIterator, but 
think the use case for it is limited, given that many scenarios have better or 
no-worse solutions already, per above.
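
A minimal sketch of the materialize-then-iterate pattern described above (expensiveTransform and consume are placeholders for your own functions):

{code:scala}
import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.parallelize(1 to 1000000, 100).map(expensiveTransform)
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()                            // one parallel job computes every partition
rdd.toLocalIterator.foreach(consume)   // then each cached partition is fetched one at a time
rdd.unpersist()
{code}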

> Speed up toLocalIterator
> 
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: Erik van Oosten
>Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. 
> However, as far as I can see, any required computation for the 
> yet-to-be-fetched-partitions is not kicked off until it is fetched. 
> Effectively only one partition is being computed at the same time. 
> Desired behavior: immediately start calculation of all partitions while 
> retaining the download-a-partition at a time behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2019-03-03 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782778#comment-16782778
 ] 

Maxim Gekk commented on SPARK-26016:


> nothing reinterprets the bytes according to a different encoding?

Correct

> The underlying Hadoop impl does interpret the bytes as UTF-8 (skipping of 
> BOMs, etc) ...

Hadoop's LineReader does not decode input bytes. It just copies bytes between 
line delimiters in 
https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L257
 by using Text.append 
(https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L335):
{code:java}
  public void append(byte[] utf8, int start, int len) {
setCapacity(length + len, true);
System.arraycopy(utf8, start, bytes, length, len);
length += len;
  }
{code}

Spark actually never checks correctness of UTF8String input. I even created a 
few tickets for that:
https://issues.apache.org/jira/browse/SPARK-23741
https://issues.apache.org/jira/browse/SPARK-23649

Adding such checks would most likely bring some performance degradation.

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Priority: Major
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in an ISO-8859-1 encoded file and simply write out what was read, 
> the contents of the part file match what was read.  Which is great.
> However, the second I use a map / mapPartitions function, it looks like the 
> encoding is not correct.  In addition, a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26620) DataFrameReader.json and csv in Python should accept DataFrame.

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26620.
---
Resolution: Not A Problem

> DataFrameReader.json and csv in Python should accept DataFrame.
> ---
>
> Key: SPARK-26620
> URL: https://issues.apache.org/jira/browse/SPARK-26620
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently {{DataFrameReader.json()}} and {{csv()}} in Python doesn't accept 
> {{DataFrame}}, but they should accept if the schema of the given 
> {{DataFrame}} contains only one String column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26610) Fix inconsistency between toJSON Method in Python and Scala

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26610.
---
Resolution: Not A Problem

> Fix inconsistency between toJSON Method in Python and Scala
> ---
>
> Key: SPARK-26610
> URL: https://issues.apache.org/jira/browse/SPARK-26610
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> {{DataFrame.toJSON()}} in PySpark should return {{DataFrame}} of JSON string 
> instead of {{RDD}}. The method in Scala/Java was changed to return 
> {{DataFrame}} before, but the one in PySpark was not changed at that time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782774#comment-16782774
 ] 

Sean Owen commented on SPARK-26016:
---

[~maxgekk] I see, but am I correct that in the text source, nothing 
reinterprets the bytes according to a different encoding? That's coherent if 
that's the intended behavior for now, and 'encoding' is really a hidden option 
that isn't meant to be used with the text source.

But I'm wondering, does this actually work in all cases for JSON? The 
underlying Hadoop impl does interpret the bytes as UTF-8 (skipping of BOMs, 
etc) even if it doesn't appear to try to parse it as UTF-8. It might happen to 
work unless some other encoding puts BOM bytes at the start.

Since you may know this part better than I -- is adding encoding support to the 
text reader probably as easy as parsing the bytes as a string and re-encoding 
them as UTF-8 right around 
https://github.com/apache/spark/blob/36a2e6371b4d173c3e03cc0d869c39335a0d7682/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L135
 ? That's going to be a perf hit to be sure, and only happens if it's not 
already UTF-8.
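
For reference, a hedged sketch of the kind of re-encoding step being discussed (not the actual TextFileFormat change; just the byte-level conversion from a declared charset to UTF-8):

{code:scala}
import java.nio.charset.{Charset, StandardCharsets}

// Reinterpret the raw bytes using the source encoding, then re-encode as UTF-8
// so that downstream code assuming UTF8String input stays correct.
def toUtf8(bytes: Array[Byte], encoding: String): Array[Byte] = {
  val charset = Charset.forName(encoding)
  if (charset == StandardCharsets.UTF_8) bytes
  else new String(bytes, charset).getBytes(StandardCharsets.UTF_8)
}
{code}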

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Priority: Major
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in an ISO-8859-1 encoded file and simply write out what was read, 
> the contents of the part file match what was read.  Which is great.
> However, the second I use a map / mapPartitions function, it looks like the 
> encoding is not correct.  In addition, a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26918) All .md should have ASF license header

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-26918:
---

Huh, OK. I had thought all these headers were actually conveniences, and 
technically redundant with the top-level LICENSE. But practically useful in 
that someone copying the source files always gets the right license to go with 
it. And that doesn't really come up for docs. But sure yeah add them if that's 
explicit policy now.

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> per policy, all md files should have the header, like eg. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> currently it does not
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24778) DateTimeUtils.getTimeZone method returns GMT time if timezone cannot be parsed

2019-03-03 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-24778.

   Resolution: Fixed
Fix Version/s: 3.0.0

The issue has been fixed already by using ZoneId.of for parsing time zone ids.
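
For reference, the standard JDK behavior difference this relies on:

{code:scala}
import java.time.ZoneId
import java.util.TimeZone

TimeZone.getTimeZone("+05:00").getID  // "GMT" -- unrecognized ids silently fall back to GMT
ZoneId.of("+05:00")                   // parses as the offset +05:00, no silent fallback
ZoneId.of("bogus/zone")               // throws java.time.zone.ZoneRulesException
{code}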

> DateTimeUtils.getTimeZone method returns GMT time if timezone cannot be parsed
> --
>
> Key: SPARK-24778
> URL: https://issues.apache.org/jira/browse/SPARK-24778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Vinitha Reddy Gankidi
>Priority: Major
> Fix For: 3.0.0
>
>
> {{DateTimeUtils.getTimeZone}} calls java's {{Timezone.getTimezone}} method 
> that defaults to GMT if the timezone cannot be parsed. This can be misleading 
> for users and its better to return NULL instead of returning an incorrect 
> value. 
> To reproduce: {{from_utc_timestamp}} is one of the functions that calls 
> {{DateTimeUtils.getTimeZone}}. Session timezone is GMT for the following 
> queries.
> {code:java}
> SELECT from_utc_timestamp('2018-07-10 12:00:00', 'GMT+05:00') -> 2018-07-10 
> 17:00:00 
> SELECT from_utc_timestamp('2018-07-10 12:00:00', '+05:00') -> 2018-07-10 
> 12:00:00 (Defaults to GMT as the timezone is not recognized){code}
> We could fix it by using the workaround mentioned here: 
> [https://bugs.openjdk.java.net/browse/JDK-4412864].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27016) Treat all antlr warnings as errors while generating parser from the sql grammar file.

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27016.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23925
[https://github.com/apache/spark/pull/23925]

> Treat all antlr warnings as errors while generating parser from the sql 
> grammar file.
> -
>
> Key: SPARK-27016
> URL: https://issues.apache.org/jira/browse/SPARK-27016
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> Use the maven plugin option {{treatWarningsAsErrors}} to make sure that 
> warnings are treated as errors while generating the parser file. In its 
> absence, we may inadvertently introduce problems while making grammar 
> changes. 
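For reference, a minimal sketch of what that configuration looks like (assuming 
the standard antlr4-maven-plugin; the exact plugin wiring in Spark's pom.xml may 
differ):

{code:xml}
<plugin>
  <groupId>org.antlr</groupId>
  <artifactId>antlr4-maven-plugin</artifactId>
  <configuration>
    <!-- fail the build on antlr warnings instead of silently continuing -->
    <treatWarningsAsErrors>true</treatWarningsAsErrors>
  </configuration>
</plugin>
{code}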






[jira] [Assigned] (SPARK-27016) Treat all antlr warnings as errors while generating parser from the sql grammar file.

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27016:
-

Assignee: Dilip Biswal

> Treat all antlr warnings as errors while generating parser from the sql 
> grammar file.
> -
>
> Key: SPARK-27016
> URL: https://issues.apache.org/jira/browse/SPARK-27016
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
>
> Use the maven plugin option {{treatWarningsAsErrors}} to make sure that 
> warnings are treated as errors while generating the parser file. In its 
> absence, we may inadvertently introduce problems while making grammar 
> changes. 






[jira] [Resolved] (SPARK-26274) Download page must link to https://www.apache.org/dist/spark for current releases

2019-03-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26274.
---
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.3.3

> Download page must link to https://www.apache.org/dist/spark for current 
> releases
> -
>
> Key: SPARK-26274
> URL: https://issues.apache.org/jira/browse/SPARK-26274
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sebb
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.3
>
>
> The download page currently uses the archive server:
> https://archive.apache.org/dist/spark/...
> for all sigs and hashes.
> This is fine for archived releases; however, current ones must link to the 
> mirror system, i.e.
> https://www.apache.org/dist/spark/...
> Also, the page does not link directly to the hash or sig.
> This makes it very difficult for the user, as they have to choose the correct 
> file.
> The download page must link directly to the actual sig or hash.
> Ideally do so for the archived releases as well.






[jira] [Commented] (SPARK-26146) CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12

2019-03-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782759#comment-16782759
 ] 

Sean Owen commented on SPARK-26146:
---

Wait a sec, here's the explanation: 
https://issues.apache.org/jira/browse/SPARK-26583
This is why the dependency graph is different; it was fixed after 2.4.0.
Making Spark a provided dependency is the right practice anyway (going to add 
docs about that) and might also work around this. 
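For what it's worth, a minimal sketch of that practice in an sbt build (the 
reporter's project uses Maven, where the equivalent is giving the Spark 
artifacts {{provided}} scope); the versions below are only illustrative:

{code:scala}
// build.sbt -- compile against Spark but do not bundle it into the application jar
scalaVersion := "2.12.8"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
{code}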

> CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12
> --
>
> Key: SPARK-26146
> URL: https://issues.apache.org/jira/browse/SPARK-26146
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Jean Georges Perrin
>Priority: Major
>
> Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, 
> whereas it works OK with Scala v2.11.
> When running a simple CSV ingestion like:
> {code:java}
>     // Creates a session on a local master
>     SparkSession spark = SparkSession.builder()
>         .appName("CSV to Dataset")
>         .master("local")
>         .getOrCreate();
>     // Reads a CSV file with header, called books.csv, stores it in a 
> dataframe
>     Dataset<Row> df = spark.read().format("csv")
>         .option("header", "true")
>         .load("data/books.csv");
> {code}
>   With Scala 2.12, I get: 
> {code:java}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
> at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
> at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)
> ...
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37)
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21)
> {code}
> Whereas it works pretty smoothly if I switch back to 2.11.
> The full example is available at 
> [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can 
> modify pom.xml to easily change the Scala version in the properties section:
> {code:java}
> 
>  UTF-8
>  1.8
>  2.11
>  2.4.0
> {code}
>  
> (ps. It's my first bug submission, so I hope I did not mess too much, be 
> tolerant if I did)
>  






[jira] [Commented] (SPARK-26795) Retry remote fileSegmentManagedBuffer when creating inputStream failed during shuffle read phase

2019-03-03 Thread Mohamed Mehdi BEN AISSA (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782748#comment-16782748
 ] 

Mohamed Mehdi BEN AISSA commented on SPARK-26795:
-

I have the same issue with spark 2.3.0:
{code:java}
org.apache.spark.shuffle.FetchFailedException 

{code}
AND
{code:java}
 Error opening block StreamChunkId {streamId=1377556883266, chunkIndex=9} for 
request from /ip.adress:39050
 java.io.EOFException..
{code}
Please, how did you fix this issue? Should we wait for a patch?

> Retry remote fileSegmentManagedBuffer when creating inputStream failed during 
> shuffle read phase
> 
>
> Key: SPARK-26795
> URL: https://issues.apache.org/jira/browse/SPARK-26795
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: feiwang
>Priority: Major
>
> There is a parameter, spark.maxRemoteBlockSizeFetchToMem, which means a 
> remote block will be fetched to disk when the size of the block is above this 
> threshold in bytes.
> So during the shuffle read phase, a managedBuffer that throws an IOException 
> may be a remotely downloaded FileSegment and should be retried instead of 
> throwing FetchFailed directly.
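For context, a minimal sketch of setting that threshold when building a session 
(the value is only an example, not a recommendation):

{code:scala}
import org.apache.spark.sql.SparkSession

// Remote shuffle blocks larger than this are streamed to disk instead of
// being buffered in memory; smaller blocks are still fetched into memory.
val spark = SparkSession.builder()
  .appName("fetch-to-disk-example")
  .config("spark.maxRemoteBlockSizeFetchToMem", "200m")
  .getOrCreate()
{code}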






[jira] [Commented] (SPARK-24346) Executors are unable to fetch remote cache blocks

2019-03-03 Thread Mohamed Mehdi BEN AISSA (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782739#comment-16782739
 ] 

Mohamed Mehdi BEN AISSA commented on SPARK-24346:
-

Many thanks [~kien_truong]!

Speculation can also resolve the issue, but as you said, we have to find the 
root cause of this issue.
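For reference, a minimal sketch of turning speculation on (the extra knobs are 
shown with their usual defaults and are assumptions here, not a tuning 
recommendation):

{code:scala}
import org.apache.spark.SparkConf

// Re-launch slow tasks speculatively so a single stuck fetch does not stall the stage.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
{code}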

 

> Executors are unable to fetch remote cache blocks
> -
>
> Key: SPARK-24346
> URL: https://issues.apache.org/jira/browse/SPARK-24346
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.3.0
> Environment: OS: Centos 7.3
> Cluster: Hortonwork HDP 2.6.5 with Spark 2.3.0
>Reporter: Truong Duc Kien
>Priority: Major
>
> After we upgraded from Spark 2.2.1 to Spark 2.3.0, our Spark jobs took a 
> massive performance hit because executors became unable to fetch remote cache 
> blocks from each other. The scenario is:
> 1. An executor creates a connection and sends a ChunkFetchRequest message to 
> another executor. 
> 2. This request arrives at the target executor, which sends back a 
> ChunkFetchSuccess response.
> 3. The ChunkFetchSuccess message never arrives.
> 4. The connection between these two executors is killed by the originating 
> executor after 120s of idleness. At the same time, the other executor reports 
> that it failed to send the ChunkFetchSuccess because the pipe is closed.
> This process repeats itself 3 times, delaying our jobs by 6 minutes; then the 
> originating executor decides to stop fetching, calculates the block by 
> itself, and the job can continue.






[jira] [Commented] (SPARK-24346) Executors are unable to fetch remote cache blocks

2019-03-03 Thread Truong Duc Kien (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782723#comment-16782723
 ] 

Truong Duc Kien commented on SPARK-24346:
-

We never got to find out the cause of this problem.

But it doesn't seem to happen if you cache the data to disk instead of memory:
{code:java}
import org.apache.spark.storage.StorageLevel

// cache to disk
df.persist(StorageLevel.DISK_ONLY)
// instead of caching in memory with
df.cache(){code}
 

For some jobs, we just disable caching altogether. Caching can actually slow 
down some jobs due to reduced concurrency.

> Executors are unable to fetch remote cache blocks
> -
>
> Key: SPARK-24346
> URL: https://issues.apache.org/jira/browse/SPARK-24346
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.3.0
> Environment: OS: Centos 7.3
> Cluster: Hortonwork HDP 2.6.5 with Spark 2.3.0
>Reporter: Truong Duc Kien
>Priority: Major
>
> After we upgraded from Spark 2.2.1 to Spark 2.3.0, our Spark jobs took a 
> massive performance hit because executors became unable to fetch remote cache 
> blocks from each other. The scenario is:
> 1. An executor creates a connection and sends a ChunkFetchRequest message to 
> another executor. 
> 2. This request arrives at the target executor, which sends back a 
> ChunkFetchSuccess response.
> 3. The ChunkFetchSuccess message never arrives.
> 4. The connection between these two executors is killed by the originating 
> executor after 120s of idleness. At the same time, the other executor reports 
> that it failed to send the ChunkFetchSuccess because the pipe is closed.
> This process repeats itself 3 times, delaying our jobs by 6 minutes; then the 
> originating executor decides to stop fetching, calculates the block by 
> itself, and the job can continue.






[jira] [Commented] (SPARK-24346) Executors are unable to fetch remote cache blocks

2019-03-03 Thread Mohamed Mehdi BEN AISSA (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782678#comment-16782678
 ] 

Mohamed Mehdi BEN AISSA commented on SPARK-24346:
-

Any news? I have exactly the same issue in the same context (HDP version):

{code:java}
ERROR TransportRequestHandler: Error opening block StreamChunkId{streamId=1377556883266, chunkIndex=9} for request from /10.147.167.40:39050
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readLong(DataInputStream.java:416)
    at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:209)
    at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:375)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
    at org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:92)
    at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:137)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:745)
{code}

> Executors are unable to fetch remote cache blocks
> -
>
> Key: SPARK-24346
> URL: https://issues.apache.org/jira/browse/SPARK-24346
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.3.0
> Environment: OS: Centos 7.3
> Cluster: Hortonwork 

[jira] [Commented] (SPARK-24346) Executors are unable to fetch remote cache blocks

2019-03-03 Thread Mohamed Mehdi BEN AISSA (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782679#comment-16782679
 ] 

Mohamed Mehdi BEN AISSA commented on SPARK-24346:
-

org.apache.spark.shuffle.FetchFailedException at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:523)
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:454)
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
 at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) 
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) 
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.sort_addToSorter$(Unknown
 Source) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(Unknown
 Source) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$12$$anon$2.hasNext(WholeStageCodegenExec.scala:633)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.findNextInnerJoinRows$(Unknown
 Source) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$12$$anon$2.hasNext(WholeStageCodegenExec.scala:633)
 at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:793)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:754)
 at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:916)
 at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:952)
 at 
org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) 
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage16.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage35.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:109) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 

[jira] [Assigned] (SPARK-27035) Current time with microsecond resolution

2019-03-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27035:


Assignee: (was: Apache Spark)

> Current time with microsecond resolution
> 
>
> Key: SPARK-27035
> URL: https://issues.apache.org/jira/browse/SPARK-27035
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the CurrentTimestamp expression uses 
> [System.currentTimeMillis()|https://github.com/apache/spark/blob/a2a41b7bf2bfdcd1cff242013716ac7bd84bdacd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L99]
>  to take the current time with millisecond resolution. Instant.now could 
> potentially provide the current time with microsecond resolution: 
> [https://bugs.openjdk.java.net/browse/JDK-8068730]. The ticket aims to 
> replace *System.currentTimeMillis()* with *Instant.now()*.






[jira] [Assigned] (SPARK-27035) Current time with microsecond resolution

2019-03-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27035:


Assignee: Apache Spark

> Current time with microsecond resolution
> 
>
> Key: SPARK-27035
> URL: https://issues.apache.org/jira/browse/SPARK-27035
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the CurrentTimestamp expression uses 
> [System.currentTimeMillis()|https://github.com/apache/spark/blob/a2a41b7bf2bfdcd1cff242013716ac7bd84bdacd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L99]
>  to take the current time with millisecond resolution. Instant.now could 
> potentially provide the current time with microsecond resolution: 
> [https://bugs.openjdk.java.net/browse/JDK-8068730]. The ticket aims to 
> replace *System.currentTimeMillis()* with *Instant.now()*.






[jira] [Updated] (SPARK-27035) Current time with microsecond resolution

2019-03-03 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-27035:
---
Description: Currently,  the CurrentTimestamp expression uses 
[System.currentTimeMillis()|[https://github.com/apache/spark/blob/a2a41b7bf2bfdcd1cff242013716ac7bd84bdacd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L99]]
 to take current time with millisecond resolution. The Instant.now allows 
potentially to take current time with microsecond resolution: 
[https://bugs.openjdk.java.net/browse/JDK-8068730] . The ticket aims to replace 
*System.currentTimeMillis()* by *Instant.now()*.  (was: Currently,  the 
CurrentTimestamp expression uses 
[System.currentTimeMillis()|[https://github.com/apache/spark/blob/a2a41b7bf2bfdcd1cff242013716ac7bd84bdacd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L99]]
 to take current time with millisecond resolution. The Instant.now allows 
potentially to take current time with microsecond resolution: 
[https://bugs.openjdk.java.net/browse/JDK-8068730] . The ticket aims to replace 
System.currentTimeMillis()| by Instant.now().)

> Current time with microsecond resolution
> 
>
> Key: SPARK-27035
> URL: https://issues.apache.org/jira/browse/SPARK-27035
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the CurrentTimestamp expression uses 
> [System.currentTimeMillis()|https://github.com/apache/spark/blob/a2a41b7bf2bfdcd1cff242013716ac7bd84bdacd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L99]
>  to take the current time with millisecond resolution. Instant.now could 
> potentially provide the current time with microsecond resolution: 
> [https://bugs.openjdk.java.net/browse/JDK-8068730]. The ticket aims to 
> replace *System.currentTimeMillis()* with *Instant.now()*.






[jira] [Created] (SPARK-27035) Current time with microsecond resolution

2019-03-03 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-27035:
--

 Summary: Current time with microsecond resolution
 Key: SPARK-27035
 URL: https://issues.apache.org/jira/browse/SPARK-27035
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, the CurrentTimestamp expression uses 
[System.currentTimeMillis()|https://github.com/apache/spark/blob/a2a41b7bf2bfdcd1cff242013716ac7bd84bdacd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L99]
 to take the current time with millisecond resolution. Instant.now could 
potentially provide the current time with microsecond resolution: 
[https://bugs.openjdk.java.net/browse/JDK-8068730]. The ticket aims to replace 
System.currentTimeMillis() with Instant.now().
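For illustration, a minimal sketch of deriving epoch microseconds from 
Instant.now (plain java.time, not the actual Spark change):

{code:scala}
import java.time.Instant

val now = Instant.now()
// Epoch microseconds; whether the underlying clock really ticks at microsecond
// granularity depends on the JDK and OS (see JDK-8068730).
val micros = now.getEpochSecond * 1000000L + now.getNano / 1000
{code}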





