[jira] [Updated] (SPARK-44108) Cannot parse Type from german "umlaut"

2023-06-20 Thread Jorge Machado (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-44108:
--
Description: 
Hello all, 

 

I have a client with a column named bfzugtäeil.

Spark cannot handle this. My test:

 
{code:java}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.scalatest.funsuite.AnyFunSuite

class HiveTest extends AnyFunSuite {

  test("test that Spark does not cut columns with ä") {
val data =
  "bfzugtäeil:string"
CatalystSqlParser.parseDataType(data)
  }

} {code}
I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.

Any ideas?

 
{code:java}
== SQL ==
bfzugtäeil:string
--^^^
at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
   at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:41)
   at 
com.deutschebahn.zod.fvdl.commons.spark.app.captured.HiveTest2.$anonfun$new$1(HiveTest2.scala:13)
 {code}
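A workaround sketch I have not verified (just an assumption on my side): Spark SQL 
normally accepts arbitrary characters in backtick-quoted identifiers, so parsing the 
column as a quoted struct field may get past the lexer where the bare name:type 
string fails:
{code:java}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Untested sketch: wrap the umlaut column name in backticks inside a struct DDL
// so the lexer sees a quoted identifier instead of a bare one.
val parsed = CatalystSqlParser.parseDataType("struct<`bfzugtäeil`:string>")
println(parsed) // if this parses, the field name should survive intact
{code}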

  was:
Hello all, 

 

I have a client with a column named bfzugtäeil.

Spark cannot handle this. My test:

 
{code:java}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.scalatest.funsuite.AnyFunSuite

class HiveTest extends AnyFunSuite {

  test("test that Spark does not cut columns with ä") {
val data =
  "bfzugtäeil:string"
CatalystSqlParser.parseDataType(data)
  }

} {code}
I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.

Any ideas?


> Cannot parse Type from german "umlaut"
> --
>
> Key: SPARK-44108
> URL: https://issues.apache.org/jira/browse/SPARK-44108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jorge Machado
>Priority: Major
>
> Hello all, 
>  
> I have a client with a column named bfzugtäeil.
> Spark cannot handle this. My test:
>  
> {code:java}
> import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
> import org.scalatest.funsuite.AnyFunSuite
> class HiveTest extends AnyFunSuite {
>   test("test that Spark does not cut columns with ä") {
> val data =
>   "bfzugtäeil:string"
> CatalystSqlParser.parseDataType(data)
>   }
> } {code}
> I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.
> Any ideas?
>  
> {code:java}
> == SQL ==
> bfzugtäeil:string
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
>at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:41)
>at 
> com.deutschebahn.zod.fvdl.commons.spark.app.captured.HiveTest2.$anonfun$new$1(HiveTest2.scala:13)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44108) Cannot parse Type from german "umlaut"

2023-06-20 Thread Jorge Machado (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-44108:
--
Description: 
Hello all, 

 

I have a client with a column named bfzugtäeil.

Spark cannot handle this. My test:

 
{code:java}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.scalatest.funsuite.AnyFunSuite

class HiveTest extends AnyFunSuite {

  test("test that Spark does not cut columns with ä") {
val data = "bfzugtäeil:string"
CatalystSqlParser.parseDataType(data)
  }

} {code}
I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.

Any ideas?

 
{code:java}
== SQL ==
bfzugtäeil:string
--^^^
at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
   at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:41)
   at 
com.deutschebahn.zod.fvdl.commons.spark.app.captured.HiveTest2.$anonfun$new$1(HiveTest2.scala:13)
 {code}

  was:
Hello all, 

 

I have a client with a column named bfzugtäeil.

Spark cannot handle this. My test:

 
{code:java}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.scalatest.funsuite.AnyFunSuite

class HiveTest extends AnyFunSuite {

  test("test that Spark does not cut columns with ä") {
val data =
  "bfzugtäeil:string"
CatalystSqlParser.parseDataType(data)
  }

} {code}
I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.

Any ideas?

 
{code:java}
== SQL ==
bfzugtäeil:string
--^^^
at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
   at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:41)
   at 
com.deutschebahn.zod.fvdl.commons.spark.app.captured.HiveTest2.$anonfun$new$1(HiveTest2.scala:13)
 {code}


> Cannot parse Type from german "umlaut"
> --
>
> Key: SPARK-44108
> URL: https://issues.apache.org/jira/browse/SPARK-44108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jorge Machado
>Priority: Major
>
> Hello all, 
>  
> I have a client with a column named bfzugtäeil.
> Spark cannot handle this. My test:
>  
> {code:java}
> import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
> import org.scalatest.funsuite.AnyFunSuite
> class HiveTest extends AnyFunSuite {
>   test("test that Spark does not cut columns with ä") {
> val data = "bfzugtäeil:string"
> CatalystSqlParser.parseDataType(data)
>   }
> } {code}
> I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.
> Any ideas?
>  
> {code:java}
> == SQL ==
> bfzugtäeil:string
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
>at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:143)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseDataType(ParseDriver.scala:41)
>at 
> com.deutschebahn.zod.fvdl.commons.spark.app.captured.HiveTest2.$anonfun$new$1(HiveTest2.scala:13)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44108) Cannot parse Type from german "umlaut"

2023-06-20 Thread Jorge Machado (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-44108:
--
Priority: Major  (was: Minor)

> Cannot parse Type from german "umlaut"
> --
>
> Key: SPARK-44108
> URL: https://issues.apache.org/jira/browse/SPARK-44108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jorge Machado
>Priority: Major
>
> Hello all, 
>  
> I have a client with a column named bfzugtäeil.
> Spark cannot handle this. My test:
>  
> {code:java}
> import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
> import org.scalatest.funsuite.AnyFunSuite
> class HiveTest extends AnyFunSuite {
>   test("test that Spark does not cut columns with ä") {
> val data =
>   "bfzugtäeil:string"
> CatalystSqlParser.parseDataType(data)
>   }
> } {code}
> I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44108) Cannot parse Type from german "umlaut"

2023-06-20 Thread Jorge Machado (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-44108:
--
Priority: Minor  (was: Critical)

> Cannot parse Type from german "umlaut"
> --
>
> Key: SPARK-44108
> URL: https://issues.apache.org/jira/browse/SPARK-44108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jorge Machado
>Priority: Minor
>
> Hello all, 
>  
> I have a client with a column named bfzugtäeil.
> Spark cannot handle this. My test:
>  
> {code:java}
> import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
> import org.scalatest.funsuite.AnyFunSuite
> class HiveTest extends AnyFunSuite {
>   test("test that Spark does not cut columns with ä") {
> val data =
>   "bfzugtäeil:string"
> CatalystSqlParser.parseDataType(data)
>   }
> } {code}
> I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.
> Any ideas?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44108) Cannot parse Type from german "umlaut"

2023-06-20 Thread Jorge Machado (Jira)
Jorge Machado created SPARK-44108:
-

 Summary: Cannot parse Type from german "umlaut"
 Key: SPARK-44108
 URL: https://issues.apache.org/jira/browse/SPARK-44108
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Jorge Machado


Hello all, 

 

I have a client with a column named bfzugtäeil.

Spark cannot handle this. My test:

 
{code:java}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.scalatest.funsuite.AnyFunSuite

class HiveTest extends AnyFunSuite {

  test("test that Spark does not cut columns with ä") {
val data =
  "bfzugtäeil:string"
CatalystSqlParser.parseDataType(data)
  }

} {code}
I debugged it and ended up deep inside the org.antlr.v4.runtime.Lexer class.

Any ideas?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33772) Build and Run Spark on Java 17

2023-01-03 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653941#comment-17653941
 ] 

Jorge Machado edited comment on SPARK-33772 at 1/3/23 12:27 PM:


I still have an issue with this. Running sbt test fails and I don't know why. 
{code:java}
An exception or error caused a run to abort: class 
org.apache.spark.storage.StorageUtils$ (in unnamed module @0x1aa7ecca) cannot 
access class sun.nio.ch.DirectBuffer (in module java.base) because module 
java.base does not export sun.nio.ch to unnamed module @0x1aa7ecca 
 {code}
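The message says module java.base does not export sun.nio.ch to the unnamed module, 
so the forked test JVM needs the corresponding module flag. A minimal sbt sketch of 
how that could be passed (my own assumption, not taken from the Spark build; a full 
Spark test run may need more --add-opens entries):
{code:java}
// build.sbt sketch: fork the tests and hand the JVM the flag the error asks for
Test / fork := true
Test / javaOptions ++= Seq(
  "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
)
{code}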


was (Author: jomach):
I still have an issue with this. Running sbt test fails and I don't know why. 

[error] Uncaught exception when running .LocalViewCreatorTest: 
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.storage.StorageUtils$
23/01/03 12:16:40 INFO Utils: Successfully started service 'sparkDriver' on 
port 49902.
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.spark.storage.StorageUtils$
[error]         at 
org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:114)

> Build and Run Spark on Java 17
> --
>
> Key: SPARK-33772
> URL: https://issues.apache.org/jira/browse/SPARK-33772
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Yang Jie
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is 
> 17.
> ||Version||Release Date||
> |Java 17 (LTS)|September 2021|
> Apache Spark has a release plan and `Spark 3.2 Code freeze` was July along 
> with the release branch cut.
> - https://spark.apache.org/versioning-policy.html
> Supporting a new Java version is considered a new feature, which we cannot 
> allow to backport.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33772) Build and Run Spark on Java 17

2023-01-03 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653941#comment-17653941
 ] 

Jorge Machado edited comment on SPARK-33772 at 1/3/23 12:27 PM:


I still have an issue with this. Running sbt test fails and I don't know why. 

[error] Uncaught exception when running .LocalViewCreatorTest: 
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.storage.StorageUtils$
23/01/03 12:16:40 INFO Utils: Successfully started service 'sparkDriver' on 
port 49902.
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.spark.storage.StorageUtils$
[error]         at 
org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:114)


was (Author: jomach):
I still have an issue with this. Running sbt test fails and I don't know why. 

[error] Uncaught exception when running 
com.deutschebahn.zod.fvdl.commons.aws.athena.LocalViewCreatorTest: 
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.storage.StorageUtils$
23/01/03 12:16:40 INFO Utils: Successfully started service 'sparkDriver' on 
port 49902.
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.spark.storage.StorageUtils$
[error]         at 
org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:114)

> Build and Run Spark on Java 17
> --
>
> Key: SPARK-33772
> URL: https://issues.apache.org/jira/browse/SPARK-33772
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Yang Jie
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is 
> 17.
> ||Version||Release Date||
> |Java 17 (LTS)|September 2021|
> Apache Spark has a release plan and `Spark 3.2 Code freeze` was July along 
> with the release branch cut.
> - https://spark.apache.org/versioning-policy.html
> Supporting a new Java version is considered a new feature, which we cannot 
> allow to backport.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33772) Build and Run Spark on Java 17

2023-01-03 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653941#comment-17653941
 ] 

Jorge Machado commented on SPARK-33772:
---

I still have an issue with this. Running sbt test fails and I don't know why. 

[error] Uncaught exception when running 
com.deutschebahn.zod.fvdl.commons.aws.athena.LocalViewCreatorTest: 
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.storage.StorageUtils$
23/01/03 12:16:40 INFO Utils: Successfully started service 'sparkDriver' on 
port 49902.
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.spark.storage.StorageUtils$
[error]         at 
org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:114)

> Build and Run Spark on Java 17
> --
>
> Key: SPARK-33772
> URL: https://issues.apache.org/jira/browse/SPARK-33772
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Yang Jie
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is 
> 17.
> ||Version||Release Date||
> |Java 17 (LTS)|September 2021|
> Apache Spark has a release plan and `Spark 3.2 Code freeze` was July along 
> with the release branch cut.
> - https://spark.apache.org/versioning-policy.html
> Supporting a new Java version is considered a new feature, which we cannot 
> allow to backport.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31683) Make Prometheus output consistent with DropWizard 4.1 result

2020-06-08 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128625#comment-17128625
 ] 

Jorge Machado commented on SPARK-31683:
---

It would be great if we could use Spark's RPC backend to aggregate this on the 
driver only. That way we only need to scrape the driver. I have made an 
implementation based on yours that registers it with Consul, so we can discover 
the YARN applications via Consul, for example...

> Make Prometheus output consistent with DropWizard 4.1 result
> 
>
> Key: SPARK-31683
> URL: https://issues.apache.org/jira/browse/SPARK-31683
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-29032 adds Prometheus support.
> After that, SPARK-29674 upgraded DropWizard for JDK9+ support and causes 
> difference in output labels and number of keys.
>  
> This issue aims to fix this inconsistency in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31683) Make Prometheus output consistent with DropWizard 4.1 result

2020-06-05 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126671#comment-17126671
 ] 

Jorge Machado commented on SPARK-31683:
---

[~dongjoon] this only exports the metrics from the driver, right?

> Make Prometheus output consistent with DropWizard 4.1 result
> 
>
> Key: SPARK-31683
> URL: https://issues.apache.org/jira/browse/SPARK-31683
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-29032 adds Prometheus support.
> After that, SPARK-29674 upgraded DropWizard for JDK9+ support and causes 
> difference in output labels and number of keys.
>  
> This issue aims to fix this inconsistency in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26902) Support java.time.Instant as an external type of TimestampType

2020-04-17 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085759#comment-17085759
 ] 

Jorge Machado commented on SPARK-26902:
---

What about supporting the Temporal interface?

> Support java.time.Instant as an external type of TimestampType
> --
>
> Key: SPARK-26902
> URL: https://issues.apache.org/jira/browse/SPARK-26902
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark supports the java.sql.Date and java.sql.Timestamp types as 
> external types for Catalyst's DateType and TimestampType. It accepts and 
> produces values of such types. Since Java 8, base classes for dates and 
> timestamps are java.time.Instant, java.time.LocalDate/LocalDateTime, and 
> java.time.ZonedDateTime. Need to add new converters from/to Instant.
> The Instant type holds epoch seconds (and nanoseconds), and directly reflects 
> to Catalyst's TimestampType.
> Main motivations for the changes:
> - Smoothly support Java 8 time API
> - Avoid inconsistency of calendars used inside Spark 3.0 (Proleptic Gregorian 
> calendar) and inside of java.sql.Timestamp (hybrid calendar - Julian + 
> Gregorian). 
> - Make conversion independent from current system timezone.
> In case of collecting values of Date/TimestampType, the following SQL config 
> can control types of returned values:
>  - spark.sql.catalyst.timestampType with supported values 
> "java.sql.Timestamp" (by default) and "java.time.Instant"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30272) Remove usage of Guava that breaks in Guava 27

2020-03-30 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070774#comment-17070774
 ] 

Jorge Machado commented on SPARK-30272:
---

I failed to fix the Guava issue, of course ... This morning I tried to 
replicate the problem of the missing hadoop-azure jar, but it seems to be 
working without any patch from my side. I assume I did something wrong in the 
build. Just for reference, my steps:
{code:java}
git checkout v3.0.0-preview-rc2
./build/mvn clean package -DskipTests -Phadoop-3.2 -Pkubernetes 
-Phadoop-cloud
./bin/docker-image-tool.sh -r docker.io/myrepo -t v2.3.0 -p 
kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
 docker run --rm -it myrepo/spark:v2.3.0 bash
 185@57fb3dd68902:/opt/spark/jars$ ls -altr *azure*
-rw-r--r-- 1 root root  67314 Mar 28 17:15 
hadoop-azure-datalake-3.2.0.jar
-rw-r--r-- 1 root root 480512 Mar 28 17:15 hadoop-azure-3.2.0.jar
-rw-r--r-- 1 root root 812977 Mar 28 17:15 azure-storage-7.0.0.jar
-rw-r--r-- 1 root root  10288 Mar 28 17:15 azure-keyvault-core-1.0.0.jar
-rw-r--r-- 1 root root  94061 Mar 28 17:15 
azure-data-lake-store-sdk-2.2.9.jar
{code}
As you can see, hadoop-azure is there, just not at version 3.2.1, but I guess 
that is a matter of updating the pom.

 

> Remove usage of Guava that breaks in Guava 27
> -
>
> Key: SPARK-30272
> URL: https://issues.apache.org/jira/browse/SPARK-30272
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.0.0
>
>
> Background:
> https://issues.apache.org/jira/browse/SPARK-29250
> https://github.com/apache/spark/pull/25932
> Hadoop 3.2.1 will update Guava from 11 to 27. There are a number of methods 
> that changed between those releases, typically just a rename, but, means one 
> set of code can't work with both, while we want to work with Hadoop 2.x and 
> 3.x. Among them:
> - Objects.toStringHelper was moved to MoreObjects; we can just use the 
> Commons Lang3 equivalent
> - Objects.hashCode etc were renamed; use java.util.Objects equivalents
> - MoreExecutors.sameThreadExecutor() became directExecutor(); for same-thread 
> execution we can use a dummy implementation of ExecutorService / Executor
> - TypeToken.isAssignableFrom become isSupertypeOf; work around with reflection
> There is probably more to the Guava issue than just this change, but it will 
> make Spark itself work with more versions and reduce our exposure to Guava 
> along the way anyway.
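
Regarding the quoted point about MoreExecutors.sameThreadExecutor()/directExecutor(): 
the "dummy implementation" is tiny, so avoiding the Guava call entirely is easy. A 
minimal sketch (my own illustration, not the class Spark actually added):
{code:java}
import java.util.concurrent.Executor

// Runs every submitted task on the calling thread, which is all the
// same-thread executor was used for here.
object SameThreadExecutor extends Executor {
  override def execute(command: Runnable): Unit = command.run()
}
{code}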



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30272) Remove usage of Guava that breaks in Guava 27

2020-03-29 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070255#comment-17070255
 ] 

Jorge Machado commented on SPARK-30272:
---

So I was able to fix it. I built it with the hadoop-3.2 profile, but after the 
build the hadoop-azure jar was missing, so I added it manually to my container 
and now it seems to load.

I was trying to bring in Guava 28 and remove 14, but that is a lot of work... 
why do we use such an old Guava version?

> Remove usage of Guava that breaks in Guava 27
> -
>
> Key: SPARK-30272
> URL: https://issues.apache.org/jira/browse/SPARK-30272
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
> Fix For: 3.0.0
>
>
> Background:
> https://issues.apache.org/jira/browse/SPARK-29250
> https://github.com/apache/spark/pull/25932
> Hadoop 3.2.1 will update Guava from 11 to 27. There are a number of methods 
> that changed between those releases, typically just a rename, but, means one 
> set of code can't work with both, while we want to work with Hadoop 2.x and 
> 3.x. Among them:
> - Objects.toStringHelper was moved to MoreObjects; we can just use the 
> Commons Lang3 equivalent
> - Objects.hashCode etc were renamed; use java.util.Objects equivalents
> - MoreExecutors.sameThreadExecutor() became directExecutor(); for same-thread 
> execution we can use a dummy implementation of ExecutorService / Executor
> - TypeToken.isAssignableFrom become isSupertypeOf; work around with reflection
> There is probably more to the Guava issue than just this change, but it will 
> make Spark itself work with more versions and reduce our exposure to Guava 
> along the way anyway.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30272) Remove usage of Guava that breaks in Guava 27

2020-03-28 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069933#comment-17069933
 ] 

Jorge Machado edited comment on SPARK-30272 at 3/28/20, 4:25 PM:
-

Hey Sean, 

This still seems to cause problems, for example:
{code:java}
> $ ./bin/run-example SparkPi 100                                               
>                                                                               
>                                                                               
>     [±master ●]> $ ./bin/run-example SparkPi 100                              
>                                                                               
>                                                                               
>                      [±master ●]20/03/28 17:21:13 WARN Utils: Your hostname, 
> Jorges-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 
> 192.168.1.2 instead (on interface en0)20/03/28 17:21:13 WARN Utils: Set 
> SPARK_LOCAL_IP if you need to bind to another address20/03/28 17:21:14 WARN 
> NativeCodeLoader: Unable to load native-hadoop library for your platform... 
> using builtin-java classes where applicableUsing Spark's default log4j 
> profile: org/apache/spark/log4j-defaults.properties20/03/28 17:21:14 INFO 
> SparkContext: Running Spark version 3.1.0-SNAPSHOT20/03/28 17:21:14 INFO 
> ResourceUtils: 
> ==20/03/28 
> 17:21:14 INFO ResourceUtils: No custom resources configured for 
> spark.driver.20/03/28 17:21:14 INFO ResourceUtils: 
> ==20/03/28 
> 17:21:14 INFO SparkContext: Submitted application: Spark Pi20/03/28 17:21:14 
> INFO ResourceProfile: Default ResourceProfile created, executor resources: 
> Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: 
> memory, amount: 1024, script: , vendor: ), task resources: Map(cpus -> name: 
> cpus, amount: 1.0)20/03/28 17:21:14 INFO ResourceProfile: Limiting resource 
> is cpu20/03/28 17:21:14 INFO ResourceProfileManager: Added ResourceProfile 
> id: 020/03/28 17:21:14 INFO SecurityManager: Changing view acls to: 
> jorge20/03/28 17:21:14 INFO SecurityManager: Changing modify acls to: 
> jorge20/03/28 17:21:14 INFO SecurityManager: Changing view acls groups 
> to:20/03/28 17:21:14 INFO SecurityManager: Changing modify acls groups 
> to:20/03/28 17:21:14 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(jorge); groups 
> with view permissions: Set(); users  with modify permissions: Set(jorge); 
> groups with modify permissions: Set()20/03/28 17:21:14 INFO Utils: 
> Successfully started service 'sparkDriver' on port 58192.20/03/28 17:21:14 
> INFO SparkEnv: Registering MapOutputTracker20/03/28 17:21:14 INFO SparkEnv: 
> Registering BlockManagerMaster20/03/28 17:21:14 INFO 
> BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information20/03/28 17:21:14 INFO BlockManagerMasterEndpoint: 
> BlockManagerMasterEndpoint up20/03/28 17:21:14 INFO SparkEnv: Registering 
> BlockManagerMasterHeartbeat20/03/28 17:21:14 INFO DiskBlockManager: Created 
> local directory at 
> /private/var/folders/0h/5b7dw9p11l58hyk0_s0d3cnhgn/T/blockmgr-d9e88815-075e-4c9b-9cc8-21c72e97c86920/03/28
>  17:21:14 INFO MemoryStore: MemoryStore started with capacity 366.3 
> MiB20/03/28 17:21:14 INFO SparkEnv: Registering 
> OutputCommitCoordinator20/03/28 17:21:15 INFO Utils: Successfully started 
> service 'SparkUI' on port 4040.20/03/28 17:21:15 INFO SparkUI: Bound SparkUI 
> to 0.0.0.0, and started at http://192.168.1.2:404020/03/28 17:21:15 INFO 
> SparkContext: Added JAR 
> file:///Users/jorge/Downloads/spark/dist/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar
>  at spark://192.168.1.2:58192/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar 
> with timestamp 158541247516620/03/28 17:21:15 INFO SparkContext: Added JAR 
> file:///Users/jorge/Downloads/spark/dist/examples/jars/scopt_2.12-3.7.1.jar 
> at spark://192.168.1.2:58192/jars/scopt_2.12-3.7.1.jar with timestamp 
> 158541247516620/03/28 17:21:15 INFO Executor: Starting executor ID driver on 
> host 192.168.1.220/03/28 17:21:15 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 
> 58193.20/03/28 17:21:15 INFO NettyBlockTransferService: Server created on 
> 192.168.1.2:5819320/03/28 17:21:15 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policyException in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess 
> at java.lang.ClassLoader.defineClass1(Native Method) at 
> java.lang.ClassLoader.defineClass(ClassLoader.java:763) at 

[jira] [Commented] (SPARK-30272) Remove usage of Guava that breaks in Guava 27

2020-03-28 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069933#comment-17069933
 ] 

Jorge Machado commented on SPARK-30272:
---

Hey Sean, 

This still seems to cause problems, for example:
{code:java}
 java.lang.NoClassDefFoundError: 
com/google/common/util/concurrent/internal/InternalFutureFailureAccess 
java.lang.NoClassDefFoundError: 
com/google/common/util/concurrent/internal/InternalFutureFailureAccess at 
java.lang.ClassLoader.defineClass1(Native Method) at 
java.lang.ClassLoader.defineClass(ClassLoader.java:757) at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at 
java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at 
java.net.URLClassLoader.access$100(URLClassLoader.java:74) at 
java.net.URLClassLoader$1.run(URLClassLoader.java:369) at 
java.net.URLClassLoader$1.run(URLClassLoader.java:363) at 
java.security.AccessController.doPrivileged(Native Method) at 
java.net.URLClassLoader.findClass(URLClassLoader.java:362) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:419) at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:352) at 
java.lang.ClassLoader.defineClass1(Native Method) at 
java.lang.ClassLoader.defineClass(ClassLoader.java:757) at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at 
java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at 
java.net.URLClassLoader.access$100(URLClassLoader.java:74) at 
java.net.URLClassLoader$1.run(URLClassLoader.java:369) at 
java.net.URLClassLoader$1.run(URLClassLoader.java:363) at 
java.security.AccessController.doPrivileged(Native Method) at 
java.net.URLClassLoader.findClass(URLClassLoader.java:362) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:419) at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:352) at 
java.lang.ClassLoader.defineClass1(Native Method) at 
java.lang.ClassLoader.defineClass(ClassLoader.java:757) at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at 
java.net.URLClassLoader.defineClass(URLClassLoader.java:468) at 
java.net.URLClassLoader.access$100(URLClassLoader.java:74) at 
java.net.URLClassLoader$1.run(URLClassLoader.java:369) at 
java.net.URLClassLoader$1.run(URLClassLoader.java:363) at 
java.security.AccessController.doPrivileged(Native Method) at 
java.net.URLClassLoader.findClass(URLClassLoader.java:362) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:419) at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at 
java.lang.ClassLoader.loadClass(ClassLoader.java:352) at 
com.google.common.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3472)
 at 
com.google.common.cache.LocalCache$LoadingValueReference.<init>(LocalCache.java:3476)
 at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2134)
 at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045) at 
com.google.common.cache.LocalCache.get(LocalCache.java:3951) at 
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974) at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958) 
at org.apache.hadoop.security.Groups.getGroups(Groups.java:228) at 
org.apache.hadoop.security.UserGroupInformation.getGroups(UserGroupInformation.java:1588)
 at 
org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1453)
 at 
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:147)
 at 
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:104)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) at 
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) at 
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) at 
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) at 
org.apache.hadoop.fs.Path.getFileSystem(Path.java:365) at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:522)
 at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:491)
 at 
org.apache.spark.SparkContext.$anonfun$newAPIHadoopFile$2(SparkContext.scala:1219)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
at org.apache.spark.SparkContext.withScope(SparkContext.scala:757) at 
org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1207) at 
org.apache.spark.api.java.JavaSparkContext.newAPIHadoopFile(JavaSparkContext.scala:484)
{code}
I still see a lot of references to Guava 14 on master. Is this normal? Sorry 
for the question...

 

 

 

> Remove usage of Guava that 

[jira] [Comment Edited] (SPARK-23897) Guava version

2020-03-28 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069352#comment-17069352
 ] 

Jorge Machado edited comment on SPARK-23897 at 3/28/20, 9:50 AM:
-

I think that master is actually broken, at least at commit 
d025ddbaa7e7b9746d8e47aeed61ed39d2f09f0e. I built it with:
{code:java}
./build/mvn  clean package  -DskipTests -Phadoop-3.2 -Pkubernetes -Phadoop-cloud
{code}
I get: 
{code:java}
jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin                             
                                                                                
                                                                                
   [10:43:24]jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin                
                                                                                
                                                                                
                [10:43:24]> $ java -version                                     
                                                                                
                                                                                
                            [±master ✓]java version "1.8.0_211"Java(TM) SE 
Runtime Environment (build 1.8.0_211-b12)Java HotSpot(TM) 64-Bit Server VM 
(build 25.211-b12, mixed mode) jorge@Jorges-MacBook-Pro 
~/Downloads/spark/dist/bin                                                      
                                                                                
                                                          [10:43:27]> $ 
./run-example SparkPi 100                                                       
                                                                                
                                                                              
[±master ✓]Exception in thread "main" java.lang.NoSuchMethodError: 
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357) at 
org.apache.hadoop.conf.Configuration.set(Configuration.java:1338) at 
org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456)
 at 
org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427)
 at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
 at scala.Option.getOrElse(Option.scala:189) at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
 at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at 
org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at 
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
If I delete Guava 14 and add Guava 28, it works.


was (Author: jomach):
I think that master is actually broken, at least at commit 
d025ddbaa7e7b9746d8e47aeed61ed39d2f09f0e. I built it with:
{code:java}
./build/mvn  clean package  -DskipTests -Phadoop-3.2 -Pkubernetes -Phadoop-cloud
{code}
I get: 
{code:java}
jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin                             
                                                                                
                                                                                
   [10:43:24]jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin                
                                                                                
                                                                                
                [10:43:24]> $ java -version                                     
                                                                                
                                                                                
                            [±master ✓]java version "1.8.0_211"Java(TM) SE 
Runtime Environment (build 1.8.0_211-b12)Java HotSpot(TM) 64-Bit Server VM 
(build 25.211-b12, mixed mode)
jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin                             
                                                                                
                                                                                
   [10:43:27]> $ ./run-example SparkPi 100                                      
                                                                                
                                                                                
               [±master 

[jira] [Commented] (SPARK-23897) Guava version

2020-03-28 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069352#comment-17069352
 ] 

Jorge Machado commented on SPARK-23897:
---

I think that master is actually broken, at least at commit 
d025ddbaa7e7b9746d8e47aeed61ed39d2f09f0e. I built it with:
{code:java}
./build/mvn  clean package  -DskipTests -Phadoop-3.2 -Pkubernetes -Phadoop-cloud
{code}
I get: 
{code:java}
jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin                             
                                                                                
                                                                                
   [10:43:24]jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin                
                                                                                
                                                                                
                [10:43:24]> $ java -version                                     
                                                                                
                                                                                
                            [±master ✓]java version "1.8.0_211"Java(TM) SE 
Runtime Environment (build 1.8.0_211-b12)Java HotSpot(TM) 64-Bit Server VM 
(build 25.211-b12, mixed mode)
jorge@Jorges-MacBook-Pro ~/Downloads/spark/dist/bin                             
                                                                                
                                                                                
   [10:43:27]> $ ./run-example SparkPi 100                                      
                                                                                
                                                                                
               [±master ✓]Exception in thread "main" 
java.lang.NoSuchMethodError: 
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357) at 
org.apache.hadoop.conf.Configuration.set(Configuration.java:1338) at 
org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopHiveConfigurations(SparkHadoopUtil.scala:456)
 at 
org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:427)
 at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
 at scala.Option.getOrElse(Option.scala:189) at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
 at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at 
org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at 
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

> Guava version
> -
>
> Key: SPARK-23897
> URL: https://issues.apache.org/jira/browse/SPARK-23897
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sercan Karaoglu
>Priority: Minor
>
> Guava dependency version 14 is pretty old, needs to be updated to at least 
> 16, google cloud storage connector uses newer one which causes pretty popular 
> error with guava; "java.lang.NoSuchMethodError: 
> com.google.common.base.Splitter.splitToList(Ljava/lang/CharSequence;)Ljava/util/List;"
>  and causes app to crash



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-27 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046958#comment-17046958
 ] 

Jorge Machado commented on SPARK-26412:
---

Thanks for the Tipp. It helps 

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
> pred = model.predict(features)
> yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
> pred = model.predict(pdf['x'])
> yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-27 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046373#comment-17046373
 ] 

Jorge Machado commented on SPARK-26412:
---

Well, I was thinking of something more, like handing a, b, c over to another 
object. For example:

 
{code:java}
class SomeClass:
    def __init__(self, a, b, c):
        pass

def map_func(batch_iter):
    # this does not work: batch_iter is an iterator, not something indexable
    dataset = SomeClass(batch_iter[0], batch_iter[1], batch_iter[2])

{code}
And another thing: it would be great if we could just yield JSON, for example, 
instead of these fixed types.

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
> pred = model.predict(features)
> yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
> pred = model.predict(pdf['x'])
> yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2020-02-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046265#comment-17046265
 ] 

Jorge Machado commented on SPARK-26412:
---

Hi, one question. 

when using "a tuple of pd.Series if UDF is called with more than one Spark DF 
columns" how can I get the Series into a variables. like 

a, b, c = iterator ? map seems not to be a python tuple ... 

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, user needs to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient.
> We can provide users the iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITER)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
> yield model.predict(batch)
> {code}
> The type of each batch is:
> * a pd.Series if UDF is called with a single non-struct-type column
> * a tuple of pd.Series if UDF is called with more than one Spark DF columns
> * a pd.DataFrame if UDF is called with a single StructType column
> Examples:
> {code}
> @pandas_udf(...)
> def evaluate(batch_iter):
>   model = ... # load model
>   for features, label in batch_iter:
> pred = model.predict(features)
> yield (pred - label).abs()
> df.select(evaluate(col("features"), col("label")).alias("err"))
> {code}
> {code}
> @pandas_udf(...)
> def evaluate(pdf_iter):
>   model = ... # load model
>   for pdf in pdf_iter:
> pred = model.predict(pdf['x'])
> yield (pred - pdf['y']).abs()
> df.select(evaluate(struct(col("features"), col("label"))).alias("err"))
> {code}
> If the UDF doesn't return the same number of records for the entire 
> partition, user should see an error. We don't restrict that every yield 
> should match the input batch size.
> Another benefit is with iterator interface and asyncio from Python, it is 
> flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2020-02-11 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034492#comment-17034492
 ] 

Jorge Machado commented on SPARK-24615:
---

Yeah, that was my question. Thanks for the response. I will look at rapids.ai 
and try to use it inside a partition or so...

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2020-02-11 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034277#comment-17034277
 ] 

Jorge Machado commented on SPARK-24615:
---

[~tgraves] thanks for the input. It would be great to have one or two examples 
of how to use the GPUs within a dataset.

I tried to figure out the API but I did not find any useful docs. Any tips?
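As a starting point, here is a rough sketch of how the Spark 3.0 resource API can be 
read from inside a task; the GPU confs in the comments are the usual way to request 
GPUs, and this is only an illustrative snippet, not an official example:
{code:java}
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

object GpuAddressesSketch {
  def main(args: Array[String]): Unit = {
    // Assumes the job was submitted with the Spark 3.0 resource confs, e.g.:
    //   --conf spark.executor.resource.gpu.amount=1
    //   --conf spark.task.resource.gpu.amount=1
    //   --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpus.sh
    val spark = SparkSession.builder().appName("gpu-addresses").getOrCreate()

    val perTaskGpus = spark.sparkContext.parallelize(1 to 4, 4).map { _ =>
      // resources() maps resource name -> ResourceInformation (with its addresses).
      val gpu = TaskContext.get().resources().get("gpu")
      gpu.map(_.addresses.mkString(",")).getOrElse("no gpu assigned to this task")
    }
    perTaskGpus.collect().foreach(println)
    spark.stop()
  }
}
{code}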

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30647) When creating a custom datasource File NotFoundExpection happens

2020-02-04 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029788#comment-17029788
 ] 

Jorge Machado commented on SPARK-30647:
---

2.4.x has the same issue.

> When creating a custom datasource File NotFoundExpection happens
> 
>
> Key: SPARK-30647
> URL: https://issues.apache.org/jira/browse/SPARK-30647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jorge Machado
>Priority: Major
>
> Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. 
> when I pass a path or a file that has a white space it seems to fail wit the 
> error: 
> {code:java}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 
> (TID 213, localhost, executor driver): java.io.FileNotFoundException: File 
> file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
> underlying files have been updated. You can explicitly invalidate the cache 
> in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating 
> the Dataset/DataFrame involved. at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> {code}
> I'm happy to fix this if someone tells me where I need to look.  
> I think it is on org.apache.spark.rdd.InputFileBlockHolder : 
> {code:java}
> inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
> length))
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource

2020-02-03 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028856#comment-17028856
 ] 

Jorge Machado commented on SPARK-27990:
---

[~nchammas]: Just pass these options, like this:
{code:java}
.option("pathGlobFilter", ".*\\.png|.*\\.jpg|.*\\.jpeg|.*\\.PNG|.*\\.JPG|.*\\.JPEG")
.option("recursiveFileLookup", "true")
{code}
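For context, a sketch of how those two options fit into a full read on Spark 3.0+; the 
"image" format and the paths are illustrative assumptions, and note that pathGlobFilter 
is documented as a glob pattern (org.apache.hadoop.fs.GlobFilter syntax) rather than a regex:
{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: format, glob, and path are illustrative choices.
val spark = SparkSession.builder().appName("recursive-image-load").getOrCreate()
val images = spark.read
  .format("image")
  .option("pathGlobFilter", "*.png")
  .option("recursiveFileLookup", "true")
  .load("/data/images")
images.printSchema()
{code}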

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27990) Provide a way to recursively load data from datasource

2020-02-03 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028854#comment-17028854
 ] 

Jorge Machado commented on SPARK-27990:
---

Can we backport this to 2.4.4?

> Provide a way to recursively load data from datasource
> --
>
> Key: SPARK-27990
> URL: https://issues.apache.org/jira/browse/SPARK-27990
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Provide a way to recursively load data from datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30647) When creating a custom datasource File NotFoundExpection happens

2020-01-27 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024242#comment-17024242
 ] 

Jorge Machado commented on SPARK-30647:
---

I found a way to work around this. I just replace %20 with " " like this:
{code:java}
(file: PartitionedFile) => {
  // Undo the percent-encoding of spaces before the path is used to open the file.
  val origin = file.filePath.replace("%20", " ")
  origin
}
{code}
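A slightly more general form of the same workaround (an illustrative sketch, not what 
Spark itself does) decodes any percent-encoded character instead of only %20:
{code:java}
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.execution.datasources.PartitionedFile

// Decode every percent-encoded character before the path reaches the filesystem.
// Caveat: URLDecoder also turns '+' into a space.
def decodedPath(file: PartitionedFile): String =
  URLDecoder.decode(file.filePath, StandardCharsets.UTF_8.name())
{code}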

> When creating a custom datasource File NotFoundExpection happens
> 
>
> Key: SPARK-30647
> URL: https://issues.apache.org/jira/browse/SPARK-30647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jorge Machado
>Priority: Major
>
> Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. 
> when I pass a path or a file that has a white space it seems to fail wit the 
> error: 
> {code:java}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 
> (TID 213, localhost, executor driver): java.io.FileNotFoundException: File 
> file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
> underlying files have been updated. You can explicitly invalidate the cache 
> in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating 
> the Dataset/DataFrame involved. at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> {code}
> I'm happy to fix this if someone tells me where I need to look.  
> I think it is on org.apache.spark.rdd.InputFileBlockHolder : 
> {code:java}
> inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
> length))
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2020-01-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023935#comment-17023935
 ] 

Jorge Machado edited comment on SPARK-23148 at 1/27/20 7:22 AM:


Hi [~hyukjin.kwon] and [~henryr], I have the same problem if I create a 
custom data source:

```
class ImageFileValidator extends FileFormat with DataSourceRegister
  with Serializable
```

So the problem needs to be somewhere else.

Here is my trace:

 

I created https://issues.apache.org/jira/browse/SPARK-30647
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}


was (Author: jomach):
Hi [~hyukjin.kwon] and [~henryr]  , I have the same problem if I create a 
custom data source 

```

class ImageFileValidator extends FileFormat with DataSourceRegister with 
Serializable

```

So the Problem Needs to be in some other places. 

Here my trace: 
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: 

[jira] [Updated] (SPARK-30647) When creating a custom datasource File NotFoundExpection happens

2020-01-26 Thread Jorge Machado (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-30647:
--
Issue Type: Bug  (was: Improvement)

> When creating a custom datasource File NotFoundExpection happens
> 
>
> Key: SPARK-30647
> URL: https://issues.apache.org/jira/browse/SPARK-30647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jorge Machado
>Priority: Major
>
> Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. 
> when I pass a path or a file that has a white space it seems to fail wit the 
> error: 
> {code:java}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 
> (TID 213, localhost, executor driver): java.io.FileNotFoundException: File 
> file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
> underlying files have been updated. You can explicitly invalidate the cache 
> in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating 
> the Dataset/DataFrame involved. at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>  at 
> org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> {code}
> I'm happy to fix this if someone tells me where I need to look.  
> I think it is on org.apache.spark.rdd.InputFileBlockHolder : 
> {code:java}
> inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
> length))
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30647) When creating a custom datasource File NotFoundExpection happens

2020-01-26 Thread Jorge Machado (Jira)
Jorge Machado created SPARK-30647:
-

 Summary: When creating a custom datasource File NotFoundExpection 
happens
 Key: SPARK-30647
 URL: https://issues.apache.org/jira/browse/SPARK-30647
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: Jorge Machado


Hello, I'm creating a datasource based on FileFormat and DataSourceRegister. 

When I pass a path or a file that has a white space, it seems to fail with the 
error:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist It is possible the 
underlying files have been updated. You can explicitly invalidate the cache in 
Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the 
Dataset/DataFrame involved. at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
 at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) 
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
 at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
{code}
I'm happy to fix this if someone tells me where I need to look.

I think it is in org.apache.spark.rdd.InputFileBlockHolder:
{code:java}
inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, 
length))
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2020-01-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023935#comment-17023935
 ] 

Jorge Machado edited comment on SPARK-23148 at 1/26/20 9:29 PM:


Hi [~hyukjin.kwon], I have the same problem if I create a custom data source:

```
class ImageFileValidator extends FileFormat with DataSourceRegister
  with Serializable
```

So the problem needs to be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}


was (Author: jomach):
So I have the same problem if I create a custom data source 

```

class ImageFileValidator extends FileFormat with DataSourceRegister with 
Serializable

```

So the Problem Needs to be in some other places. 

Here my trace: 
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: 

[jira] [Comment Edited] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2020-01-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023935#comment-17023935
 ] 

Jorge Machado edited comment on SPARK-23148 at 1/26/20 9:29 PM:


Hi [~hyukjin.kwon] and [~henryr], I have the same problem if I create a 
custom data source:

```
class ImageFileValidator extends FileFormat with DataSourceRegister
  with Serializable
```

So the problem needs to be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}


was (Author: jomach):
Hi [~hyukjin.kwon] , I have the same problem if I create a custom data source 

```

class ImageFileValidator extends FileFormat with DataSourceRegister with 
Serializable

```

So the Problem Needs to be in some other places. 

Here my trace: 
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects 

[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces

2020-01-26 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023935#comment-17023935
 ] 

Jorge Machado commented on SPARK-23148:
---

So I have the same problem if I create a custom data source:

```
class ImageFileValidator extends FileFormat with DataSourceRegister
  with Serializable
```

So the problem needs to be somewhere else.

Here is my trace:
{code:java}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 
213, localhost, executor driver): java.io.FileNotFoundException: File 
file:somePath/0019_leftImg8%20bit.png does not exist
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   at 
org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
   at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
   at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)

{code}

> spark.read.csv with multiline=true gives FileNotFoundException if path 
> contains spaces
> --
>
> Key: SPARK-23148
> URL: https://issues.apache.org/jira/browse/SPARK-23148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Henry Robinson
>Priority: Major
> Fix For: 2.3.0
>
>
> Repro code:
> {code:java}
> spark.range(10).write.csv("/tmp/a b c/a.csv")
> spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count
> 10
> spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count
> java.io.FileNotFoundException: File 
> file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv
>  does not exist
> {code}
> Trying to manually escape fails in a different place:
> {code}
> spark.read.option("multiLine", true).csv("/tmp/a%20b%20c/a.csv").count
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/tmp/a%20b%20c/a.csv;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:683)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:387)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29158) Expose SerializableConfiguration for DSv2

2019-12-16 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997947#comment-16997947
 ] 

Jorge Machado commented on SPARK-29158:
---

How can we get SerializableConfiguration with 2.4.4? Is there any alternative?
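One common workaround on 2.4.x is to roll your own serializable wrapper around the 
Hadoop Configuration, mirroring what Spark's internal SerializableConfiguration does. 
A minimal sketch (the class name is illustrative):
{code:java}
import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration

// Serializable wrapper around a Hadoop Configuration, usable as a broadcast
// variable from the driver to DSv2 readers on the executors.
class SerializableHadoopConf(@transient var value: Configuration) extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)   // Configuration is a Hadoop Writable
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

// Usage: val confBc = spark.sparkContext.broadcast(new SerializableHadoopConf(hadoopConf))
{code}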

> Expose SerializableConfiguration for DSv2
> -
>
> Key: SPARK-29158
> URL: https://issues.apache.org/jira/browse/SPARK-29158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.0.0
>
>
> Since we use it frequently inside of our own DataSourceV2 implementations (13 
> times from `
>  grep -r broadcastedConf ./sql/core/src/ |grep val |wc -l`
> ) we should expose the SerializableConfiguration for DSv2 dev work



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2019-07-11 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882686#comment-16882686
 ] 

Jorge Machado commented on SPARK-24615:
---

Hi guys, is there any progress here? I would like to help with the 
implementation step. I did not see any interfaces designed yet.

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27208) RestSubmissionClient only supports http

2019-04-01 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806495#comment-16806495
 ] 

Jorge Machado commented on SPARK-27208:
---

Any help here, please? I really think this is broken.

> RestSubmissionClient only supports http
> ---
>
> Key: SPARK-27208
> URL: https://issues.apache.org/jira/browse/SPARK-27208
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.4.0
>Reporter: Jorge Machado
>Priority: Minor
>
> As stand of now the class RestSubmissionClient does not support https, which 
> fails for example if we run mesos master with ssl and in cluster mode. 
> The spark-submit command fails with: Mesos cluster mode is only supported 
> through the REST submission API
>  
> I create a PR for this which checks if the master endpoint given can speak 
> ssl before submitting the command. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27208) RestSubmissionClient only supports http

2019-03-20 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796871#comment-16796871
 ] 

Jorge Machado commented on SPARK-27208:
---

Furthermore, I'm getting the following error:
./bin/spark-submit  --class org.apache.spark.examples.SparkPi   --master 
mesos://host:5050/api --deploy-mode cluster   --conf 
spark.master.rest.enabled=true  --total-executor-cores 4 --jars 
/home/machjor/spark-2.4.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.0.jar
  
/home/machjor/spark-2.4.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.4.0.jar
 10
2019-03-18 20:39:31 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
2019-03-18 20:39:31 INFO  RestSubmissionClient:54 - Submitting a request to 
launch an application in mesos://host:5050/api.
2019-03-18 20:39:32 ERROR RestSubmissionClient:70 - Server responded with error:
Some(Failed to validate master::Call: Expecting 'type' to be present)
2019-03-18 20:39:32 ERROR RestSubmissionClient:70 - Error: Server responded 
with message of unexpected type ErrorResponse.
2019-03-18 20:39:32 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-03-18 20:39:32 INFO  ShutdownHookManager:54 - Deleting directory 
/tmp/spark-259c0e66-c2ab-43b4-90df-

> RestSubmissionClient only supports http
> ---
>
> Key: SPARK-27208
> URL: https://issues.apache.org/jira/browse/SPARK-27208
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.4.0
>Reporter: Jorge Machado
>Priority: Minor
>
> As stand of now the class RestSubmissionClient does not support https, which 
> fails for example if we run mesos master with ssl and in cluster mode. 
> The spark-submit command fails with: Mesos cluster mode is only supported 
> through the REST submission API
>  
> I create a PR for this which checks if the master endpoint given can speak 
> ssl before submitting the command. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27208) RestSubmissionClient only supports http

2019-03-20 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796860#comment-16796860
 ] 

Jorge Machado commented on SPARK-27208:
---

I would like to take this.

> RestSubmissionClient only supports http
> ---
>
> Key: SPARK-27208
> URL: https://issues.apache.org/jira/browse/SPARK-27208
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.4.0
>Reporter: Jorge Machado
>Priority: Minor
>
> As stand of now the class RestSubmissionClient does not support https, which 
> fails for example if we run mesos master with ssl and in cluster mode. 
> The spark-submit command fails with: Mesos cluster mode is only supported 
> through the REST submission API
>  
> I create a PR for this which checks if the master endpoint given can speak 
> ssl before submitting the command. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27208) RestSubmissionClient only supports http

2019-03-20 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-27208:
--
Shepherd: Sean Owen

> RestSubmissionClient only supports http
> ---
>
> Key: SPARK-27208
> URL: https://issues.apache.org/jira/browse/SPARK-27208
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.4.0
>Reporter: Jorge Machado
>Priority: Minor
>
> As stand of now the class RestSubmissionClient does not support https, which 
> fails for example if we run mesos master with ssl and in cluster mode. 
> The spark-submit command fails with: Mesos cluster mode is only supported 
> through the REST submission API
>  
> I create a PR for this which checks if the master endpoint given can speak 
> ssl before submitting the command. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27208) RestSubmissionClient only supports http

2019-03-19 Thread Jorge Machado (JIRA)
Jorge Machado created SPARK-27208:
-

 Summary: RestSubmissionClient only supports http
 Key: SPARK-27208
 URL: https://issues.apache.org/jira/browse/SPARK-27208
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.4.0
Reporter: Jorge Machado


As of now, the RestSubmissionClient class does not support HTTPS, which fails, 
for example, if we run the Mesos master with SSL and in cluster mode. 

The spark-submit command fails with: Mesos cluster mode is only supported 
through the REST submission API.

I created a PR for this which checks whether the given master endpoint can speak SSL 
before submitting the command. 
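A rough sketch of the idea behind such a check (names and structure here are 
illustrative, not the actual patch): derive the REST base URL from a configurable 
scheme instead of the hardcoded "http".
{code:java}
object RestUrlSketch {
  // Illustrative stand-ins for fields of RestSubmissionClient.
  private val supportedMasterPrefixes = Seq("spark://", "mesos://")
  private val protocolVersion = "v1"

  // Build the submissions URL with http or https depending on the master's SSL support.
  def baseUrl(master: String, useSsl: Boolean): String = {
    val scheme = if (useSsl) "https" else "http"
    val stripped = supportedMasterPrefixes
      .foldLeft(master)((url, prefix) => url.stripPrefix(prefix))
      .stripSuffix("/")
    s"$scheme://$stripped/$protocolVersion/submissions"
  }
}

// e.g. RestUrlSketch.baseUrl("mesos://master.example.com:5050", useSsl = true)
//      == "https://master.example.com:5050/v1/submissions"
{code}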



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27208) RestSubmissionClient only supports http

2019-03-19 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796544#comment-16796544
 ] 

Jorge Machado commented on SPARK-27208:
---

Relates to this as the follow-up error.

> RestSubmissionClient only supports http
> ---
>
> Key: SPARK-27208
> URL: https://issues.apache.org/jira/browse/SPARK-27208
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.4.0
>Reporter: Jorge Machado
>Priority: Minor
>
> As stand of now the class RestSubmissionClient does not support https, which 
> fails for example if we run mesos master with ssl and in cluster mode. 
> The spark-submit command fails with: Mesos cluster mode is only supported 
> through the REST submission API
>  
> I create a PR for this which checks if the master endpoint given can speak 
> ssl before submitting the command. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26324) Spark submit does not work with messos over ssl [Missing docs]

2018-12-17 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723761#comment-16723761
 ] 

Jorge Machado commented on SPARK-26324:
---

[~hyukjin.kwon] I created a PR for these docs.

> Spark submit does not work with messos over ssl [Missing docs]
> --
>
> Key: SPARK-26324
> URL: https://issues.apache.org/jira/browse/SPARK-26324
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Jorge Machado
>Priority: Major
>
> Hi guys, 
> I was trying to run the examples on a mesos cluster that uses https. I tried 
> with rest endpoint: 
> {code:java}
> ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
> mesos://:5050 --conf spark.master.rest.enabled=true 
> --deploy-mode cluster --supervise --executor-memory 10G 
> --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
> {code}
> The error that I get on the host where I started the spark-submit is:
> {code:java}
> 2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to 
> launch an application in mesos://:5050.
> 2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to 
> server mesos://:5050.
> Exception in thread "main" 
> org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect 
> to server
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable 
> to connect to server
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90)
> ... 15 more
> Caused by: java.net.SocketException: Connection reset
> {code}
> I'm pretty sure this is because of the hardcoded http:// here:
>  
>  
> {code:java}
> RestSubmissionClient.scala
> /** Return the base URL for communicating with the server, including the 
> protocol version. */
> private def getBaseUrl(master: String): String = {
>   var masterUrl = master
>   supportedMasterPrefixes.foreach { prefix =>
> if (master.startsWith(prefix)) {
>   masterUrl = master.stripPrefix(prefix)
> }
>   }
>   masterUrl = masterUrl.stripSuffix("/")
>   s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http
> }
> {code}
> Then I tried without the _--deploy-mode cluster_ and I get: 
> {code:java}
> ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
> mesos://:5050  --supervise --executor-memory 10G 
> --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
>  {code}
> On the spark console I get: 
> {code:java}
> 2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started 
> at http://_host:4040
> 2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR 
> file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar
>  at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp 
> 1544450465799
> I1210 15:01:05.963078 37943 

[jira] [Updated] (SPARK-26324) Spark submit does not work with messos over ssl [Missing docs]

2018-12-11 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-26324:
--
Summary: Spark submit does not work with messos over ssl [Missing docs]  
(was: Spark submit does not work with messos over ssl)

> Spark submit does not work with messos over ssl [Missing docs]
> --
>
> Key: SPARK-26324
> URL: https://issues.apache.org/jira/browse/SPARK-26324
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Jorge Machado
>Priority: Major
>
> Hi guys, 
> I was trying to run the examples on a mesos cluster that uses https. I tried 
> with rest endpoint: 
>  
> {code:java}
> ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
> mesos://:5050 --conf spark.master.rest.enabled=true 
> --deploy-mode cluster --supervise --executor-memory 10G 
> --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
> {code}
> The error that I get on the host where I started the spark-submit is:
>  
>  
> {code:java}
> 2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to 
> launch an application in mesos://:5050.
> 2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to 
> server mesos://:5050.
> Exception in thread "main" 
> org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect 
> to server
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable 
> to connect to server
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225)
> at 
> org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90)
> ... 15 more
> Caused by: java.net.SocketException: Connection reset
> {code}
> I'm pretty sure this is because of the hardcoded http:// here:
>  
>  
> {code:java}
> RestSubmissionClient.scala
> /** Return the base URL for communicating with the server, including the 
> protocol version. */
> private def getBaseUrl(master: String): String = {
>   var masterUrl = master
>   supportedMasterPrefixes.foreach { prefix =>
> if (master.startsWith(prefix)) {
>   masterUrl = master.stripPrefix(prefix)
> }
>   }
>   masterUrl = masterUrl.stripSuffix("/")
>   s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http
> }
> {code}
>  
> Then I tried without the _--deploy-mode cluster_ and I get: 
>  
> {code:java}
> ./spark-submit  --class org.apache.spark.examples.SparkPi --master 
> mesos://:5050  --supervise --executor-memory 10G 
> --total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
> {code}
>  
> On the spark console I get: 
>  
> {code:java}
> 2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started 
> at http://_host:4040
> 2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR 
> file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar
>  at 

[jira] [Updated] (SPARK-26324) Spark submit does not work with messos over ssl [Missing docs]

2018-12-11 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-26324:
--
Description: 
Hi guys, 

I was trying to run the examples on a Mesos cluster that uses HTTPS. I tried 
the REST endpoint first: 
{code:java}
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
mesos://:5050 --conf spark.master.rest.enabled=true 
--deploy-mode cluster --supervise --executor-memory 10G --total-executor-cores 
100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
{code}
The error that I get on the host where I started the spark-submit is:
{code:java}
2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to 
launch an application in mesos://:5050.
2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to server 
mesos://:5050.
Exception in thread "main" 
org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect 
to server
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable 
to connect to server
at 
org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90)
... 15 more
Caused by: java.net.SocketException: Connection reset
{code}
I'm pretty sure this is because of the hardcoded http:// here:

 

 
{code:java}
RestSubmissionClient.scala
/** Return the base URL for communicating with the server, including the 
protocol version. */
private def getBaseUrl(master: String): String = {
  var masterUrl = master
  supportedMasterPrefixes.foreach { prefix =>
if (master.startsWith(prefix)) {
  masterUrl = master.stripPrefix(prefix)
}
  }
  masterUrl = masterUrl.stripSuffix("/")
  s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http
}
{code}
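
For comparison, a minimal sketch of how the scheme could be derived from a flag instead of being hardcoded. This is not the actual Spark implementation; the extra useSsl parameter and the way it would be wired in are assumptions for illustration only.
{code:java}
// Sketch only, not the real RestSubmissionClient code: derive the scheme from a
// caller-provided flag instead of hardcoding "http". The useSsl parameter is a
// hypothetical addition used purely for illustration.
private def getBaseUrl(master: String, useSsl: Boolean): String = {
  var masterUrl = master
  supportedMasterPrefixes.foreach { prefix =>
    if (master.startsWith(prefix)) {
      masterUrl = master.stripPrefix(prefix)
    }
  }
  masterUrl = masterUrl.stripSuffix("/")
  val scheme = if (useSsl) "https" else "http"
  s"$scheme://$masterUrl/$PROTOCOL_VERSION/submissions"
}
{code}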
Then I tried without the _--deploy-mode cluster_ and I get: 
{code:java}
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
mesos://:5050  --supervise --executor-memory 10G 
--total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
 {code}
On the spark console I get: 
{code:java}
2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at 
http://_host:4040
2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR 
file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar
 at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp 
1544450465799
I1210 15:01:05.963078 37943 sched.cpp:232] Version: 1.3.2
I1210 15:01:05.966814 37911 sched.cpp:336] New master detected at 
master@53.54.195.251:5050
I1210 15:01:05.967010 37911 sched.cpp:352] No credentials provided. Attempting 
to register without authentication
E1210 15:01:05.967347 37942 process.cpp:2455] Failed to shutdown socket with fd 
307, address 53.54.195.251:45206: Transport endpoint is not connected
E1210 15:01:05.968212 37942 process.cpp:2369] Failed to shutdown socket with fd 
307, address 53.54.195.251:45212: Transport endpoint is not connected
E1210 15:01:05.969405 37942 process.cpp:2455] Failed to shutdown socket with fd 
307, address 53.54.195.251:45222: Transport endpoint is not 

[jira] [Updated] (SPARK-26324) Spark submit does not work with messos over ssl

2018-12-11 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-26324:
--
Description: 
Hi guys, 

I was trying to run the examples on a Mesos cluster that uses HTTPS. I tried 
the REST endpoint first: 

 
{code:java}
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
mesos://:5050 --conf spark.master.rest.enabled=true 
--deploy-mode cluster --supervise --executor-memory 10G --total-executor-cores 
100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
{code}
The error that I get on the host where I started the spark-submit is:

 

 
{code:java}
2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to 
launch an application in mesos://:5050.
2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to server 
mesos://:5050.
Exception in thread "main" 
org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect 
to server
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable 
to connect to server
at 
org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90)
... 15 more
Caused by: java.net.SocketException: Connection reset
{code}
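
Before digging into the client code, a quick handshake probe can confirm whether the master endpoint actually terminates TLS. A self-contained sketch follows; the host and port in the usage comment are placeholders, not values from this report.
{code:java}
// Diagnostic sketch only: returns true when the endpoint completes a TLS handshake.
import javax.net.ssl.{SSLSocket, SSLSocketFactory}

def speaksTls(host: String, port: Int): Boolean = {
  try {
    val socket = SSLSocketFactory.getDefault
      .createSocket(host, port)
      .asInstanceOf[SSLSocket]
    try {
      socket.startHandshake()
      true
    } finally {
      socket.close()
    }
  } catch {
    case _: Exception => false
  }
}

// Example usage with placeholder values:
// speaksTls("mesos-master.example.com", 5050)
{code}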
I'm pretty sure this is because of the hardcoded http:// here:

 

 
{code:java}
RestSubmissionClient.scala
/** Return the base URL for communicating with the server, including the 
protocol version. */
private def getBaseUrl(master: String): String = {
  var masterUrl = master
  supportedMasterPrefixes.foreach { prefix =>
if (master.startsWith(prefix)) {
  masterUrl = master.stripPrefix(prefix)
}
  }
  masterUrl = masterUrl.stripSuffix("/")
  s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http
}
{code}
 

Then I tried without the _--deploy-mode cluster_ and I get: 

 
{code:java}
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
mesos://:5050  --supervise --executor-memory 10G 
--total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
{code}
 

On the spark console I get: 

 
{code:java}
2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at 
http://_host:4040
2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR 
file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar
 at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp 
1544450465799
I1210 15:01:05.963078 37943 sched.cpp:232] Version: 1.3.2
I1210 15:01:05.966814 37911 sched.cpp:336] New master detected at 
master@53.54.195.251:5050
I1210 15:01:05.967010 37911 sched.cpp:352] No credentials provided. Attempting 
to register without authentication
E1210 15:01:05.967347 37942 process.cpp:2455] Failed to shutdown socket with fd 
307, address 53.54.195.251:45206: Transport endpoint is not connected
E1210 15:01:05.968212 37942 process.cpp:2369] Failed to shutdown socket with fd 
307, address 53.54.195.251:45212: Transport endpoint is not connected
E1210 15:01:05.969405 37942 process.cpp:2455] Failed to shutdown socket with fd 
307, address 53.54.195.251:45222: 

[jira] [Updated] (SPARK-26324) Spark submit does not work with messos over ssl

2018-12-11 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-26324:
--
Description: 
Hi guys, 

I was trying to run the examples on a Mesos cluster that uses HTTPS. I tried 
the REST endpoint first: 

 
{code:java}
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
mesos://:5050 --conf spark.master.rest.enabled=true 
--deploy-mode cluster --supervise --executor-memory 10G --total-executor-cores 
100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
{code}
The error that I get on the host where I started the spark-submit is:

 

 
{code:java}
2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to 
launch an application in mesos://:5050.
2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to server 
mesos://:5050.
Exception in thread "main" 
org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect 
to server
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable 
to connect to server
at 
org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90)
... 15 more
Caused by: java.net.SocketException: Connection reset
{code}
I'm pretty sure this is because of the hardcoded http:// here:

 

 
{code:java}
RestSubmissionClient.scala
/** Return the base URL for communicating with the server, including the 
protocol version. */
private def getBaseUrl(master: String): String = {
  var masterUrl = master
  supportedMasterPrefixes.foreach { prefix =>
if (master.startsWith(prefix)) {
  masterUrl = master.stripPrefix(prefix)
}
  }
  masterUrl = masterUrl.stripSuffix("/")
  s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http
}
{code}
 

Then I tried without the _--deploy-mode cluster_ and I get: 

 
{code:java}
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
mesos://:5050  --supervise --executor-memory 10G 
--total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
{code}
 

On the spark console I get: 

 
{code:java}
2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at 
http://_host:4040
2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR 
file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar
 at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp 
1544450465799
I1210 15:01:05.963078 37943 sched.cpp:232] Version: 1.3.2
I1210 15:01:05.966814 37911 sched.cpp:336] New master detected at 
master@53.54.195.251:5050
I1210 15:01:05.967010 37911 sched.cpp:352] No credentials provided. Attempting 
to register without authentication
E1210 15:01:05.967347 37942 process.cpp:2455] Failed to shutdown socket with fd 
307, address 53.54.195.251:45206: Transport endpoint is not connected
E1210 15:01:05.968212 37942 process.cpp:2369] Failed to shutdown socket with fd 
307, address 53.54.195.251:45212: Transport endpoint is not connected
E1210 15:01:05.969405 37942 process.cpp:2455] Failed to shutdown socket with fd 
307, address 53.54.195.251:45222: 

[jira] [Created] (SPARK-26324) Spark submit does not work with messos over ssl

2018-12-10 Thread Jorge Machado (JIRA)
Jorge Machado created SPARK-26324:
-

 Summary: Spark submit does not work with messos over ssl
 Key: SPARK-26324
 URL: https://issues.apache.org/jira/browse/SPARK-26324
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.0
Reporter: Jorge Machado


Hi guys, 

I was trying to run the examples on a Mesos cluster that uses HTTPS. I tried 
the REST endpoint first: 

 
{code:java}
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
mesos://:5050 --conf spark.master.rest.enabled=true 
--deploy-mode cluster --supervise --executor-memory 10G --total-executor-cores 
100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
{code}
The error that I get on the host where I started the spark-submit is:

 

 
{code:java}
2018-12-10 15:08:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
2018-12-10 15:08:39 INFO RestSubmissionClient:54 - Submitting a request to 
launch an application in mesos://:5050.
2018-12-10 15:08:39 WARN RestSubmissionClient:66 - Unable to connect to server 
mesos://:5050.
Exception in thread "main" 
org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect 
to server
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:104)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:86)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:86)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.run(RestSubmissionClient.scala:443)
at 
org.apache.spark.deploy.rest.RestSubmissionClientApp.start(RestSubmissionClient.scala:455)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable 
to connect to server
at 
org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:281)
at 
org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:225)
at 
org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:90)
... 15 more
Caused by: java.net.SocketException: Connection reset
{code}
I'm pretty sure this is because of the hardcoded http:// here:

 

 
{code:java}
RestSubmissionClient.scala
/** Return the base URL for communicating with the server, including the 
protocol version. */
private def getBaseUrl(master: String): String = {
  var masterUrl = master
  supportedMasterPrefixes.foreach { prefix =>
if (master.startsWith(prefix)) {
  masterUrl = master.stripPrefix(prefix)
}
  }
  masterUrl = masterUrl.stripSuffix("/")
  s"http://$masterUrl/$PROTOCOL_VERSION/submissions; <--- hardcoded http
}
{code}
 

Then I tried without the _--deploy-mode cluster_ and I get: 

 
{code:java}
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
mesos://:5050  --supervise --executor-memory 10G 
--total-executor-cores 100 ../examples/jars/spark-examples_2.11-2.4.0.jar 1000
{code}
 

On the spark console I get: 

 
{code:java}
2018-12-10 15:01:05 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at 
http://_host:4040
2018-12-10 15:01:05 INFO SparkContext:54 - Added JAR 
file:/home//spark-2.4.0-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.0.jar
 at spark://_host:35719/jars/spark-examples_2.11-2.4.0.jar with timestamp 
1544450465799
I1210 15:01:05.963078 37943 sched.cpp:232] Version: 1.3.2
I1210 15:01:05.966814 37911 sched.cpp:336] New master detected at 
master@53.54.195.251:5050
I1210 15:01:05.967010 37911 sched.cpp:352] No credentials provided. Attempting 
to register without authentication
E1210 15:01:05.967347 37942 process.cpp:2455] Failed to shutdown socket with fd 
307, address 53.54.195.251:45206: Transport endpoint is not connected
E1210 15:01:05.968212 37942 process.cpp:2369] Failed to shutdown socket with fd 
307, 

[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2018-12-10 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714697#comment-16714697
 ] 

Jorge Machado commented on SPARK-19320:
---

Is this still active? The question also arises: what about Kubernetes?

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Priority: Major
>
> Currently the only configuration for using GPUs with Mesos is setting the 
> maximum amount of GPUs a job will take from an offer, but doesn't guarantee 
> exactly how much.
> We should have a configuration that sets this.
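
For context, a short sketch of what can be set today versus what this ticket asks for. The second property name below is hypothetical and does not exist; it is shown only to illustrate the kind of setting requested.
{code:java}
import org.apache.spark.SparkConf

// What exists today for Mesos: only an upper bound on GPUs accepted from offers.
val conf = new SparkConf()
  .set("spark.mesos.gpus.max", "2")

// What this ticket asks for -- a guaranteed amount. The property name below is
// hypothetical, used purely for illustration.
// conf.set("spark.mesos.gpus.min", "2")
{code}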



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2018-10-22 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659253#comment-16659253
 ] 

Jorge Machado commented on SPARK-20055:
---

This was taken over in another ticket. This can be closed.

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both, In particular, ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> that affect the type/column name of data.
> In that sense, I think we might better document CSV datasets with some 
> examples too because I believe reading CSV is pretty much common use cases.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point each API 
> documentation
> - Fix trivial minor fixes together in both sections.
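
As a concrete illustration of the kind of snippet such a CSV section could include, using the "header" and "inferSchema" options called out above; the file path is a placeholder.
{code:java}
// Minimal CSV example of the sort the guide section could show; the path is a placeholder.
val csvDf = spark.read
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // infer column types instead of defaulting to string
  .csv("path/to/people.csv")

csvDf.printSchema()
{code}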



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Attachment: (was: testDF.parquet)

> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jorge Machado
>Priority: Major
> Fix For: 2.3.0
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.
> The dataset as circa 800 Rows and the projection_factor has values from 0 
> until 100. the result should not be bigger that 5 but with get 
> 265820543091454 as result back.
>  
>  
> As Code not 100% the same but I think there is really a bug there: 
>  
> {code:java}
> BigDecimal [] objects = new BigDecimal[]{
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D)};
> Row dataRow = new GenericRow(objects);
> Row dataRow2 = new GenericRow(objects);
> StructType structType = new StructType()
> .add("id1", DataTypes.createDecimalType(38,10), true)
> .add("id2", DataTypes.createDecimalType(38,10), true)
> .add("id3", DataTypes.createDecimalType(38,10), true)
> .add("id4", DataTypes.createDecimalType(38,10), true);
> final Dataset dataFrame = 
> sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
> System.out.println(dataFrame.schema());
> dataFrame.show();
> final Dataset  df1 = dataFrame.groupBy("id1","id2")
> .agg( mean("id3").alias("projection_factor"));
> df1.show();
> final Dataset df2 = df1
> .groupBy("id1")
> .agg(max("projection_factor"));
> df2.show();
> {code}
>  
> The df2 should have:
> {code:java}
> +------------+----------------------+
> |         id1|max(projection_factor)|
> +------------+----------------------+
> |3.5714285714|          3.5714285714|
> +------------+----------------------+
> {code}
> instead it returns: 
> {code:java}
> +------------+----------------------+
> |         id1|max(projection_factor)|
> +------------+----------------------+
> |3.5714285714|      0.00035714285714|
> +------------+----------------------+
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado resolved SPARK-24401.
---
Resolution: Fixed

This was already fixed in 2.3.0.

> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jorge Machado
>Priority: Major
> Fix For: 2.3.0
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.
> The dataset as circa 800 Rows and the projection_factor has values from 0 
> until 100. the result should not be bigger that 5 but with get 
> 265820543091454 as result back.
>  
>  
> As Code not 100% the same but I think there is really a bug there: 
>  
> {code:java}
> BigDecimal [] objects = new BigDecimal[]{
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D)};
> Row dataRow = new GenericRow(objects);
> Row dataRow2 = new GenericRow(objects);
> StructType structType = new StructType()
> .add("id1", DataTypes.createDecimalType(38,10), true)
> .add("id2", DataTypes.createDecimalType(38,10), true)
> .add("id3", DataTypes.createDecimalType(38,10), true)
> .add("id4", DataTypes.createDecimalType(38,10), true);
> final Dataset dataFrame = 
> sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
> System.out.println(dataFrame.schema());
> dataFrame.show();
> final Dataset  df1 = dataFrame.groupBy("id1","id2")
> .agg( mean("id3").alias("projection_factor"));
> df1.show();
> final Dataset df2 = df1
> .groupBy("id1")
> .agg(max("projection_factor"));
> df2.show();
> {code}
>  
> The df2 should have:
> {code:java}
> +------------+----------------------+
> |         id1|max(projection_factor)|
> +------------+----------------------+
> |3.5714285714|          3.5714285714|
> +------------+----------------------+
> {code}
> instead it returns: 
> {code:java}
> +------------+----------------------+
> |         id1|max(projection_factor)|
> +------------+----------------------+
> |3.5714285714|      0.00035714285714|
> +------------+----------------------+
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Affects Version/s: (was: 2.3.0)
Fix Version/s: 2.3.0

> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jorge Machado
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: testDF.parquet
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.
> The dataset as circa 800 Rows and the projection_factor has values from 0 
> until 100. the result should not be bigger that 5 but with get 
> 265820543091454 as result back.
>  
>  
> As Code not 100% the same but I think there is really a bug there: 
>  
> {code:java}
> BigDecimal [] objects = new BigDecimal[]{
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D)};
> Row dataRow = new GenericRow(objects);
> Row dataRow2 = new GenericRow(objects);
> StructType structType = new StructType()
> .add("id1", DataTypes.createDecimalType(38,10), true)
> .add("id2", DataTypes.createDecimalType(38,10), true)
> .add("id3", DataTypes.createDecimalType(38,10), true)
> .add("id4", DataTypes.createDecimalType(38,10), true);
> final Dataset dataFrame = 
> sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
> System.out.println(dataFrame.schema());
> dataFrame.show();
> final Dataset  df1 = dataFrame.groupBy("id1","id2")
> .agg( mean("id3").alias("projection_factor"));
> df1.show();
> final Dataset df2 = df1
> .groupBy("id1")
> .agg(max("projection_factor"));
> df2.show();
> {code}
>  
> The df2 should have:
> {code:java}
> +------------+----------------------+
> |         id1|max(projection_factor)|
> +------------+----------------------+
> |3.5714285714|          3.5714285714|
> +------------+----------------------+
> {code}
> instead it returns: 
> {code:java}
> +------------+----------------------+
> |         id1|max(projection_factor)|
> +------------+----------------------+
> |3.5714285714|      0.00035714285714|
> +------------+----------------------+
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493118#comment-16493118
 ] 

Jorge Machado commented on SPARK-24401:
---

Yes, I just tried my code and it works on Spark 2.3.0.

> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jorge Machado
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: testDF.parquet
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.
> The dataset as circa 800 Rows and the projection_factor has values from 0 
> until 100. the result should not be bigger that 5 but with get 
> 265820543091454 as result back.
>  
>  
> As Code not 100% the same but I think there is really a bug there: 
>  
> {code:java}
> BigDecimal [] objects = new BigDecimal[]{
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D)};
> Row dataRow = new GenericRow(objects);
> Row dataRow2 = new GenericRow(objects);
> StructType structType = new StructType()
> .add("id1", DataTypes.createDecimalType(38,10), true)
> .add("id2", DataTypes.createDecimalType(38,10), true)
> .add("id3", DataTypes.createDecimalType(38,10), true)
> .add("id4", DataTypes.createDecimalType(38,10), true);
> final Dataset dataFrame = 
> sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
> System.out.println(dataFrame.schema());
> dataFrame.show();
> final Dataset  df1 = dataFrame.groupBy("id1","id2")
> .agg( mean("id3").alias("projection_factor"));
> df1.show();
> final Dataset df2 = df1
> .groupBy("id1")
> .agg(max("projection_factor"));
> df2.show();
> {code}
>  
> The df2 should have:
> {code:java}
> +------------+----------------------+
> |         id1|max(projection_factor)|
> +------------+----------------------+
> |3.5714285714|          3.5714285714|
> +------------+----------------------+
> {code}
> instead it returns: 
> {code:java}
> +------------+----------------------+
> |         id1|max(projection_factor)|
> +------------+----------------------+
> |3.5714285714|      0.00035714285714|
> +------------+----------------------+
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Description: 
Hi, 

I think I found a really ugly bug in Spark when performing aggregations with 
Decimals.

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
The first aggregation works fine; the second aggregation seems to sum instead of 
taking the max value. I tried with Spark 2.2.0 and 2.3.0, same problem.

The dataset has circa 800 rows and projection_factor has values from 0 to 100. 
The result should not be bigger than 100, but we get 265820543091454 back.

 

 

The code below is not 100% the same, but I think there is really a bug there: 

 
{code:java}
import java.math.BigDecimal;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.expressions.GenericRow;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.mean;

BigDecimal[] objects = new BigDecimal[]{
    new BigDecimal(3.5714285714D),
    new BigDecimal(3.5714285714D),
    new BigDecimal(3.5714285714D),
    new BigDecimal(3.5714285714D)};
Row dataRow = new GenericRow(objects);
Row dataRow2 = new GenericRow(objects);
StructType structType = new StructType()
    .add("id1", DataTypes.createDecimalType(38, 10), true)
    .add("id2", DataTypes.createDecimalType(38, 10), true)
    .add("id3", DataTypes.createDecimalType(38, 10), true)
    .add("id4", DataTypes.createDecimalType(38, 10), true);

final Dataset<Row> dataFrame =
    sparkSession.createDataFrame(Arrays.asList(dataRow, dataRow2), structType);
System.out.println(dataFrame.schema());
dataFrame.show();
final Dataset<Row> df1 = dataFrame.groupBy("id1", "id2")
    .agg(mean("id3").alias("projection_factor"));
df1.show();
final Dataset<Row> df2 = df1
    .groupBy("id1")
    .agg(max("projection_factor"));

df2.show();
{code}
 

The df2 should have:
{code:java}
+------------+----------------------+
|         id1|max(projection_factor)|
+------------+----------------------+
|3.5714285714|          3.5714285714|
+------------+----------------------+
{code}
instead it returns: 
{code:java}
+------------+----------------------+
|         id1|max(projection_factor)|
+------------+----------------------+
|3.5714285714|      0.00035714285714|
+------------+----------------------+
{code}
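
For reference, a self-contained variant of the reproduction that does not depend on the attached parquet file; the literal values, column names, and app name are illustrative assumptions only.
{code:java}
// Self-contained sketch of the reproduction above (values are illustrative):
// aggregate a Decimal column twice, mean first, then max/min on the result.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, mean, min}

val spark = SparkSession.builder().master("local[*]").appName("decimal-agg").getOrCreate()
import spark.implicits._

val fact_df = Seq(
  (BigDecimal(1), BigDecimal(1), "2018-01-01", BigDecimal("3.5714285714")),
  (BigDecimal(1), BigDecimal(1), "2018-01-02", BigDecimal("3.5714285714"))
).toDF("id1", "id2", "start_date", "projection_factor")

val first_agg = fact_df.groupBy("id1", "id2", "start_date")
  .agg(mean("projection_factor").alias("projection_factor"))
val second_agg = first_agg.groupBy("id1", "id2")
  .agg(max("projection_factor").alias("maxf"), min("projection_factor").alias("minf"))

second_agg.printSchema()  // inspect the Decimal precision/scale of maxf and minf
second_agg.show()
{code}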
 

 

  was:
Hi, 

I think I found a really ugly bug in spark when performing aggregations with 
Decimals

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
First aggregation works fine the second aggregation seems to be summing instead 
of max value. I tried with spark 2.2.0 and 2.3.0 same problem.

The dataset as circa 800 Rows and the projection_factor has values from 0 until 
100. the result should not be bigger that 5 but with get 265820543091454 as 
result back.

 

 

As Code not 100% the same but I think there is really a bug there:

 
{code:java}
BigDecimal [] objects = new BigDecimal[]{
new BigDecimal(3.5714285714D),
new BigDecimal(3.5714285714D),
new BigDecimal(3.5714285714D),
new BigDecimal(3.5714285714D)};
Row dataRow = new GenericRow(objects);
Row dataRow2 = new GenericRow(objects);
StructType structType = new StructType()
.add("id1", DataTypes.createDecimalType(38,10), true)
.add("id2", DataTypes.createDecimalType(38,10), true)
.add("id3", DataTypes.createDecimalType(38,10), true)
.add("id4", DataTypes.createDecimalType(38,10), true);

final Dataset dataFrame = 
sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
System.out.println(dataFrame.schema());
dataFrame.show();
final Dataset  df1 = dataFrame.groupBy("id1","id2")
.agg( mean("id3").alias("projection_factor"));
df1.show();
final Dataset df2 = df1
.groupBy("id1")
.agg(max("projection_factor"));

df2.show();
{code}


> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jorge Machado
>Priority: Major
> Attachments: testDF.parquet
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> 

[jira] [Updated] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Description: 
Hi, 

I think I found a really ugly bug in spark when performing aggregations with 
Decimals

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
First aggregation works fine the second aggregation seems to be summing instead 
of max value. I tried with spark 2.2.0 and 2.3.0 same problem.

The dataset as circa 800 Rows and the projection_factor has values from 0 until 
100. the result should not be bigger that 5 but with get 265820543091454 as 
result back.

 

 

As Code not 100% the same but I think there is really a bug there:

 
{code:java}
BigDecimal [] objects = new BigDecimal[]{
new BigDecimal(3.5714285714D),
new BigDecimal(3.5714285714D),
new BigDecimal(3.5714285714D),
new BigDecimal(3.5714285714D)};
Row dataRow = new GenericRow(objects);
Row dataRow2 = new GenericRow(objects);
StructType structType = new StructType()
.add("id1", DataTypes.createDecimalType(38,10), true)
.add("id2", DataTypes.createDecimalType(38,10), true)
.add("id3", DataTypes.createDecimalType(38,10), true)
.add("id4", DataTypes.createDecimalType(38,10), true);

final Dataset dataFrame = 
sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
System.out.println(dataFrame.schema());
dataFrame.show();
final Dataset  df1 = dataFrame.groupBy("id1","id2")
.agg( mean("id3").alias("projection_factor"));
df1.show();
final Dataset df2 = df1
.groupBy("id1")
.agg(max("projection_factor"));

df2.show();
{code}

  was:
Hi, 

I think I found a really ugly bug in spark when performing aggregations with 
Decimals

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
First aggregation works fine the second aggregation seems to be summing instead 
of max value. I tried with spark 2.2.0 and 2.3.0 same problem.

The dataset as circa 800 Rows and the projection_factor has values from 0 until 
100. the result should not be bigger that 5 but with get 265820543091454 as 
result back.


> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jorge Machado
>Priority: Major
> Attachments: testDF.parquet
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.
> The dataset as circa 800 Rows and the projection_factor has values from 0 
> until 100. the result should not be bigger that 5 but with get 
> 265820543091454 as result back.
>  
>  
> As Code not 100% the same but I think there is really a bug there:
>  
> {code:java}
> BigDecimal [] objects = new BigDecimal[]{
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D),
> new BigDecimal(3.5714285714D)};
> Row dataRow = new GenericRow(objects);
> Row dataRow2 = new GenericRow(objects);
> StructType structType = new StructType()
> .add("id1", DataTypes.createDecimalType(38,10), true)
> .add("id2", DataTypes.createDecimalType(38,10), true)
> .add("id3", DataTypes.createDecimalType(38,10), true)
> .add("id4", DataTypes.createDecimalType(38,10), true);
> final Dataset dataFrame = 
> sparkSession.createDataFrame(Arrays.asList(dataRow,dataRow2), structType);
> System.out.println(dataFrame.schema());
> dataFrame.show();
> final Dataset  df1 = dataFrame.groupBy("id1","id2")
> .agg( 

[jira] [Updated] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Description: 
Hi, 

I think I found a really ugly bug in spark when performing aggregations with 
Decimals

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
First aggregation works fine the second aggregation seems to be summing instead 
of max value. I tried with spark 2.2.0 and 2.3.0 same problem.

The dataset as circa 800 Rows and the projection_factor has values from 0 until 
100. the result should not be bigger that 5 but with get 265820543091454 as 
result back.

  was:
Hi, 

I think I found a really ugly bug in spark when performing aggregations with 
Decimals

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
First aggregation works fine the second aggregation seems to be summing instead 
of max value. I tried with spark 2.2.0 and 2.3.0 same problem.

The dataset as circa 800 Rows and the projection_factor has values from 0 until 
5. the result should not be bigger that 5 but with get 265820543091454 as 
result back.


> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jorge Machado
>Priority: Major
> Attachments: testDF.parquet
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.
> The dataset as circa 800 Rows and the projection_factor has values from 0 
> until 100. the result should not be bigger that 5 but with get 
> 265820543091454 as result back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Description: 
Hi, 

I think I found a really ugly bug in spark when performing aggregations with 
Decimals

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
First aggregation works fine the second aggregation seems to be summing instead 
of max value. I tried with spark 2.2.0 and 2.3.0 same problem.

The dataset as circa 800 Rows and the projection_factor has values from 0 until 
5. the result should not be bigger that 5 but with get 265820543091454 as 
result back.

  was:
Hi, 

I think I found a really ugly bug in spark when performing aggregations with 
Decimals

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = fact_df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
First aggregation works fine the second aggregation seems to be summing instead 
of max value. I tried with spark 2.2.0 and 2.3.0 same problem.


> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jorge Machado
>Priority: Major
> Attachments: testDF.parquet
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.
> The dataset as circa 800 Rows and the projection_factor has values from 0 
> until 5. the result should not be bigger that 5 but with get 
> 265820543091454 as result back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Attachment: testDF.parquet

> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jorge Machado
>Priority: Major
> Attachments: testDF.parquet
>
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Attachment: (was: testDF.parquet)

> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jorge Machado
>Priority: Major
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492326#comment-16492326
 ] 

Jorge Machado commented on SPARK-24401:
---

It is probably worth mentioning that my dataset comes from an Oracle DB.

> Aggreate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jorge Machado
>Priority: Major
>
> Hi, 
> I think I found a really ugly bug in spark when performing aggregations with 
> Decimals
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = fact_df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> First aggregation works fine the second aggregation seems to be summing 
> instead of max value. I tried with spark 2.2.0 and 2.3.0 same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24401) Aggreate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)
Jorge Machado created SPARK-24401:
-

 Summary: Aggreate on Decimal Types does not work
 Key: SPARK-24401
 URL: https://issues.apache.org/jira/browse/SPARK-24401
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.0
Reporter: Jorge Machado
 Attachments: testDF.parquet

Hi, 

I think I found a really ugly bug in Spark when performing aggregations with 
Decimals.

To reproduce: 

 
{code:java}
val df = spark.read.parquet("attached file")
val first_agg = df.groupBy("id1", "id2", 
"start_date").agg(mean("projection_factor").alias("projection_factor"))
first_agg.show
val second_agg = 
first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
min("projection_factor").alias("minf"))
second_agg.show
{code}
The first aggregation works fine, but the second aggregation seems to be summing 
instead of taking the max value. I tried with Spark 2.2.0 and 2.3.0; same problem.
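
Since the parquet file is only available as a ticket attachment, the snippet below is a minimal self-contained sketch of the same two-step aggregation. The column names follow the snippet above; the decimal values and the DecimalType(38, 18) precision/scale are assumptions, not the original data.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, mean, min}
import org.apache.spark.sql.types.DecimalType

// Stand-in for the attached parquet file: a few rows per (id1, id2, start_date)
// with a high-precision decimal column.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 10, "2018-01-01", BigDecimal("0.123456789012345678")),
  (1, 10, "2018-01-02", BigDecimal("0.223456789012345678")),
  (1, 10, "2018-01-03", BigDecimal("0.323456789012345678"))
).toDF("id1", "id2", "start_date", "projection_factor")
 .withColumn("projection_factor", $"projection_factor".cast(DecimalType(38, 18)))

val first_agg = df.groupBy("id1", "id2", "start_date")
  .agg(mean("projection_factor").alias("projection_factor"))
val second_agg = first_agg.groupBy("id1", "id2")
  .agg(max("projection_factor").alias("maxf"), min("projection_factor").alias("minf"))

first_agg.show(false)
// maxf and minf should both lie inside the range of the per-date means above;
// if maxf looks like their sum instead, the behaviour described here is reproduced.
second_agg.show(false)
{code}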



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24401) Aggregate on Decimal Types does not work

2018-05-28 Thread Jorge Machado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-24401:
--
Attachment: testDF.parquet

> Aggregate on Decimal Types does not work
> ---
>
> Key: SPARK-24401
> URL: https://issues.apache.org/jira/browse/SPARK-24401
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jorge Machado
>Priority: Major
> Attachments: testDF.parquet
>
>
> Hi, 
> I think I found a really ugly bug in Spark when performing aggregations with 
> Decimals.
> To reproduce: 
>  
> {code:java}
> val df = spark.read.parquet("attached file")
> val first_agg = df.groupBy("id1", "id2", 
> "start_date").agg(mean("projection_factor").alias("projection_factor"))
> first_agg.show
> val second_agg = 
> first_agg.groupBy("id1","id2").agg(max("projection_factor").alias("maxf"), 
> min("projection_factor").alias("minf"))
> second_agg.show
> {code}
> The first aggregation works fine, but the second aggregation seems to be summing 
> instead of taking the max value. I tried with Spark 2.2.0 and 2.3.0; same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2018-05-25 Thread Jorge Machado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-17592:
--
Comment: was deleted

(was: I'm afraid I'm hitting the same issue, but in a slightly different way. When I 
save a dataframe (that comes from an Oracle DB) as parquet, I can see in the logs 
that the field is being saved as integer: 

 

{ "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", 
"nullable" : true, "metadata" : { } },... 

 

In Hue (which reads from Hive) I see: 

!image-2018-05-24-17-10-24-515.png!)

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Priority: Major
> Attachments: image-2018-05-24-17-10-24-515.png
>
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string to an INT. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble)
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.
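
For anyone comparing the two engines side by side, the sketch below (Spark 2.x assumed) prints Spark's own cast next to a Hive-like truncation obtained by casting through DOUBLE and flooring explicitly. It is an illustration of the difference, not a fix.

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: compare Spark's CAST(string AS INT) with an explicit
// floor-through-double variant that mimics the Hive/MySQL behaviour above.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

Seq("0.4", "0.5", "0.6").foreach { s =>
  val sparkCast = spark.sql(s"SELECT CAST('$s' AS INT)").first().get(0)
  val hiveLike  = spark.sql(s"SELECT CAST(FLOOR(CAST('$s' AS DOUBLE)) AS INT)").first().get(0)
  println(s"'$s' -> spark cast: $sparkCast, hive-like floor: $hiveLike")
}
{code}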



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2018-05-24 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489185#comment-16489185
 ] 

Jorge Machado edited comment on SPARK-17592 at 5/24/18 3:11 PM:


I'm afraid I'm hitting the same issue, but in a slightly different way. When I save 
a dataframe (that comes from an Oracle DB) as parquet, I can see in the logs that 
the field is being saved as integer: 

 

{ "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", 
"nullable" : true, "metadata" : { } },... 

 

In Hue (which reads from Hive) I see: 

!image-2018-05-24-17-10-24-515.png!


was (Author: jomach):
I'm afraid I'm hitting the same issue, but in a slightly different way. When I save 
a dataframe as parquet, I can see in the logs that the field is being saved as 
integer: 

 

{ "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", 
"nullable" : true, "metadata" : { } },... 

 

In Hue (which reads from Hive) I see: 

!image-2018-05-24-17-10-24-515.png!

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Priority: Major
> Attachments: image-2018-05-24-17-10-24-515.png
>
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string to an INT. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble)
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2018-05-24 Thread Jorge Machado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Machado updated SPARK-17592:
--
Attachment: image-2018-05-24-17-10-24-515.png

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Priority: Major
> Attachments: image-2018-05-24-17-10-24-515.png
>
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string to an INT. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble)
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2018-05-24 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489185#comment-16489185
 ] 

Jorge Machado commented on SPARK-17592:
---

I'm afraid I'm hitting the same issue, but in a slightly different way. When I save 
a dataframe as parquet, I can see in the logs that the field is being saved as 
integer: 

 

{ "type" : "struct", "fields" : [ \{ "name" : "project_id", "type" : "integer", 
"nullable" : true, "metadata" : { } },... 

 

In Hue (which reads from Hive) I see: 

!image-2018-05-24-17-10-24-515.png!

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Priority: Major
> Attachments: image-2018-05-24-17-10-24-515.png
>
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string to an INT. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble)
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-05 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194129#comment-16194129
 ] 

Jorge Machado commented on SPARK-20055:
---

[~aash] Should I copy-paste those options? There are already some docs:


{code:java}
def csv(paths: String*): DataFrame
Loads CSV files and returns the result as a DataFrame.

This function will go through the input once to determine the input schema if 
inferSchema is enabled. To avoid going through the entire data once, disable 
inferSchema option or specify the schema explicitly using schema.

You can set the following CSV-specific options to deal with CSV files:

sep (default ,): sets the single character as a separator for each field and 
value.
encoding (default UTF-8): decodes the CSV files by the given encoding type.
quote (default "): sets the single character used for escaping quoted values 
where the separator can be part of the value. If you would like to turn off 
quotations, you need to set not null but an empty string. This behaviour is 
different from com.databricks.spark.csv.
escape (default \): sets the single character used for escaping quotes inside 
an already quoted value.
comment (default empty string): sets the single character used for skipping 
lines beginning with this character. By default, it is disabled.
header (default false): uses the first line as names of columns.
inferSchema (default false): infers the input schema automatically from data. 
It requires one extra pass over the data.
ignoreLeadingWhiteSpace (default false): a flag indicating whether or not 
leading whitespaces from values being read should be skipped.
ignoreTrailingWhiteSpace (default false): a flag indicating whether or not 
trailing whitespaces from values being read should be skipped.
nullValue (default empty string): sets the string representation of a null 
value. Since 2.0.1, this applies to all supported types including the string 
type.
nanValue (default NaN): sets the string representation of a non-number value.
positiveInf (default Inf): sets the string representation of a positive 
infinity value.
negativeInf (default -Inf): sets the string representation of a negative 
infinity value.
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. 
Custom date formats follow the formats at java.text.SimpleDateFormat. This 
applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that 
indicates a timestamp format. Custom date formats follow the formats at 
java.text.SimpleDateFormat. This applies to timestamp type.
maxColumns (default 20480): defines a hard limit of how many columns a record 
can have.
maxCharsPerColumn (default -1): defines the maximum number of characters 
allowed for any given value being read. By default, it is -1 meaning unlimited 
length.
mode (default PERMISSIVE): allows a mode for dealing with corrupt records 
during parsing. It supports the following case-insensitive modes.
PERMISSIVE : sets other fields to null when it meets a corrupted record, and 
puts the malformed string into a field configured by columnNameOfCorruptRecord. 
To keep corrupt records, a user can set a string type field named 
columnNameOfCorruptRecord in a user-defined schema. If the schema does not have 
the field, it drops corrupt records during parsing. When the length of parsed CSV 
tokens is shorter than the expected length of the schema, it sets null for the 
extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST : throws an exception when it meets corrupted records.
columnNameOfCorruptRecord (default is the value specified in 
spark.sql.columnNameOfCorruptRecord): allows renaming the new field having 
malformed string created by PERMISSIVE mode. This overrides 
spark.sql.columnNameOfCorruptRecord.
multiLine (default false): parse one record, which may span multiple lines.
{code}
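
For the guide itself, a short example that combines a few of these options might be enough. The sketch below is illustrative only; the sample path points at the semicolon-separated people.csv shipped with the Spark examples, but any CSV file with a header would do.

{code:java}
import org.apache.spark.sql.SparkSession

// Illustrative only: a CSV read combining a few of the options listed above.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

val people = spark.read
  .option("sep", ";")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .option("mode", "DROPMALFORMED")
  .csv("examples/src/main/resources/people.csv")

people.printSchema()
people.show()
{code}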

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both, In particular, ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> that affect the 

[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide

2017-10-04 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191232#comment-16191232
 ] 

Jorge Machado commented on SPARK-20055:
---

[~hyukjin.kwon] just added some docs. 

> Documentation for CSV datasets in SQL programming guide
> ---
>
> Key: SPARK-20055
> URL: https://issues.apache.org/jira/browse/SPARK-20055
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> I guess things commonly used and important are documented there rather than 
> documenting everything and every option in the programming guide - 
> http://spark.apache.org/docs/latest/sql-programming-guide.html.
> It seems JSON datasets 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> are documented whereas CSV datasets are not. 
> Nowadays, they are pretty similar in APIs and options. Some options are 
> notable for both; in particular, ones such as {{wholeFile}}. Moreover, 
> several options such as {{inferSchema}} and {{header}} are important in CSV 
> that affect the type/column name of data.
> In that sense, I think we might better document CSV datasets with some 
> examples too because I believe reading CSV is a pretty common use case.
> Also, I think we could also leave some pointers for options of API 
> documentations for both (rather than duplicating the documentation).
> So, my suggestion is,
> - Add CSV Datasets section.
> - Add links for options for both JSON and CSV that point each API 
> documentation
> - Fix trivial minor fixes together in both sections.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2017-03-22 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222
 ] 

Jorge Machado edited comment on SPARK-5236 at 3/22/17 12:52 PM:


[~marmbrus] Hi Michael, I'm experiencing the same issue. I'm building a 
data source for HBase with a custom schema. I'm on 1.6.3.

I traced it down to GeneratePredicate.scala ((r: InternalRow) => p.eval(r)); in 
my example it tries to do asInstanceOf[MutableLong] on a String, which fails. I 
have a filter on the dataframe and a groupBy count.

{noformat}

/**
  *
  * @param schema this is how the row has to look like. The returned value from 
the next must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
val schema: StructType,
val repositoryHistory: 
DeviceHistoryRepository,
val timeZoneId: String,
val tablePartitionInfo: 
TablePartitionInfo,
val from: Long,
val to: Long) extends 
Iterator[InternalRow] {

private val internalItr: ClosableIterator[TagValue[Double]]= 
repositoryHistory.scanTagValues(from, to, tablePartitionInfo)

override def hasNext: Boolean = internalItr.hasNext

override def next(): InternalRow = {
val tagValue = internalItr.next()
val instant = 
ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp), 
ZoneId.of(timeZoneId)).toInstant
val timestamp = Timestamp.from(instant)

InternalRow.fromSeq(Array(tagValue.getTimestamp,tagValue.getGuid,tagValue.getGuid,tagValue.getValue))
val mutableRow = new SpecificMutableRow(schema.fields.map(f=> 
f.dataType))
for (i <- schema.fields.indices){
updateMutableRow(i,tagValue,mutableRow, schema(i) )
}
mutableRow
}

def updateMutableRow(i: Int, tagValue: TagValue[Double], row: 
SpecificMutableRow, field:StructField): Unit = {
//#TODO this is ugly.
field.name match {
case "Date" => 
row.setLong(i,tagValue.getTimestamp.toLong)
case "Device" => 
row.update(i,UTF8String.fromString(tagValue.getGuid))
case "Tag" => 
row.update(i,UTF8String.fromString(tagValue.getTagName))
case "TagValue" => row.setDouble(i,tagValue.getValue)
}
}

override def toString():String ={
"Iterator for Region Name "+tablePartitionInfo.getRegionName+" 
Range:"+from+ "until" + "to"
}
}
{noformat}

Then I get : 

{noformat}
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

{noformat}


was (Author: jomach):
[~marmbrus] Hi Michael, I'm experiencing the same issue. I'm building a 
data source for HBase with a custom schema. I'm on 1.6.3.

I traced it down to GeneratePredicate.scala ((r: InternalRow) => p.eval(r))

{noformat}

/**
  *
  * @param schema this is how the row has to look like. The returned value from 
the next must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,

[jira] [Comment Edited] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2017-03-22 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222
 ] 

Jorge Machado edited comment on SPARK-5236 at 3/22/17 12:47 PM:


[~marmbrus] Hi Michael, I'm experiencing the same issue. I'm building a 
data source for HBase with a custom schema. I'm on 1.6.3.

I traced it down to GeneratePredicate.scala ((r: InternalRow) => p.eval(r))

{noformat}

/**
  *
  * @param schema this is how the row has to look like. The returned value from 
the next must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
val schema: StructType,
val repositoryHistory: 
DeviceHistoryRepository,
val timeZoneId: String,
val tablePartitionInfo: 
TablePartitionInfo,
val from: Long,
val to: Long) extends 
Iterator[InternalRow] {

private val internalItr: ClosableIterator[TagValue[Double]]= 
repositoryHistory.scanTagValues(from, to, tablePartitionInfo)

override def hasNext: Boolean = internalItr.hasNext

override def next(): InternalRow = {
val tagValue = internalItr.next()
val instant = 
ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp), 
ZoneId.of(timeZoneId)).toInstant
val timestamp = Timestamp.from(instant)

InternalRow.fromSeq(Array(tagValue.getTimestamp,tagValue.getGuid,tagValue.getGuid,tagValue.getValue))
val mutableRow = new SpecificMutableRow(schema.fields.map(f=> 
f.dataType))
for (i <- schema.fields.indices){
updateMutableRow(i,tagValue,mutableRow, schema(i) )
}
mutableRow
}

def updateMutableRow(i: Int, tagValue: TagValue[Double], row: 
SpecificMutableRow, field:StructField): Unit = {
//#TODO this is ugly.
field.name match {
case "Date" => 
row.setLong(i,tagValue.getTimestamp.toLong)
case "Device" => 
row.update(i,UTF8String.fromString(tagValue.getGuid))
case "Tag" => 
row.update(i,UTF8String.fromString(tagValue.getTagName))
case "TagValue" => row.setDouble(i,tagValue.getValue)
}
}

override def toString():String ={
"Iterator for Region Name "+tablePartitionInfo.getRegionName+" 
Range:"+from+ "until" + "to"
}
}
{noformat}

Then I get : 

{noformat}
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

{noformat}


was (Author: jomach):
[~marmbrus] Hi Michael, I'm experiencing the same issue. I'm building a 
data source for HBase with a custom schema. 

{noformat}

/**
  *
  * @param schema this is how the row has to look like. The returned value from 
the next must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
val schema: StructType,
val repositoryHistory: 
DeviceHistoryRepository,
val timeZoneId: String,
 

[jira] [Commented] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2017-03-22 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222
 ] 

Jorge Machado commented on SPARK-5236:
--

[~marmbrus] Hi Michael, I'm experiencing the same issue. I'm building a 
data source for HBase with a custom schema. 

{noformat}

/**
  *
  * @param schema this is how the row has to look like. The returned value from 
the next must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
val schema: StructType,
val repositoryHistory: 
DeviceHistoryRepository,
val timeZoneId: String,
val tablePartitionInfo: 
TablePartitionInfo,
val from: Long,
val to: Long) extends 
Iterator[InternalRow] {

private val internalItr: ClosableIterator[TagValue[Double]]= 
repositoryHistory.scanTagValues(from, to, tablePartitionInfo)

override def hasNext: Boolean = internalItr.hasNext

override def next(): InternalRow = {
val tagValue = internalItr.next()
val instant = 
ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp), 
ZoneId.of(timeZoneId)).toInstant
val timestamp = Timestamp.from(instant)

InternalRow.fromSeq(Array(tagValue.getTimestamp,tagValue.getGuid,tagValue.getGuid,tagValue.getValue))
val mutableRow = new SpecificMutableRow(schema.fields.map(f=> 
f.dataType))
for (i <- schema.fields.indices){
updateMutableRow(i,tagValue,mutableRow, schema(i) )
}
mutableRow
}

def updateMutableRow(i: Int, tagValue: TagValue[Double], row: 
SpecificMutableRow, field:StructField): Unit = {
//#TODO this is ugly.
field.name match {
case "Date" => 
row.setLong(i,tagValue.getTimestamp.toLong)
case "Device" => 
row.update(i,UTF8String.fromString(tagValue.getGuid))
case "Tag" => 
row.update(i,UTF8String.fromString(tagValue.getTagName))
case "TagValue" => row.setDouble(i,tagValue.getValue)
}
}

override def toString():String ={
"Iterator for Region Name "+tablePartitionInfo.getRegionName+" 
Range:"+from+ "until" + "to"
}
}
{noformat}

Then I get : 

{noformat}
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

{noformat}
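
One way to avoid this class of mismatch is to derive every write from the declared schema instead of hard-coding one setter per field name, so the runtime value type always matches what the generated predicate expects. The sketch below is not the original data source; toInternalRow is a hypothetical helper and only handles the types used here.

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Hedged sketch: convert a name->value map into an InternalRow driven by the
// declared schema, failing loudly on any type the helper does not cover.
def toInternalRow(schema: StructType, values: Map[String, Any]): InternalRow =
  InternalRow.fromSeq(schema.fields.map { f =>
    (f.dataType, values(f.name)) match {
      case (LongType, v: Long)      => v
      case (DoubleType, v: Double)  => v
      case (StringType, v: String)  => UTF8String.fromString(v)
      case (dt, v)                  => sys.error(s"unhandled type $dt for value $v")
    }
  })
{code}

In the iterator above, next() could then build the row with something like toInternalRow(schema, Map("Date" -> tagValue.getTimestamp, "Device" -> tagValue.getGuid, "Tag" -> tagValue.getTagName, "TagValue" -> tagValue.getValue)), so a schema change cannot silently desynchronise the setters.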

> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
> -
>
> Key: SPARK-5236
> URL: https://issues.apache.org/jira/browse/SPARK-5236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Alex Baretta
>
> {code}
> 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
> (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
> at 
> 

[jira] [Commented] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns

2017-01-26 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840227#comment-15840227
 ] 

Jorge Machado commented on SPARK-15505:
---

Sometimes you read from a database that has an array in it; at least that was my 
case. But feel free to reject it. 

> Explode nested Array in DF Column into Multiple Columns 
> 
>
> Key: SPARK-15505
> URL: https://issues.apache.org/jira/browse/SPARK-15505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Jorge Machado
>Priority: Minor
>
> At the moment if we have a DF like this : 
> {noformat}
> +--+-+
> | Col1 | Col2|
> +--+-+
> |  1   |[2, 3, 4]|
> |  1   |[2, 3, 4]|
> +--+-+
> {noformat}
> There is no way to directly transform it into : 
> {noformat}
> +--+--+--+--+
> | Col1 | Col2 | Col3 | Col4 |
> +--+--+--+--+
> |  1   |  2   |  3   |  4   |
> |  1   |  2   |  3   |  4   |
> +--+--+--+--+ 
> {noformat}
> I think this should be easy to implement.
> More info here: 
> http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns

2017-01-25 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15839365#comment-15839365
 ] 

Jorge Machado commented on SPARK-15505:
---

[~hyukjin.kwon] you don't always know whether you have two or five elements 
in the second column... 

> Explode nested Array in DF Column into Multiple Columns 
> 
>
> Key: SPARK-15505
> URL: https://issues.apache.org/jira/browse/SPARK-15505
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Jorge Machado
>Priority: Minor
>
> At the moment if we have a DF like this : 
> {noformat}
> +--+-+
> | Col1 | Col2|
> +--+-+
> |  1   |[2, 3, 4]|
> |  1   |[2, 3, 4]|
> +--+-+
> {noformat}
> There is no way to directly transform it into : 
> {noformat}
> +--+--+--+--+
> | Col1 | Col2 | Col3 | Col4 |
> +--+--+--+--+
> |  1   |  2   |  3   |  4   |
> |  1   |  2   |  3   |  4   |
> +--+--+--+--+ 
> {noformat}
> I think this should be easy to implement.
> More info here: 
> http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18858) reduceByKey not available on Dataset

2016-12-14 Thread Jorge Machado (JIRA)
Jorge Machado created SPARK-18858:
-

 Summary: reduceByKey not available on Dataset
 Key: SPARK-18858
 URL: https://issues.apache.org/jira/browse/SPARK-18858
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.2
Reporter: Jorge Machado
Priority: Minor


Hi, 

I don't really know whether this is a bug or not, 
but given a Dataset it should be possible to do reduceByKey, shouldn't it? 

At the moment I have worked around it with ds.rdd.reduceByKey.

reduce does not have the same signature.
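
For reference, a sketch of what I believe is the Dataset-native equivalent (Spark 2.x): groupByKey followed by reduceGroups, which avoids dropping down to the RDD API. The example data is made up.

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch: reduceByKey expressed on a Dataset via groupByKey + reduceGroups.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 5)).toDS()

val reduced = ds
  .groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2))
  .map { case (key, (_, sum)) => (key, sum) }

reduced.show()   // expected: (a, 3), (b, 5)
{code}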



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15505) Explode nested Array in DF Column into Multiple Columns

2016-05-24 Thread Jorge Machado (JIRA)
Jorge Machado created SPARK-15505:
-

 Summary: Explode nested Array in DF Column into Multiple Columns 
 Key: SPARK-15505
 URL: https://issues.apache.org/jira/browse/SPARK-15505
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.1
Reporter: Jorge Machado
Priority: Minor


At the moment if we have a DF like this : 

+--+-+
| Col1 | Col2|
+--+-+
|  1   |[2, 3, 4]|
|  1   |[2, 3, 4]|
+--+-+

There is no way to directly transform it into : 

+--+--+--+--+
| Col1 | Col2 | Col3 | Col4 |
+--+--+--+--+
|  1   |  2   |  3   |  4   |
|  1   |  2   |  3   |  4   |
+--+--+--+--+ 

I think this should be easy to implement.

More info here: 
http://stackoverflow.com/questions/37391241/explode-spark-columns/37392793#37392793
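
Until something like this exists, the workaround from the linked answer can be sketched as follows. It uses Spark 2.x syntax for brevity and, crucially, assumes the arrays have a known fixed length.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Workaround sketch for fixed-length arrays: pull elements out by index with getItem.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, Seq(2, 3, 4)), (1, Seq(2, 3, 4))).toDF("Col1", "Col2")

val widened = df.select(
  col("Col1"),
  col("Col2").getItem(0).as("Col2"),
  col("Col2").getItem(1).as("Col3"),
  col("Col2").getItem(2).as("Col4"))

widened.show()
{code}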



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-05-04 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15270350#comment-15270350
 ] 

Jorge Machado commented on SPARK-12988:
---

+1

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).
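
A hedged workaround sketch until drop handles this (Spark 2.x syntax for brevity): rebuild the projection without the dotted column, quoting each name in backticks so it is taken literally.

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: select everything except the dotted column instead of calling drop().
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 1)).toDF("a_b", "a.c")
val keep = df.columns.filter(_ != "a.c").map(c => df.col(s"`$c`"))
df.select(keep: _*).printSchema()   // only a_b remains
{code}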



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10286) Add @since annotation to pyspark.ml.param and pyspark.ml.*

2015-09-18 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805180#comment-14805180
 ] 

Jorge Machado commented on SPARK-10286:
---

Hi, I'm new here. First post. 

I think this issue was already solved in SPARK-10281.

> Add @since annotation to pyspark.ml.param and pyspark.ml.*
> --
>
> Key: SPARK-10286
> URL: https://issues.apache.org/jira/browse/SPARK-10286
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org