[jira] [Assigned] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows

2017-11-17 Thread Jakub Nowacki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakub Nowacki reassigned SPARK-22495:
-

Assignee: Jakub Nowacki

> Fix setup of SPARK_HOME variable on Windows
> ---
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Jakub Nowacki
>Priority: Minor
>
> On Windows, pip-installed PySpark is unable to find the Spark home. There is 
> already a proposed change, with sufficient details and discussion, in 
> https://github.com/apache/spark/pull/19370 and SPARK-18136






[jira] [Resolved] (SPARK-22212) Some SQL functions in Python fail with string column name

2017-10-10 Thread Jakub Nowacki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakub Nowacki resolved SPARK-22212.
---
Resolution: Later

Keeping this resolution on hold until a consensus on API unification is reached. 

> Some SQL functions in Python fail with string column name 
> --
>
> Key: SPARK-22212
> URL: https://issues.apache.org/jira/browse/SPARK-22212
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>Priority: Minor
>
> Most of the functions in {{pyspark.sql.functions}} accept either a column 
> name string or a {{Column}} object. However, some functions, like {{trim}}, 
> require a {{Column}} to be passed. See the code below for an illustration.
> {code}
> >>> import pyspark.sql.functions as func
> >>> df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])
> >>> df.select(func.trim(df["text"])).show()
> +----------+
> |trim(text)|
> +----------+
> |         a|
> |         b|
> |         c|
> |         d|
> |         e|
> +----------+
> >>> df.select(func.trim("text")).show()
> [...]
> Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.trim. Trace:
> py4j.Py4JException: Method trim([class java.lang.String]) does not exist
> at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
> at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
> at py4j.Gateway.invoke(Gateway.java:274)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> This is because most of the Python wrappers convert a column name string to a 
> {{Column}} before calling into the JVM, but functions created via 
> {{_create_function}} pass the argument as is if it is not a {{Column}}.
> I am preparing a PR with the proposed fix.






[jira] [Created] (SPARK-22212) Some SQL functions in Python fail with string column name

2017-10-06 Thread Jakub Nowacki (JIRA)
Jakub Nowacki created SPARK-22212:
-

 Summary: Some SQL functions in Python fail with string column name 
 Key: SPARK-22212
 URL: https://issues.apache.org/jira/browse/SPARK-22212
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Jakub Nowacki
Priority: Minor


Most of the functions in {{pyspark.sql.functions}} accept either a column name 
string or a {{Column}} object. However, some functions, like {{trim}}, require a 
{{Column}} to be passed. See the code below for an illustration.

{code}
>>> import pyspark.sql.functions as func
>>> df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])
>>> df.select(func.trim(df["text"])).show()
+----------+
|trim(text)|
+----------+
|         a|
|         b|
|         c|
|         d|
|         e|
+----------+
>>> df.select(func.trim("text")).show()
[...]
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.trim. Trace:
py4j.Py4JException: Method trim([class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
{code}

This is because most of the Python wrappers convert a column name string to a 
{{Column}} before calling into the JVM, but functions created via 
{{_create_function}} pass the argument as is if it is not a {{Column}}.

I am preparing a PR with the proposed fix.
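
A user-level workaround (a minimal sketch, not the proposed patch) is to convert 
the name explicitly with {{col}} before handing it to {{trim}}:

{code}
# Workaround sketch: wrap the column name in col() so trim() always receives a
# Column object; only the bare string form triggers the Py4J error above.
import pyspark.sql.functions as func
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])

df.select(func.trim(func.col("text"))).show()
{code}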






[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-09-27 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16183077#comment-16183077
 ] 

Jakub Nowacki commented on SPARK-18136:
---

PR 19370 (https://github.com/apache/spark/pull/19370) fixes the {{SPARK_HOME}} 
issue using the {{find_spark_home.py}} script. It may not be the most elegant 
approach, but it is simple.

I think in the long run it would be better to move to a Python packaging 
mechanism like {{console_scripts}} or similar, as it provides better 
multi-platform support; see 
https://packaging.python.org/tutorials/distributing-packages/#scripts and 
https://setuptools.readthedocs.io/en/latest/setuptools.html#automatic-script-creation.
I'll create a separate issue with an improvement proposal.
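
For illustration only, a {{console_scripts}} entry point could look roughly like 
the sketch below; the module path and function name 
({{pyspark.scripts:launch_shell}}) are hypothetical, not the actual packaging layout:

{code}
# Hypothetical setup.py fragment illustrating the console_scripts mechanism;
# the module path and function name are placeholders, not the real PySpark layout.
from setuptools import setup, find_packages

setup(
    name="pyspark",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # setuptools generates a platform-appropriate launcher for this
            # entry (including an .exe shim on Windows).
            "pyspark = pyspark.scripts:launch_shell",
        ],
    },
)
{code}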

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
> Fix For: 2.2.1, 2.3.0, 2.1.3
>
>
> Make sure that pip installer for PySpark works on windows






[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-09-25 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179746#comment-16179746
 ] 

Jakub Nowacki commented on SPARK-18136:
---

I think I can come back to this issue on Wednesday. I did some preliminary 
tests with {{find_spark_home.py}}, but I won't have time to sit down with it 
until then.

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
> Fix For: 2.2.1, 2.3.0, 2.1.3
>
>
> Make sure that pip installer for PySpark works on windows






[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-09-23 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177795#comment-16177795
 ] 

Jakub Nowacki commented on SPARK-18136:
---

I've looked into it again and noticed the Bash script {{find_spark_home}}, 
which is used in the Bash version of the {{pyspark}} command. The Python script 
{{find_spark_home.py}} seems to return the correct SPARK_HOME path on Windows, 
so all the cmd files should be altered to use it instead of {{%~dp0}}. I'll 
look into it when I have time, maybe next week, and propose something similar 
to the {{find_spark_home}} script approach.
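
A minimal sketch of the kind of lookup {{find_spark_home.py}} performs 
(illustrative only, not the exact script contents): fall back from an explicitly 
set {{SPARK_HOME}} to the location of the installed {{pyspark}} package.

{code}
# Illustrative sketch of a find_spark_home-style lookup; not the actual script.
import importlib.util
import os


def find_spark_home():
    # An explicitly set SPARK_HOME always wins.
    if os.environ.get("SPARK_HOME"):
        return os.environ["SPARK_HOME"]
    # Otherwise fall back to the installed pyspark package directory, which is
    # where pip/conda place the jars and bin scripts.
    spec = importlib.util.find_spec("pyspark")
    return os.path.dirname(spec.origin)


if __name__ == "__main__":
    print(find_spark_home())
{code}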

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> Make sure that pip installer for PySpark works on windows






[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-09-21 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175303#comment-16175303
 ] 

Jakub Nowacki commented on SPARK-18136:
---

I've tried using the Windows {{mklink}} command to create symbolic links, but 
{{%~dp0}} still seems to resolve to the Scripts folder 
{{C:\Tools\Anaconda3\Scripts\}}.

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> Make sure that pip installer for PySpark works on windows






[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-09-21 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175276#comment-16175276
 ] 

Jakub Nowacki commented on SPARK-18136:
---

[PR|https://github.com/apache/spark/pull/19310] fixes how {{spark-class2.cmd}} 
looks for the jars directory on Windows. It fails to find the jars and start the 
JVM because the condition setting the env variable {{SPARK_JARS_DIR}} looks for 
{{%SPARK_HOME%\RELEASE}}, which is not included in the {{pip}}/{{conda}} build. 
Instead, it should look for {{%SPARK_HOME%\jars}}, which is what it refers to 
later anyway.

The above fixes the errors when importing {{pyspark}} into Python and creating a 
SparkSession, but there is still an issue with calling {{pyspark.cmd}}. Namely, a 
plain call on the command line, without specifying the path, fails with 
{{System cannot find the path specified.}}. This is likely due to the script link 
being resolved to the Scripts folder in Anaconda, e.g. 
{{C:\Tools\Anaconda3\Scripts\pyspark.cmd}}. If the script is run via the full 
path to the PySpark package, e.g. 
{{\Tools\Anaconda3\Lib\site-packages\pyspark\bin\pyspark.cmd}}, it works fine. 
This is likely because {{SPARK_HOME}} is resolved via {{set 
SPARK_HOME=%~dp0..}}, which in the case of the plain call resolves (likely) to 
{{\Tools\Anaconda3\}} when it should resolve to 
{{\Tools\Anaconda3\Lib\site-packages\pyspark\}}. Since I don't know CMD 
scripting that well, I haven't found a solution to this issue yet, apart from 
the workaround, i.e. calling it with the full (direct) path.
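
For comparison, the directory that {{SPARK_HOME}} should point at in a 
{{pip}}/{{conda}} install can be checked from Python itself (a quick diagnostic 
sketch; the Anaconda-style paths above are just an example layout):

{code}
# Quick diagnostic (assumes a pip/conda-installed pyspark): print where the
# real pyspark package and its bin scripts live, i.e. what %~dp0.. should
# have resolved to.
import os
import pyspark

pkg_dir = os.path.dirname(pyspark.__file__)
print("SPARK_HOME should be:", pkg_dir)
print("bin scripts live in:", os.path.join(pkg_dir, "bin"))
{code}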

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> Make sure that pip installer for PySpark works on windows






[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-23 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138948#comment-16138948
 ] 

Jakub Nowacki commented on SPARK-21752:
---

OK, I did one more test and, indeed, on the newest version 2.2.0 (and also 
2.1.1) all three configurations work fine; though I'm pretty sure one did not 
work at least once, but maybe that was a coincidence. I investigated further, and 
when I rolled back to 2.0.2, which I have on a different setup, only 
{{PYSPARK_SUBMIT_ARGS}} worked reliably and the other ones didn't; maybe in that 
version the {{config}}-based approaches behave non-deterministically. Thus, this 
seems to be an issue for versions up to 2.0.2; for the newer ones it seems to 
work, though I'm not sure if it does all the time.

The first question is whether there is a way to check if the setup that works 
for 2.1.1 and 2.2.0 can still stop working on occasion. Also, should we still 
have some form of documentation describing the safer way of configuration?

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this is related to creating new {{SparkSession}} as getting new 
> packages into existing {{SparkSession}} doesn't indeed make sense. Thus this 
> will only work with bare Python, Scala or Java, and not on {{pyspark}} or 
> {{spark-shell}} as they create the session automatically; it this case one 
> would need to use {{--packages}} option. 






[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-21 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135015#comment-16135015
 ] 

Jakub Nowacki commented on SPARK-21752:
---

[~srowen] Do you think we can create some sort of guidelines, in the 
documentation or as a separate document, regarding the usage of configuration, 
especially in the notebook environment, as I suggested above? For example, we 
could say that most of the config should go into the env variable, but basically 
highlight what the safest way to use it is.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this is related to creating new {{SparkSession}} as getting new 
> packages into existing {{SparkSession}} doesn't indeed make sense. Thus this 
> will only work with bare Python, Scala or Java, and not on {{pyspark}} or 
> {{spark-shell}} as they create the session automatically; it this case one 
> would need to use {{--packages}} option. 






[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-18 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131945#comment-16131945
 ] 

Jakub Nowacki edited comment on SPARK-21752 at 8/18/17 9:11 AM:


[~skonto] What you are doing is in fact manually starting pyspark 
({{shell.py}}) inside Jupyter, which creates the SparkSession, so what I wrote 
above doesn't have any effect, as it is the same as running the pyspark command.

A more Pythonic way of installing it is adding the modules from the bundled 
{{python}} folder to PYTHONPATH (e.g. 
http://sigdelta.com/blog/how-to-install-pyspark-locally/), which is very 
similar to what happens when you use a {{pip}}/{{conda}} install. Also, I am 
referring to a plain Python kernel in Jupyter (or any other Python interpreter) 
started without executing {{shell.py}}. BTW, you can create kernels in Jupyter, 
e.g. https://gist.github.com/cogfor/903c911c9b1963dcd530bbc0b9d9f0ce, which 
will work as a pyspark shell, similar to your setup.

While I understand that using {{master}} or {{spark.jars.packages}} in the 
config is not a desired behavior, I'd like to work out a preferred way of 
passing configuration options to SparkSession, especially for notebook users. 
Also, my experience is that many of the options other than {{master}} and 
{{spark.jars.packages}} work quite well with the SparkSession config, e.g. 
{{spark.executor.memory}} etc., which sometimes need to be tuned to run 
specific jobs; in generic jobs I always rely on the defaults, which I often 
tune for a specific cluster.

So my question is: in case we need to add some custom configuration to PySpark 
submission, should interactive Python users:
# add *all* configuration to {{PYSPARK_SUBMIT_ARGS}},
# add some configuration like {{master}} or {{packages}} to 
{{PYSPARK_SUBMIT_ARGS}} while the rest is passed in the SparkSession config, 
maybe also documenting which ones they are, or
# should we fix something in SparkSession creation to make the SparkSession 
config equally effective to {{PYSPARK_SUBMIT_ARGS}}?

Also, sometimes we know that e.g. a job (not interactive, run by 
{{spark-submit}}) requires more executor memory or a different number of 
partitions. Could we in this case use the SparkSession config, or should each 
of these tuned parameters be passed via {{spark-submit}} arguments?

I'm happy to extend the documentation with such a section for Python users, as I 
don't think this is currently clear and it would be very useful.


was (Author: jsnowacki):
[~skonto] What you are doing is in fact starting manually pyspark 
({{shell.py}}) inside jupyter, which creates SparkSession, so what I written 
above doesn't have any effect as it is the same as running pyspark command.

More Pythonic way of installing it is either adding modules to PYTHONPATH from 
the bundle {{python}} folder (e.g. 
http://sigdelta.com/blog/how-to-install-pyspark-locally/), which is very 
similar to what happens when you use {{pip}}/{{conda}} install. Also, I am 
referring to a plain python kernel in Jupyter (or any other python interpreter) 
started without executing {{shell.py}}. BTW you can create kernels in Jupyter 
e.g. https://gist.github.com/cogfor/903c911c9b1963dcd530bbc0b9d9f0ce, which 
will work as pyspark shell, similar to your setup

While I understand that not desired behavior to use {{master}} or 
{{spark.jars.packages}} in the config, I'd like to work out a preferred way of 
passing configuration options to SparkSession, especially for notebook users. 
Also, my experience is that many of the options other than  {{master}} and 
{{spark.jars.packages}} work quite well with the SparkSession config, e.g. 
{{spark.executor.memory}} etc, which are sometimes need to be tuned to run some 
specific jobs; in a generic jobs I always rely on the defaults, which I often 
tune for a specific cluster.

So my question is: in case we need to add some custom configuration to PySpark 
submission, should interactive Python users:
# add *all* configurations to {{PYSPARK_SUBMIT_ARGS}}
# some configuration like {{master}} or {{packages}} to to 
{{PYSPARK_SUBMIT_ARGS}} but others can be passed in the SparkSession config, 
maybe also saying which ones they are
# we should fix something in SparkSession creation to make SparkSession config 
equally effective to {{PYSPARK_SUBMIT_ARGS}}

Also, sometimes we know that e.g. job (not interactive, run by 
{{spark-submit}}) requires more executor memory or different number of 
partitions. Could we in this case use SparkSession config or each of these 
tuned parameters should be passed via {{spark-submit}} arguments?

I'm happy to extend the documentation with such section for Python users as I 
don't think it's clear currently and would be very useful for python users.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: 

[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-18 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131945#comment-16131945
 ] 

Jakub Nowacki commented on SPARK-21752:
---

[~skonto] What you are doing is in fact manually starting pyspark 
({{shell.py}}) inside Jupyter, which creates the SparkSession, so what I wrote 
above doesn't have any effect, as it is the same as running the pyspark command.

A more Pythonic way of installing it is adding the modules from the bundled 
{{python}} folder to PYTHONPATH (e.g. 
http://sigdelta.com/blog/how-to-install-pyspark-locally/), which is very 
similar to what happens when you use a {{pip}}/{{conda}} install. Also, I am 
referring to a plain Python kernel in Jupyter (or any other Python interpreter) 
started without executing {{shell.py}}. BTW, you can create kernels in Jupyter, 
e.g. https://gist.github.com/cogfor/903c911c9b1963dcd530bbc0b9d9f0ce, which 
will work as a pyspark shell, similar to your setup.

While I understand that using {{master}} or {{spark.jars.packages}} in the 
config is not a desired behavior, I'd like to work out a preferred way of 
passing configuration options to SparkSession, especially for notebook users. 
Also, my experience is that many of the options other than {{master}} and 
{{spark.jars.packages}} work quite well with the SparkSession config, e.g. 
{{spark.executor.memory}} etc., which sometimes need to be tuned to run 
specific jobs; in generic jobs I always rely on the defaults, which I often 
tune for a specific cluster.

So my question is: in case we need to add some custom configuration to PySpark 
submission, should interactive Python users:
# add *all* configuration to {{PYSPARK_SUBMIT_ARGS}},
# add some configuration like {{master}} or {{packages}} to 
{{PYSPARK_SUBMIT_ARGS}} while the rest is passed in the SparkSession config 
(see the sketch below), maybe also documenting which ones they are, or
# should we fix something in SparkSession creation to make the SparkSession 
config equally effective to {{PYSPARK_SUBMIT_ARGS}}?

Also, sometimes we know that e.g. a job (not interactive, run by 
{{spark-submit}}) requires more executor memory or a different number of 
partitions. Could we in this case use the SparkSession config, or should each 
of these tuned parameters be passed via {{spark-submit}} arguments?

I'm happy to extend the documentation with such a section for Python users, as I 
don't think this is currently clear and it would be very useful.
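
To make option 2 above concrete, a hedged sketch of the hybrid approach for a 
plain Python or notebook session (the option values are examples only, not 
recommendations):

{code}
# Sketch of option 2: submit-level settings (master, packages) go through
# PYSPARK_SUBMIT_ARGS before pyspark is imported, while runtime tunables go
# through the builder config. All values here are examples.
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master local[*] "
    "--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 "
    "pyspark-shell"
)

import pyspark

spark = pyspark.sql.SparkSession.builder \
    .appName("notebook-job") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "64") \
    .getOrCreate()
{code}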

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this is related to creating new {{SparkSession}} as getting new 
> packages into existing {{SparkSession}} doesn't indeed make sense. Thus this 
> will only work with bare Python, Scala or Java, and not on {{pyspark}} or 
> {{spark-shell}} as they create the session automatically; it this case one 
> would need to use {{--packages}} option. 



[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130282#comment-16130282
 ] 

Jakub Nowacki commented on SPARK-21752:
---

[~skonto] Well, I'm not sure where you're failing here. If you want to get 
PySpark installed with a vanilla Python distribution you can do {{pip install 
pyspark}} or {{conda install -c conda-forge pyspark}}. Other than that, the 
above scripts are complete, bar the {{import pyspark}}, as I mentioned before. 
Below I give a slightly more complete example with the env variable:
{code}
import os
import pyspark

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()

l = [("Bilbo Baggins",  50),
 ("Gandalf", 1000),
 ("Thorin", 195),
 ("Balin", 178),
 ("Kili", 77),
 ("Dwalin", 169),
 ("Oin", 167),
 ("Gloin", 158),
 ("Fili", 82),
 ("Bombur", None)]

people = spark.createDataFrame(l, ["name", "age"])

people.write \
.format("com.mongodb.spark.sql.DefaultSource") \
.mode("append") \
.save()

spark.read \
.format("com.mongodb.spark.sql.DefaultSource") \
.load() \
.show()
{code}

and with the {{SparkConfig}} approach:
{code}
import pyspark

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()

l = [("Bilbo Baggins",  50),
 ("Gandalf", 1000),
 ("Thorin", 195),
 ("Balin", 178),
 ("Kili", 77),
 ("Dwalin", 169),
 ("Oin", 167),
 ("Gloin", 158),
 ("Fili", 82),
 ("Bombur", None)]

people = spark.createDataFrame(l, ["name", "age"])

people.write \
.format("com.mongodb.spark.sql.DefaultSource") \
.mode("append") \
.save()

spark.read \
.format("com.mongodb.spark.sql.DefaultSource") \
.load() \
.show()
{code}

and with the plain {{SparkSession}} config:

{code}
import pyspark

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()

l = [("Bilbo Baggins",  50),
 ("Gandalf", 1000),
 ("Thorin", 195),
 ("Balin", 178),
 ("Kili", 77),
 ("Dwalin", 169),
 ("Oin", 167),
 ("Gloin", 158),
 ("Fili", 82),
 ("Bombur", None)]

people = spark.createDataFrame(l, ["name", "age"])

people.write \
.format("com.mongodb.spark.sql.DefaultSource") \
.mode("append") \
.save()

spark.read \
.format("com.mongodb.spark.sql.DefaultSource") \
.load() \
.show()
{code}

In my case the first two work as expected, and the last one fails with 
{{ClassNotFoundException}}.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> 

[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130274#comment-16130274
 ] 

Jakub Nowacki commented on SPARK-21752:
---

OK, I get the point. I think we should only consider this in an interactive, 
notebook-based environment. I certainly don't use {{master}} when executing via 
{{spark-submit}}, and using packages internally should also be discouraged.

I think it should be a bit clearer in the documentation what can and cannot be 
used. Also, interactive environments like Jupyter and similar should be treated 
as an exception, or a clearer description of the setup should be provided.

Also, especially when using the above setting with packages, there is no 
warning that this option is effectively ignored; maybe one should be added, 
similar to the one about reusing an existing SparkSession, i.e. 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L896
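
As a stopgap on the user side (a hedged sketch, not an existing Spark check), 
one can at least verify after {{getOrCreate()}} whether the requested packages 
made it into the live session's configuration:

{code}
# Hedged user-side check, not an existing Spark feature: warn when the
# requested packages are missing from the session's effective configuration
# (e.g. because the setting was silently ignored). Assumes `spark` exists.
import warnings

requested = "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"
effective = spark.sparkContext.getConf().get("spark.jars.packages", "")
if requested not in effective:
    warnings.warn("spark.jars.packages was not applied to this SparkSession; "
                  "pass it via PYSPARK_SUBMIT_ARGS or spark-submit --packages")
{code}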

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this is related to creating new {{SparkSession}} as getting new 
> packages into existing {{SparkSession}} doesn't indeed make sense. Thus this 
> will only work with bare Python, Scala or Java, and not on {{pyspark}} or 
> {{spark-shell}} as they create the session automatically; it this case one 
> would need to use {{--packages}} option. 






[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130150#comment-16130150
 ] 

Jakub Nowacki commented on SPARK-21752:
---

[~skonto] Jupyter does not pass on many environment variables. That is why they 
are often set inside the notebook, as I did above.

[~skonto] [~jerryshao] I still don't fully get the discussion here. As described 
in the [Spark SQL 
docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession]
 you can normally create a SparkSession and, to my knowledge, that is done in 
practically all Spark languages as described in the docs. Indeed, when using 
command-line shells like {{pyspark}} you should not do that, but it is fine in 
pure Python. Also, my experience is that such SparkSession creation works well 
and is deterministic.



> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this is related to creating new {{SparkSession}} as getting new 
> packages into existing {{SparkSession}} doesn't indeed make sense. Thus this 
> will only work with bare Python, Scala or Java, and not on {{pyspark}} or 
> {{spark-shell}} as they create the session automatically; it this case one 
> would need to use {{--packages}} option. 






[jira] [Updated] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakub Nowacki updated SPARK-21752:
--
Description: 
If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as 
follows:
{code}
spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()
{code}
the SparkSession gets created but there are no package download logs printed, 
and if I use the loaded classes, Mongo connector in this case, but it's the 
same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
missing classes.

If I use the config file {{conf/spark-defaults.conf}} or the command line option 
{{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
{code}

The above is in Python but I've seen the behavior in other languages, though, I 
didn't check R. 

I also have seen it in older Spark versions.

It seems that this is the only config key that doesn't work for me via the 
{{SparkSession}} builder config.

Note that this is related to creating a new {{SparkSession}}, as getting new 
packages into an existing {{SparkSession}} doesn't really make sense. Thus this 
will only work with bare Python, Scala or Java, and not in {{pyspark}} or 
{{spark-shell}}, as they create the session automatically; in that case one 
would need to use the {{--packages}} option. 

  was:
If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as 
follows:
{code}
spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()
{code}
the SparkSession gets created but there are no package download logs printed, 
and if I use the loaded classes, Mongo connector in this case, but it's the 
same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
missing classes.

If I use the config file {{conf/spark-defaults.comf}}, command line option 
{{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
{code}

The above is in Python but I've seen the behavior in other languages, though, I 
didn't check R. 

I also have seen it in older Spark versions.

It seems that this is the only config key that doesn't work for me via the 
{{SparkSession}} builder config.


> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get 

[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129502#comment-16129502
 ] 

Jakub Nowacki commented on SPARK-21752:
---

I'm not sure if [SPARK-11520] would help in this case. I'll correct the 
description to note that this is related to creating a SparkSession from scratch.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.






[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129475#comment-16129475
 ] 

Jakub Nowacki edited comment on SPARK-21752 at 8/16/17 9:46 PM:


I'm aware you cannot do it with the pyspark command, as a session is 
automatically created there. 

We use this SparkSession creation with Jupyter notebooks or with workflow 
scripts (e.g. run by Airflow), so this is pretty much bare Python with pyspark 
being a module; much like creating a SparkSession in a Scala object's main 
function. I'm assuming you don't have a SparkSession running beforehand.

As for the double parenthesis in the first one, yes, true, sorry. But it doesn't 
work regardless, as the extra parenthesis just gives you a syntax error.


was (Author: jsnowacki):
OK so you don't need session creation with pyspark command line. We use this 
spark session creation with Jupyter notebook, so this is pretty much bare 
Python with pyspark being a module; much like creating SparkSession in Scala 
object's main function. I'm assuming you don't have SparkSession running 
beforehand.

As for the double parenthesis in the first one, yes true, sorry. But it doesn't 
work nonetheless as the parenthesis gives you just a syntax error.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.






[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129475#comment-16129475
 ] 

Jakub Nowacki commented on SPARK-21752:
---

OK, so you don't need session creation with the pyspark command line. We use 
this SparkSession creation with Jupyter notebooks, so this is pretty much bare 
Python with pyspark being a module; much like creating a SparkSession in a 
Scala object's main function. I'm assuming you don't have a SparkSession 
running beforehand.

As for the double parenthesis in the first one, yes, true, sorry. But it doesn't 
work regardless, as the extra parenthesis just gives you a syntax error.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.






[jira] [Updated] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakub Nowacki updated SPARK-21752:
--
Description: 
If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as 
follows:
{code}
spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()
{code}
the SparkSession gets created but there are no package download logs printed, 
and if I use the loaded classes, Mongo connector in this case, but it's the 
same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
missing classes.

If I use the config file {{conf/spark-defaults.conf}} or the command line option 
{{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
{code}

The above is in Python but I've seen the behavior in other languages, though, I 
didn't check R. 

I also have seen it in older Spark versions.

It seems that this is the only config key that doesn't work for me via the 
{{SparkSession}} builder config.

  was:
If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as 
follows:
{code}
spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()
{code}
the SparkSession gets created but there are no package download logs printed, 
and if I use the loaded classes, Mongo connector in this case, but it's the 
same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
missing classes.

If I use the config file {{conf/spark-defaults.comf}}, command line option 
{{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
{code}

The above is in Python but I've seen the behavior in other languages, though, I 
didn't check R. 

I also have seen it in older Spark versions.

It seems that this is the only config key that doesn't work for me via the 
{{SparkSession}} builder config.


> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.comf}}, command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using 

[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129466#comment-16129466
 ] 

Jakub Nowacki commented on SPARK-21752:
---

Not really, maybe you're missing {{import pyspark}} at the top, but that's it. I checked it with at least two different Spark versions set up in two different environments, and on different versions of Python, and it behaves the same, i.e. it imports the package and works fine. Note that you'd also need to change {{mongo}} in the address to a correct host, depending on your setup.
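
For reference, a minimal self-contained sketch of the failing pattern (example 1), assembled from the snippets in this thread; it assumes a MongoDB instance reachable at host {{mongo}}, and the data source name {{com.mongodb.spark.sql.DefaultSource}} is taken from the MongoDB connector documentation:
{code}
import pyspark

# Example 1: the package passed as a key-value pair to the builder,
# which this issue reports as being ignored.
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll")\
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll")\
    .getOrCreate()

# If the package was not actually resolved and put on the classpath,
# using the connector fails with java.lang.ClassNotFoundException.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
{code}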

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129432#comment-16129432
 ] 

Jakub Nowacki edited comment on SPARK-21752 at 8/16/17 9:15 PM:


[~skonto] Not sure which one you couldn't reproduce. Using {{--packages}} works fine, as I explained. The latter (example 2), i.e. a {{SparkConf}} created before the {{SparkSession}} and passed to the builder via {{.config(conf=conf)}}, works fine as well. Only the version passing key-value pairs directly to {{config}} (example 1) does not work for me. I tried it on different instances of Spark 2+, and it behaves the same.


was (Author: jsnowacki):
[~skonto] Not sure which one you couldn't reproduce. Using {{--packages}} works 
fine as I explained. Using the latter (example 2) {{SparkConf}} created  before 
{SparkSession}} and passing it to the builder via {{.config(conf=conf)}} works 
fine as well. Only the version with passing key-values directly to {{config}} 
(examle 1) does not work for me. I tried on different instances of Spark 2+, 
and it behaves the same.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129432#comment-16129432
 ] 

Jakub Nowacki commented on SPARK-21752:
---

[~skonto] Not sure which one you couldn't reproduce. Using {{--packages}} works fine, as I explained. The latter (example 2), i.e. a {{SparkConf}} created before the {{SparkSession}} and passed to the builder via {{.config(conf=conf)}}, works fine as well. Only the version passing key-value pairs directly to {{config}} (example 1) does not work for me. I tried it on different instances of Spark 2+, and it behaves the same.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129322#comment-16129322
 ] 

Jakub Nowacki commented on SPARK-21752:
---

Well, it seems so, but it is, at least logically, unclear why passing a {{SparkConf}} via the {{SparkSession}} config works, whereas using key-value pairs doesn't. It should at least be mentioned somewhere in the documentation, but 
currently neither 
[Configuration|https://spark.apache.org/docs/latest/configuration.html]  nor 
[Spark SQL 
guide|https://spark.apache.org/docs/latest/sql-programming-guide.html] says 
anything about that.
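
A quick way to check what the builder actually recorded (an illustrative snippet, not part of the original report; it assumes the {{spark}} session from the examples above) is to inspect the effective configuration of the running context:
{code}
# Print the value (if any) recorded for spark.jars.packages in the
# effective configuration of the already-running SparkContext.
print(spark.sparkContext.getConf().get("spark.jars.packages", "<not set>"))
{code}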

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes, Mongo connector in this case, but it's the 
> same for other packages, I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line option 
> {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
> org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though, 
> I didn't check R. 
> I also have seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-16 Thread Jakub Nowacki (JIRA)
Jakub Nowacki created SPARK-21752:
-

 Summary: Config spark.jars.packages is ignored in SparkSession 
config
 Key: SPARK-21752
 URL: https://issues.apache.org/jira/browse/SPARK-21752
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jakub Nowacki


If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as 
follows:
{code}
spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()
{code}
the SparkSession gets created, but no package download logs are printed, and when I use the loaded classes (the Mongo connector in this case, but it is the same for other packages) I get {{java.lang.ClassNotFoundException}} for the missing classes.

If I use the config file {{conf/spark-defaults.conf}} or the command line option 
{{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using a {{SparkConf}} object works fine as well, 
e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
{code}

The above is in Python, but I've seen the same behavior in other languages, though I didn't check R.

I have also seen it in older Spark versions.

It seems that this is the only config key that doesn't work for me via the 
{{SparkSession}} builder config.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2017-03-22 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936421#comment-15936421
 ] 

Jakub Nowacki commented on SPARK-20049:
---

I did a bit more digging, and it turned out that both the writing and the reading performance were low, most likely due to the number of files per partition. Namely, every folder contained one file per partition of the saved DataFrame, which was just over 3000 files in my case. Repartitioning like:
{code}
# there is column 'date' in df
df.repartition("date").write.partitionBy("date").parquet("dest_dir")
{code}
fixes the issue, though it creates one file per partition, which is a bit too much in my case, but this can be fixed, e.g.:
{code}
from pyspark.sql.functions import hour

# there is column 'date' in df
df.repartition("date", hour("createdAt")).write.partitionBy("date").parquet("dest_dir")
{code}
which works similarly, but the files in the partition folders are smaller (see also the sketch after the list below).

So IMO there are 4 issues to address:
# for some reason writing the files out on HDFS takes a long time, which is not indicated anywhere and takes much longer than a normal write (in my case 5 minutes vs 1.5 hours)
# some form of additional progress indicator should be included somewhere in the UI, logs and/or shell output
# the suggestion to repartition before using {{partitionBy}} should be highlighted in the documentation
# maybe automatic repartitioning before saving should be considered, though this can be controversial
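
As a rough illustration of capping the number of files per {{date}} folder (an editor's sketch, not from the original report; the salt column and the value 8 are arbitrary), one can repartition on the partition column plus a synthetic salt:
{code}
from pyspark.sql.functions import floor, rand

# Spread each date's rows over (at most) files_per_folder shuffle partitions
# by salting the repartition key; the salt is dropped before writing.
files_per_folder = 8
df.withColumn("salt", floor(rand() * files_per_folder)) \
  .repartition("date", "salt") \
  .drop("salt") \
  .write.partitionBy("date").parquet("dest_dir")
{code}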

> Writing data to Parquet with partitions takes very long after the job finishes
> --
>
> Key: SPARK-20049
> URL: https://issues.apache.org/jira/browse/SPARK-20049
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian 
> GNU/Linux 8.7 (jessie)
>Reporter: Jakub Nowacki
>
> I was testing writing DataFrame to partitioned Parquet files. The command is 
> quite straightforward and the data set is really a sample from larger data 
> set in Parquet; the job is done in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job has been marked in 
> PySpark and UI as finished, the Python interpreter still was showing it as 
> busy. Indeed, when I checked the HDFS folder I noticed that the files are 
> still transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} 
> folders. 
> First of all it takes much longer than saving the same set without 
> partitioning. Second, it is done in the background, without visible progress 
> of any kind. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2017-03-21 Thread Jakub Nowacki (JIRA)
Jakub Nowacki created SPARK-20049:
-

 Summary: Writing data to Parquet with partitions takes very long 
after the job finishes
 Key: SPARK-20049
 URL: https://issues.apache.org/jira/browse/SPARK-20049
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, PySpark, SQL
Affects Versions: 2.1.0
 Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian 
GNU/Linux 8.7 (jessie)
Reporter: Jakub Nowacki


I was testing writing a DataFrame to partitioned Parquet files. The command is quite straightforward and the data set is really a sample from a larger data set in Parquet; the job is done in PySpark on YARN and written to HDFS:
{code}
# there is column 'date' in df
df.write.partitionBy("date").parquet("dest_dir")
{code}
The reading part took as long as usual, but after the job had been marked as finished in PySpark and the UI, the Python interpreter was still showing it as busy. Indeed, when I checked the HDFS folder I noticed that the files were still being transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} folders.

First of all, it takes much longer than saving the same data set without partitioning. Second, it is done in the background, without any visible indication of progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-04 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720710#comment-15720710
 ] 

Jakub Nowacki edited comment on SPARK-18699 at 12/4/16 10:26 PM:
-

While I don't argue that some other packages have similar behaviour, I think the PERMISSIVE mode should be, well, as permissive as possible, since CSVs have very few standards and no types. In my case I had just one odd value in an almost 1 TB data set and the job crashed at the very end, after about an hour. To work around the issue one needs to manually parse each line, which is not the end of the world, but I wanted to use the CSV reader exactly for the confidence of not writing extra code. IMO the mode for error detection should be FAILFAST. Moreover, if I really need to check the data, I read it differently anyway.
BTW thanks for looking into this.


was (Author: jsnowacki):
While I don't argue that some other packages have similar behaviour, I think 
the PERMISSIVE mode should be, well, as permissive as possible, since CSVs have 
very little standards and no types. In ma case I had just one odd value in 
almost 1 TB set and the job crushed at the very end after about an hour. To go 
around the issue one needs to manually parse each line, which is not the end of 
the world, but I wanted to use CSV reader exactly for the confidence of not 
writing extra code. IMO the mode for error detection should be FAILFAST. 
Moreover, if I really need to check the data, I read it differently anyway.

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If CSV is read and the schema contains any other type than String, exception 
> is thrown when the string value in CSV is malformed; e.g. if the timestamp 
> does not match the defined one, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or 

[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-04 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720710#comment-15720710
 ] 

Jakub Nowacki commented on SPARK-18699:
---

While I don't argue that some other packages have similar behaviour, I think the PERMISSIVE mode should be, well, as permissive as possible, since CSVs have very few standards and no types. In my case I had just one odd value in an almost 1 TB data set and the job crashed at the very end, after about an hour. To work around the issue one needs to manually parse each line, which is not the end of the world, but I wanted to use the CSV reader exactly for the confidence of not writing extra code. IMO the mode for error detection should be FAILFAST. Moreover, if I really need to check the data, I read it differently anyway.
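
For readers hitting the same problem, the manual workaround mentioned above can be sketched roughly as follows (an illustrative sketch only; the column names, file path, and the existing {{spark}} session are assumptions):
{code}
from pyspark.sql.types import StructType, StructField, StringType

# Read every column as a string so that no row can fail type conversion ...
raw_schema = StructType([
    StructField("id", StringType(), True),
    StructField("created_at", StringType(), True),
])
raw = spark.read.schema(raw_schema).csv("data.csv")

# ... and cast afterwards: values that cannot be parsed become null
# instead of throwing and failing the whole job.
df = raw.withColumn("created_at", raw["created_at"].cast("timestamp"))
{code}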

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If CSV is read and the schema contains any other type than String, exception 
> is thrown when the string value in CSV is malformed; e.g. if the timestamp 
> does not match the defined one, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or drop the line, but instead they kill the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-04 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719887#comment-15719887
 ] 

Jakub Nowacki commented on SPARK-18699:
---

Yes, my understanding was that it should nullify the value if it fails to parse it in PERMISSIVE mode, or drop the whole row (line) in DROPMALFORMED mode, as described in the docs: 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader, i.e.:
* mode (default PERMISSIVE): allows a mode for dealing with corrupt records 
during parsing.
** PERMISSIVE : sets other fields to null when it meets a corrupted record. 
When a schema is set by user, it sets null for extra fields.
** DROPMALFORMED : ignores the whole corrupted records.
** FAILFAST : throws an exception when it meets corrupted records.
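
For reference, the mode is selected through the reader's {{mode}} option; a minimal sketch (schema, path and the existing {{spark}} session are placeholders) exercising the documented behaviour would look like:
{code}
from pyspark.sql.types import StructType, StructField, StringType, DateType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("birth", DateType(), True),
])

# DROPMALFORMED is documented to skip rows that cannot be parsed into the
# schema; this issue reports that typed columns throw instead.
df = spark.read.schema(schema).option("mode", "DROPMALFORMED").csv("data.csv")
{code}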

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>
> If CSV is read and the schema contains any other type than String, exception 
> is thrown when the string value in CSV is malformed; e.g. if the timestamp 
> does not match the defined one, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or drop the line, but instead they kill the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2016-12-03 Thread Jakub Nowacki (JIRA)
Jakub Nowacki created SPARK-18699:
-

 Summary: Spark CSV parsing types other than String throws 
exception when malformed
 Key: SPARK-18699
 URL: https://issues.apache.org/jira/browse/SPARK-18699
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Jakub Nowacki


If a CSV is read and the schema contains any type other than String, an exception is thrown when the string value in the CSV is malformed; e.g. if a timestamp does not match the defined format:
{code}
Caused by: java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:143)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
at scala.util.Try.getOrElse(Try.scala:79)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
at 
org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
... 8 more
{code}

It behaves similarly with Integer and Long types, from what I've seen.

To my understanding, the PERMISSIVE and DROPMALFORMED modes should just null the value or drop the line, respectively, but instead they kill the job.
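
A minimal local-mode reproduction sketch (the file path, column names and date value are illustrative; an existing {{spark}} session is assumed):
{code}
from pyspark.sql.types import StructType, StructField, StringType, DateType

# Two rows; the second has a malformed value for the typed column.
with open("/tmp/malformed.csv", "w") as f:
    f.write("a,2016-12-01\nb,not-a-date\n")

schema = StructType([
    StructField("name", StringType(), True),
    StructField("birth", DateType(), True),
])

# With the default PERMISSIVE mode this is reported to fail with
# java.lang.IllegalArgumentException once an action runs, instead of
# nulling the malformed value.
spark.read.schema(schema).csv("/tmp/malformed.csv").show()
{code}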



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org