[jira] [Assigned] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows
[ https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakub Nowacki reassigned SPARK-22495:
-------------------------------------

Assignee: Jakub Nowacki

> Fix setup of SPARK_HOME variable on Windows
> -------------------------------------------
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Windows
> Affects Versions: 2.3.0
> Reporter: Hyukjin Kwon
> Assignee: Jakub Nowacki
> Priority: Minor
>
> On Windows, pip-installed PySpark is unable to find the Spark home. A change has already been proposed, with sufficient details and discussion, in https://github.com/apache/spark/pull/19370 and SPARK-18136

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22212) Some SQL functions in Python fail with string column name
[ https://issues.apache.org/jira/browse/SPARK-22212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakub Nowacki resolved SPARK-22212.
-----------------------------------
Resolution: Later

Keeping the resolution on hold until consensus on API unification is reached.

> Some SQL functions in Python fail with string column name
> ----------------------------------------------------------
>
> Key: SPARK-22212
> URL: https://issues.apache.org/jira/browse/SPARK-22212
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.2.0
> Reporter: Jakub Nowacki
> Priority: Minor
>
> Most of the functions in {{pyspark.sql.functions}} accept both a column-name string and a {{Column}} object. However, some functions, like {{trim}}, accept only a {{Column}}. See the code below for an example.
> {code}
> >>> import pyspark.sql.functions as func
> >>> df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])
> >>> df.select(func.trim(df["text"])).show()
> +----------+
> |trim(text)|
> +----------+
> |         a|
> |         b|
> |         c|
> |         d|
> |         e|
> +----------+
> >>> df.select(func.trim("text")).show()
> [...]
> Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.trim. Trace:
> py4j.Py4JException: Method trim([class java.lang.String]) does not exist
>         at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>         at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
>         at py4j.Gateway.invoke(Gateway.java:274)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:214)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> This is because most of the Python function wrappers map a column name to a {{Column}}, but functions created via {{_create_function}} pass arguments through as-is if they are not a {{Column}}.
> I am preparing a PR with the proposed fix.
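The fix being prepared amounts to coercing a string argument to a {{Column}} before the Py4J call. As a rough illustration of that dispatch pattern, here is a minimal plain-Python sketch; the {{Column}} class and {{_to_col}} helper are simplified stand-ins for illustration only (not pyspark's real internals), so no Spark installation is needed to run it:

```python
# Sketch of the string-or-Column dispatch most pyspark.sql.functions
# wrappers perform. "Column" and "_to_col" are hypothetical stand-ins.

class Column:
    def __init__(self, name):
        self.name = name

def _to_col(col):
    # Accept either a Column or a column-name string, coercing the
    # latter, as the proposed fix would do inside _create_function.
    return col if isinstance(col, Column) else Column(col)

def trim(col):
    # With the coercion in place, trim("text") and trim(df["text"])
    # take the same path before reaching the JVM side.
    col = _to_col(col)
    return "trim({})".format(col.name)
```

With such a coercion inside the function factory, both the string and the {{Column}} call forms would resolve to the same JVM method, avoiding the {{Py4JException}} above.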
[jira] [Created] (SPARK-22212) Some SQL functions in Python fail with string column name
Jakub Nowacki created SPARK-22212:
----------------------------------

Summary: Some SQL functions in Python fail with string column name
Key: SPARK-22212
URL: https://issues.apache.org/jira/browse/SPARK-22212
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Jakub Nowacki
Priority: Minor

Most of the functions in {{pyspark.sql.functions}} accept both a column-name string and a {{Column}} object. However, some functions, like {{trim}}, accept only a {{Column}}. See the code below for an example.

{code}
>>> import pyspark.sql.functions as func
>>> df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])
>>> df.select(func.trim(df["text"])).show()
+----------+
|trim(text)|
+----------+
|         a|
|         b|
|         c|
|         d|
|         e|
+----------+
>>> df.select(func.trim("text")).show()
[...]
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.trim. Trace:
py4j.Py4JException: Method trim([class java.lang.String]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
        at py4j.Gateway.invoke(Gateway.java:274)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
{code}

This is because most of the Python function wrappers map a column name to a {{Column}}, but functions created via {{_create_function}} pass arguments through as-is if they are not a {{Column}}.

I am preparing a PR with the proposed fix.
[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows
[ https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16183077#comment-16183077 ] Jakub Nowacki commented on SPARK-18136:
---------------------------------------

PR 19370 (https://github.com/apache/spark/pull/19370) fixes the {{SPARK_HOME}} issue using the {{find_spark_home.py}} script. It is perhaps not the most elegant way, but it is simple. I think in the long run it would be better to move to a Python packaging mechanism like {{console_scripts}} or similar, as it will provide better multiplatform support; see https://packaging.python.org/tutorials/distributing-packages/#scripts and https://setuptools.readthedocs.io/en/latest/setuptools.html#automatic-script-creation. I'll create a separate issue with an improvement proposal.

> Make PySpark pip install works on windows
> ------------------------------------------
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: holdenk
> Fix For: 2.2.1, 2.3.0, 2.1.3
>
> Make sure that pip installer for PySpark works on windows
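To make the {{console_scripts}} idea concrete, here is a hedged sketch of what such a declaration could look like; the entry-point names and module paths ({{pyspark.launcher:...}}) are hypothetical, not pyspark's actual layout. Passing this mapping to {{setuptools.setup(entry_points=...)}} would make pip generate a native wrapper per entry (an {{.exe}} shim on Windows), so path resolution happens in Python rather than in a {{.cmd}} file:

```python
# Hypothetical sketch of declaring Spark launchers as setuptools
# console_scripts entry points. Each spec has the form
# "script-name = importable.module:callable".

ENTRY_POINTS = {
    "console_scripts": [
        "pyspark = pyspark.launcher:main_pyspark",      # hypothetical
        "spark-submit = pyspark.launcher:main_submit",  # hypothetical
    ],
}

def parse_entry_point(spec):
    """Split a 'name = module:attr' spec into its three parts."""
    name, _, target = (part.strip() for part in spec.partition("="))
    module, _, attr = target.partition(":")
    return name, module, attr
```

The generated wrappers locate the package through normal Python imports, which sidesteps the {{%~dp0}} resolution problems discussed below.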
[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows
[ https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179746#comment-16179746 ] Jakub Nowacki commented on SPARK-18136:
---------------------------------------

I can come back to this issue this Wednesday, I think. I did some preliminary tests with {{find_spark_home.py}}, but I won't have time to sit down to it until Wednesday.
[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows
[ https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16177795#comment-16177795 ] Jakub Nowacki commented on SPARK-18136:
---------------------------------------

I've looked into it again and noticed the Bash script {{find_spark_home}}, which is used in the Bash version of the {{pyspark}} command. The Python script {{find_spark_home.py}} seems to return the correct SPARK_HOME path on Windows, so all the cmd files should be altered to use it instead of {{%~dp0}}. I'll look into it when I have time, maybe next week, and propose something similar to the {{find_spark_home}} script approach.
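A minimal sketch of the idea, mirroring what {{find_spark_home.py}} does conceptually (this is an illustration of the approach, not the script's actual code): honour an explicit {{SPARK_HOME}}, otherwise derive it from where the installed pyspark package lives.

```python
# Conceptual sketch of find-spark-home resolution for pip/conda installs.
import os

def find_spark_home(environ=os.environ):
    # 1. An explicitly set SPARK_HOME always wins.
    explicit = environ.get("SPARK_HOME")
    if explicit:
        return explicit
    # 2. Otherwise, for a pip-installed pyspark, the distribution
    #    (bin/, jars/, ...) sits inside the package directory itself.
    try:
        import pyspark
        return os.path.dirname(os.path.abspath(pyspark.__file__))
    except ImportError:
        return None
```

Because the package directory is found via Python's own import machinery, this works the same whether the launcher was invoked through a symlink, a Scripts stub, or a full path.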
[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows
[ https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175303#comment-16175303 ] Jakub Nowacki commented on SPARK-18136:
---------------------------------------

I've tried using the Windows command {{mklink}} to create symbolic links, but {{%~dp0}} still seems to resolve to the Scripts folder {{C:\Tools\Anaconda3\Scripts\}}.
[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows
[ https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175276#comment-16175276 ] Jakub Nowacki commented on SPARK-18136:
---------------------------------------

The [PR|https://github.com/apache/spark/pull/19310] fixes how {{spark-class2.cmd}} looks for the jars directory on Windows. It fails to find the jars and start the JVM because the condition for the env variable {{SPARK_JARS_DIR}} looks for {{%SPARK_HOME%\RELEASE}}, which is not included in the {{pip/conda}} build. Instead, it should look for {{%SPARK_HOME%\jars}}, which it refers to later.

The above fixes the errors while importing {{pyspark}} into Python and creating a SparkSession, but there is still an issue with calling {{pyspark.cmd}}. Namely, a normal call on the command line, without a path specification, fails with {{System cannot find the path specified.}}. It is likely due to the script link being resolved to the Scripts folder in Anaconda, e.g. {{C:\Tools\Anaconda3\Scripts\pyspark.cmd}}. If the script is run via the full path to the PySpark package, e.g. {{\Tools\Anaconda3\Lib\site-packages\pyspark\bin\pyspark.cmd}}, it works fine. This is likely because {{SPARK_HOME}} is resolved as {{set SPARK_HOME=%~dp0..}}, which in the case of the system call resolves (likely) to {{\Tools\Anaconda3\}} when it should resolve to {{\Tools\Anaconda3\Lib\site-packages\pyspark\}}. Since I don't know CMD scripting that well, I haven't found a solution to this issue yet, apart from the workaround, i.e. calling it with the full (direct) path.
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16138948#comment-16138948 ] Jakub Nowacki commented on SPARK-21752:
---------------------------------------

OK, I did one more extra test and, indeed, on the newest version 2.2.0 (and also 2.1.1) all three configs work fine; though I'm pretty sure one did not work at least once, but maybe that was a coincidence. I investigated further, and when I rolled back to 2.0.2, which I have on a different setup, only {{PYSPARK_SUBMIT_ARGS}} worked reliably and the other ones didn't; maybe with this version the {{config}} ones work non-deterministically. Thus, this seems to be an issue for versions up to 2.0.2; for the newer ones it seems to work, though I am not sure if it does all the time.

The first question is whether there is a way to check that what now works for 2.1.1 and 2.2.0 cannot also stop working on occasion. Also, should we still have some form of documentation describing the safer way of configuration?

> Config spark.jars.packages is ignored in SparkSession config
> -------------------------------------------------------------
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using the {{SparkSession}} builder as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
>     .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
>     .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
>     .getOrCreate()
> {code}
> the SparkSession gets created, but there are no package download logs printed, and if I use the loaded classes (the Mongo connector in this case, but it's the same for other packages), I get {{java.lang.ClassNotFoundException}} for the missing classes.
> If I use the config file {{conf/spark-defaults.conf}}, the command line option {{--packages}}, or the {{PYSPARK_SUBMIT_ARGS}} environment variable, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config(conf=conf)\
>     .getOrCreate()
> {code}
> The above is in Python, but I've seen the behavior in other languages, though I didn't check R. I have also seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the {{SparkSession}} builder config.
> Note that this is related to creating a new {{SparkSession}}, as getting new packages into an existing {{SparkSession}} indeed doesn't make sense. Thus this will only work with bare Python, Scala or Java, and not in {{pyspark}} or {{spark-shell}}, as they create the session automatically; in this case one would need to use the {{--packages}} option.
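Since {{spark.jars.packages}} must reach {{spark-submit}} before the JVM starts, a practical pattern for interactive sessions is to assemble {{PYSPARK_SUBMIT_ARGS}} programmatically before the first SparkSession is created. The helper below is my own sketch, built around the same {{--packages ... pyspark-shell}} format shown in the issue description; the function name is not part of any Spark API:

```python
# Hypothetical helper: build PYSPARK_SUBMIT_ARGS from Maven coordinates.
import os

def build_submit_args(packages):
    """Return a PYSPARK_SUBMIT_ARGS value pulling the given Maven coords."""
    return "--packages {} pyspark-shell".format(",".join(packages))

# This must be set before the first SparkSession/SparkContext launches
# the JVM; setting it afterwards has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"]
)
```

The key point is ordering: the environment variable influences how the gateway JVM is launched, which is why the builder's {{config("spark.jars.packages", ...)}} call comes too late.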
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135015#comment-16135015 ] Jakub Nowacki commented on SPARK-21752:
---------------------------------------

[~srowen] Do you think we could create some sort of guidelines, in the documentation or as a separate document, regarding the usage of configuration, especially in the notebook environment, as I suggested above? Generally, we could e.g. say that most of the config should go in the env variable, but essentially highlight what the safest way to use it is.
[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131945#comment-16131945 ] Jakub Nowacki edited comment on SPARK-21752 at 8/18/17 9:11 AM:
---------------------------------------------------------------

[~skonto] What you are doing is in fact starting pyspark ({{shell.py}}) manually inside Jupyter, which creates the SparkSession, so what I wrote above doesn't have any effect, as it is the same as running the pyspark command. A more Pythonic way of installing it is adding the modules from the bundled {{python}} folder to PYTHONPATH (e.g. http://sigdelta.com/blog/how-to-install-pyspark-locally/), which is very similar to what happens when you use a {{pip}}/{{conda}} install. Also, I am referring to a plain Python kernel in Jupyter (or any other Python interpreter) started without executing {{shell.py}}. BTW you can create kernels in Jupyter, e.g. https://gist.github.com/cogfor/903c911c9b1963dcd530bbc0b9d9f0ce, which will work as a pyspark shell, similar to your setup.

While I understand that using {{master}} or {{spark.jars.packages}} in the config is not a desired behavior, I'd like to work out a preferred way of passing configuration options to SparkSession, especially for notebook users. Also, my experience is that many options other than {{master}} and {{spark.jars.packages}} work quite well with the SparkSession config, e.g. {{spark.executor.memory}} etc., which sometimes need to be tuned to run specific jobs; in generic jobs I always rely on the defaults, which I often tune for a specific cluster.

So my question is: in case we need to add some custom configuration to a PySpark submission, should interactive Python users:
# add *all* configuration to {{PYSPARK_SUBMIT_ARGS}}
# add some configuration, like {{master}} or {{packages}}, to {{PYSPARK_SUBMIT_ARGS}}, while others can be passed in the SparkSession config, ideally also documenting which ones they are
# or should we fix something in SparkSession creation to make the SparkSession config equally effective to {{PYSPARK_SUBMIT_ARGS}}

Also, sometimes we know that e.g. a job (not interactive, run by {{spark-submit}}) requires more executor memory or a different number of partitions. Could we use the SparkSession config in this case, or should each of these tuned parameters be passed via {{spark-submit}} arguments?

I'm happy to extend the documentation with such a section for Python users, as I don't think it's clear currently, and it would be very useful for them.
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131945#comment-16131945 ] Jakub Nowacki commented on SPARK-21752:
---------------------------------------

[~skonto] What you are doing is in fact starting pyspark ({{shell.py}}) manually inside Jupyter, which creates the SparkSession, so what I wrote above doesn't have any effect, as it is the same as running the pyspark command. A more Pythonic way of installing it is adding the modules from the bundled {{python}} folder to PYTHONPATH (e.g. http://sigdelta.com/blog/how-to-install-pyspark-locally/), which is very similar to what happens when you use a {{pip}}/{{conda}} install. Also, I am referring to a plain Python kernel in Jupyter (or any other Python interpreter) started without executing {{shell.py}}. BTW you can create kernels in Jupyter, e.g. https://gist.github.com/cogfor/903c911c9b1963dcd530bbc0b9d9f0ce, which will work as a pyspark shell, similar to your setup.

While I understand that using {{master}} or {{spark.jars.packages}} in the config is not a desired behavior, I'd like to work out a preferred way of passing configuration options to SparkSession, especially for notebook users. Also, my experience is that many options other than {{master}} and {{spark.jars.packages}} work quite well with the SparkSession config, e.g. {{spark.executor.memory}} etc., which sometimes need to be tuned to run specific jobs; in generic jobs I always rely on the defaults, which I often tune for a specific cluster.

So my question is: in case we need to add some custom configuration to a PySpark submission, should interactive Python users:
# add *all* configuration to {{PYSPARK_SUBMIT_ARGS}}
# add some configuration, like {{master}} or {{packages}}, to {{PYSPARK_SUBMIT_ARGS}}, while others can be passed in the SparkSession config, ideally also documenting which ones they are
# or should we fix something in SparkSession creation to make the SparkSession config equally effective to {{PYSPARK_SUBMIT_ARGS}}

Also, sometimes we know that e.g. a job (not interactive, run by {{spark-submit}}) requires more executor memory or a different number of partitions. Could we use the SparkSession config in this case, or should each of these tuned parameters be passed via {{spark-submit}} arguments?

I'm happy to extend the documentation with such a section for Python users, as I don't think it's clear currently, and it would be very useful for them.
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130282#comment-16130282 ] Jakub Nowacki commented on SPARK-21752:
---------------------------------------

[~skonto] Well, I'm not sure where you're failing here. If you want to get PySpark installed with a vanilla Python distribution, you can do {{pip install pyspark}} or {{conda install -c conda-forge pyspark}}. Other than that, the above scripts are complete, bar the {{import pyspark}}, as I mentioned before. Below I give a slightly more complete example with the env variable:

{code}
import os
import pyspark

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()

l = [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)]
people = spark.createDataFrame(l, ["name", "age"])

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load() \
    .show()
{code}

and with the {{SparkConf}} approach:

{code}
import pyspark

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()

l = [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)]
people = spark.createDataFrame(l, ["name", "age"])

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load() \
    .show()
{code}

and with the plain {{SparkSession}} config:

{code}
import pyspark

spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()

l = [("Bilbo Baggins", 50), ("Gandalf", 1000), ("Thorin", 195), ("Balin", 178), ("Kili", 77), ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", None)]
people = spark.createDataFrame(l, ["name", "age"])

people.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .save()

spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .load() \
    .show()
{code}

In my case the first two work as expected, and the last one fails with {{ClassNotFoundException}}.
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130274#comment-16130274 ] Jakub Nowacki commented on SPARK-21752: --- OK, I get the point. I think we should only consider this in an interactive, notebook-based environment. I certainly don't set the master when running via {{spark-submit}}, and setting packages internally should also be discouraged there. The documentation should be clearer about what can and what cannot be used. Also, interactive environments like Jupyter should be treated as an exception, or a clearer setup description should be provided. Finally, when the above setting is used with packages, there is no warning that the option is really ignored; maybe one should be added, similar to the one about reusing an existing SparkSession, i.e. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L896
> Config spark.jars.packages is ignored in SparkSession config
> ------------------------------------------------------------
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
>     .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
>     .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
>     .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, and if I use the loaded classes, Mongo connector in this case, but it's the same for other packages, I get {{java.lang.ClassNotFoundException}} for the missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config(conf=conf)\
>     .getOrCreate()
> {code}
> The above is in Python but I've seen the behavior in other languages, though I didn't check R. I have also seen it in older Spark versions. It seems that this is the only config key that doesn't work for me via the {{SparkSession}} builder config.
> Note that this is related to creating a new {{SparkSession}}, as getting new packages into an existing {{SparkSession}} indeed doesn't make sense. Thus this will only work with bare Python, Scala or Java, and not in {{pyspark}} or {{spark-shell}}, as they create the session automatically; in this case one would need to use the {{--packages}} option.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
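The comment above suggests warning users when {{spark.jars.packages}} can no longer take effect, similar to the existing warning about reusing a SparkSession. A minimal sketch of such a check in plain Python; the helper name and the triggering condition are hypothetical, not Spark's actual implementation:

```python
import warnings

def warn_ignored_builder_options(options, session_already_running):
    """Hypothetical helper: warn when spark.jars.packages is set through the
    SparkSession builder at a point where it can no longer take effect."""
    if session_already_running and "spark.jars.packages" in options:
        warnings.warn(
            "spark.jars.packages set via SparkSession.builder is ignored; "
            "use spark-submit --packages or spark-defaults.conf instead.")
```

A real version would live where the builder merges options into an existing session, alongside the warning the comment links to.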
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130150#comment-16130150 ] Jakub Nowacki commented on SPARK-21752: --- [~skonto] Jupyter does not pass many environment variables; that is why they are often set inside the notebook, as I did above. [~skonto] [~jerryshao] I still don't fully get the discussion here. As described in the [Spark SQL docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession], you can normally create a SparkSession and, to my knowledge, that is done in practically all Spark languages as described in the docs. Indeed, when using command line shells like {{pyspark}} you should not do that, but it is OK in pure Python. Also, my practice shows that such SparkSession creation works well and is deterministic.
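Because Jupyter does not inherit a login shell's environment, the {{PYSPARK_SUBMIT_ARGS}} route is usually wired up inside the notebook itself, before pyspark is imported. A small sketch of that pattern; the helper `submit_args_for` is an illustrative name, not part of any API:

```python
import os

def submit_args_for(packages):
    """Build a PYSPARK_SUBMIT_ARGS value so that pyspark resolves the given
    Maven coordinates when it launches the JVM. Must run before `import pyspark`."""
    return "--packages {} pyspark-shell".format(",".join(packages))

# Set the variable first; `import pyspark` and session creation follow afterwards.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args_for(
    ["org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"])
```

The ordering is the whole point: once pyspark has launched the driver JVM, changing the variable has no effect.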
[jira] [Updated] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakub Nowacki updated SPARK-21752: -- Description: If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as follows:
{code}
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()
{code}
the SparkSession gets created but there are no package download logs printed, and if I use the loaded classes, Mongo connector in this case, but it's the same for other packages, I get {{java.lang.ClassNotFoundException}} for the missing classes. If I use the config file {{conf/spark-defaults.conf}} or the command line option {{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()
{code}
The above is in Python but I've seen the behavior in other languages, though I didn't check R. I have also seen it in older Spark versions. It seems that this is the only config key that doesn't work for me via the {{SparkSession}} builder config. Note that this is related to creating a new {{SparkSession}}, as getting new packages into an existing {{SparkSession}} indeed doesn't make sense.
Thus this will only work with bare Python, Scala or Java, and not in {{pyspark}} or {{spark-shell}}, as they create the session automatically; in this case one would need to use the {{--packages}} option.

was:

If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as follows:
{code}
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()
{code}
the SparkSession gets created but there are no package download logs printed, and if I use the loaded classes, Mongo connector in this case, but it's the same for other packages, I get {{java.lang.ClassNotFoundException}} for the missing classes. If I use the config file {{conf/spark-defaults.conf}} or the command line option {{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()
{code}
The above is in Python but I've seen the behavior in other languages, though I didn't check R. I have also seen it in older Spark versions. It seems that this is the only config key that doesn't work for me via the {{SparkSession}} builder config.
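The {{SparkConf}} route described in the report can be wrapped as a reusable helper. This is a sketch under the report's assumptions (local master, the Mongo connector coordinates above); `package_conf_pairs` and `make_session_with_packages` are hypothetical names, and the pyspark import is deferred so the pure part can be inspected without a Spark installation:

```python
def package_conf_pairs(packages, extra_conf=None):
    """Pure helper: the (key, value) pairs to place on a SparkConf."""
    pairs = [("spark.jars.packages", ",".join(packages))]
    pairs.extend((extra_conf or {}).items())
    return pairs

def make_session_with_packages(packages, extra_conf=None):
    """Build a SparkSession via config(conf=...), the variant reported to work."""
    import pyspark  # deferred: requires a Spark installation at call time
    conf = pyspark.SparkConf()
    for key, value in package_conf_pairs(packages, extra_conf):
        conf.set(key, value)
    return (pyspark.sql.SparkSession.builder
            .appName("test-mongo")
            .master("local[*]")
            .config(conf=conf)
            .getOrCreate())
```

Per the report this only helps when the session is created from scratch; in {{pyspark}} or {{spark-shell}} the pre-created session makes {{--packages}} the only option.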
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129502#comment-16129502 ] Jakub Nowacki commented on SPARK-21752: --- Not sure if [SPARK-11520] would help in this case. I'll correct the description to note that this is related to creating a SparkSession from scratch.
[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129475#comment-16129475 ] Jakub Nowacki edited comment on SPARK-21752 at 8/16/17 9:46 PM: I'm aware you cannot do it with the pyspark command, as you have a session automatically created there. We use this spark session creation with Jupyter notebooks or with workflow scripts (e.g. used in Airflow), so this is pretty much bare Python with pyspark being a module; much like creating a SparkSession in a Scala object's main function. I'm assuming you don't have a SparkSession running beforehand. As for the double parenthesis in the first one, yes, true, sorry. But it doesn't work nonetheless, as the extra parenthesis just gives you a syntax error.

was (Author: jsnowacki): OK so you don't need session creation with the pyspark command line. We use this spark session creation with Jupyter notebook, so this is pretty much bare Python with pyspark being a module; much like creating SparkSession in Scala object's main function. I'm assuming you don't have SparkSession running beforehand. As for the double parenthesis in the first one, yes true, sorry. But it doesn't work nonetheless as the parenthesis gives you just a syntax error.
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129475#comment-16129475 ] Jakub Nowacki commented on SPARK-21752: --- OK so you don't need session creation with pyspark command line. We use this spark session creation with Jupyter notebook, so this is pretty much bare Python with pyspark being a module; much like creating SparkSession in Scala object's main function. I'm assuming you don't have SparkSession running beforehand. As for the double parenthesis in the first one, yes true, sorry. But it doesn't work nonetheless as the parenthesis gives you just a syntax error. > Config spark.jars.packages is ignored in SparkSession config > > > Key: SPARK-21752 > URL: https://issues.apache.org/jira/browse/SPARK-21752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jakub Nowacki > > If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder > as follows: > {code} > spark = pyspark.sql.SparkSession.builder\ > .appName('test-mongo')\ > .master('local[*]')\ > .config("spark.jars.packages", > "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\ > .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \ > .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \ > .getOrCreate() > {code} > the SparkSession gets created but there are no package download logs printed, > and if I use the loaded classes, Mongo connector in this case, but it's the > same for other packages, I get {{java.lang.ClassNotFoundException}} for the > missing classes. > If I use the config file {{conf/spark-defaults.comf}}, command line option > {{--packages}}, e.g.: > {code} > import os > os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages > org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' > {code} > it works fine. 
[jira] [Updated] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jakub Nowacki updated SPARK-21752: -- Description: If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as follows:
{code}
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()
{code}
the SparkSession gets created but there are no package download logs printed, and if I use the loaded classes, Mongo connector in this case, but it's the same for other packages, I get {{java.lang.ClassNotFoundException}} for the missing classes. If I use the config file {{conf/spark-defaults.conf}} or the command line option {{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()
{code}
The above is in Python but I've seen the behavior in other languages, though I didn't check R. I have also seen it in older Spark versions. It seems that this is the only config key that doesn't work for me via the {{SparkSession}} builder config.
was:

If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder as follows:
{code}
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()
{code}
the SparkSession gets created but there are no package download logs printed, and if I use the loaded classes, Mongo connector in this case, but it's the same for other packages, I get {{java.lang.ClassNotFoundException}} for the missing classes. If I use the config file {{conf/spark-defaults.conf}} or the command line option {{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine. Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()
{code}
The above is in Python but I've seen the behavior in other languages, though I didn't check R. I have also seen it in older Spark versions. It seems that this is the only config key that doesn't work for me via the {{SparkSession}} builder config.
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129466#comment-16129466 ] Jakub Nowacki commented on SPARK-21752: --- Not really, maybe you're missing {{import pyspark}} at the top but that's it. I checked it with at least two different Spark versions set up in two different environments, on other versions of Python and it behaves the same, i.e. imports the package and works fine. Note that you'd need to change the {{mongo}} in the address to a correct one as well, depending on your setup. > Config spark.jars.packages is ignored in SparkSession config > > > Key: SPARK-21752 > URL: https://issues.apache.org/jira/browse/SPARK-21752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jakub Nowacki > > If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder > as follows: > {code} > spark = pyspark.sql.SparkSession.builder\ > .appName('test-mongo')\ > .master('local[*]')\ > .config("spark.jars.packages", > "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\ > .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \ > .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \ > .getOrCreate() > {code} > the SparkSession gets created but there are no package download logs printed, > and if I use the loaded classes, Mongo connector in this case, but it's the > same for other packages, I get {{java.lang.ClassNotFoundException}} for the > missing classes. > If I use the config file {{conf/spark-defaults.comf}}, command line option > {{--packages}}, e.g.: > {code} > import os > os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages > org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' > {code} > it works fine. 
[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129432#comment-16129432 ] Jakub Nowacki edited comment on SPARK-21752 at 8/16/17 9:15 PM: [~skonto] Not sure which one you couldn't reproduce. Using {{--packages}} works fine as I explained. Using the latter (example 2), a {{SparkConf}} created before {{SparkSession}} and passed to the builder via {{.config(conf=conf)}}, works fine as well. Only the version passing key-values directly to {{config}} (example 1) does not work for me. I tried on different instances of Spark 2+, and it behaves the same.

was (Author: jsnowacki): [~skonto] Not sure which one you couldn't reproduce. Using {{--packages}} works fine as I explained. Using the latter (example 2) {{SparkConf}} created before {SparkSession}} and passing it to the builder via {{.config(conf=conf)}} works fine as well. Only the version with passing key-values directly to {{config}} (example 1) does not work for me. I tried on different instances of Spark 2+, and it behaves the same.
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129432#comment-16129432 ] Jakub Nowacki commented on SPARK-21752: --- [~skonto] Not sure which one you couldn't reproduce. Using {{--packages}} works fine as I explained. Using the latter (example 2) {{SparkConf}} created before {SparkSession}} and passing it to the builder via {{.config(conf=conf)}} works fine as well. Only the version with passing key-values directly to {{config}} (examle 1) does not work for me. I tried on different instances of Spark 2+, and it behaves the same. > Config spark.jars.packages is ignored in SparkSession config > > > Key: SPARK-21752 > URL: https://issues.apache.org/jira/browse/SPARK-21752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jakub Nowacki > > If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder > as follows: > {code} > spark = pyspark.sql.SparkSession.builder\ > .appName('test-mongo')\ > .master('local[*]')\ > .config("spark.jars.packages", > "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"))\ > .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \ > .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \ > .getOrCreate() > {code} > the SparkSession gets created but there are no package download logs printed, > and if I use the loaded classes, Mongo connector in this case, but it's the > same for other packages, I get {{java.lang.ClassNotFoundException}} for the > missing classes. > If I use the config file {{conf/spark-defaults.comf}}, command line option > {{--packages}}, e.g.: > {code} > import os > os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages > org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell' > {code} > it works fine. 
> Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config(conf=conf)\
>     .getOrCreate()
> {code}
> The above is in Python, but I've seen the same behaviour in other languages, though I didn't check R.
> I have also seen it in older Spark versions.
> It seems to be the only config key that doesn't work for me via the {{SparkSession}} builder config.
[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129322#comment-16129322 ] Jakub Nowacki commented on SPARK-21752: --- Well, it seems so, but it is, at least logically, unclear why passing a {{SparkConf}} via the {{SparkSession}} config works, whereas using key-value pairs doesn't. It should at least be mentioned somewhere in the documentation, but currently neither [Configuration|https://spark.apache.org/docs/latest/configuration.html] nor the [Spark SQL guide|https://spark.apache.org/docs/latest/sql-programming-guide.html] says anything about it.
> Config spark.jars.packages is ignored in SparkSession config
> -
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Jakub Nowacki
>
> If I set the config key {{spark.jars.packages}} using the {{SparkSession}} builder as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
>     .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
>     .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
>     .getOrCreate()
> {code}
> the SparkSession gets created, but no package download logs are printed, and when I use the loaded classes (the Mongo connector in this case, though it is the same for other packages) I get {{java.lang.ClassNotFoundException}} for the missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine.
> Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config(conf=conf)\
>     .getOrCreate()
> {code}
> The above is in Python, but I've seen the same behaviour in other languages, though I didn't check R.
> I have also seen it in older Spark versions.
> It seems to be the only config key that doesn't work for me via the {{SparkSession}} builder config.
[jira] [Created] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config
Jakub Nowacki created SPARK-21752: --- Summary: Config spark.jars.packages is ignored in SparkSession config Key: SPARK-21752 URL: https://issues.apache.org/jira/browse/SPARK-21752 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Jakub Nowacki
If I set the config key {{spark.jars.packages}} using the {{SparkSession}} builder as follows:
{code}
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
    .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
    .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
    .getOrCreate()
{code}
the SparkSession gets created, but no package download logs are printed, and when I use the loaded classes (the Mongo connector in this case, though it is the same for other packages) I get {{java.lang.ClassNotFoundException}} for the missing classes.
If I use the config file {{conf/spark-defaults.conf}} or the command line option {{--packages}}, e.g.:
{code}
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
{code}
it works fine.
Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
{code}
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
spark = pyspark.sql.SparkSession.builder\
    .appName('test-mongo')\
    .master('local[*]')\
    .config(conf=conf)\
    .getOrCreate()
{code}
The above is in Python, but I've seen the same behaviour in other languages, though I didn't check R. I have also seen it in older Spark versions. It seems to be the only config key that doesn't work for me via the {{SparkSession}} builder config.
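One plausible explanation for the difference, offered here as an assumption rather than a confirmed diagnosis (the ticket discussion is authoritative), is ordering: {{spark.jars.packages}} has to be resolved when the gateway JVM is launched, so a key-value set on the builder after that JVM already exists arrives too late. The following pure-Python model illustrates the ordering effect; every class and attribute name in it is hypothetical, none of them are Spark APIs.

```python
# Illustrative model ONLY -- hypothetical names, not Spark APIs. It sketches
# the hypothesis that launch-time keys (like spark.jars.packages) are read
# exactly once, when the JVM starts, so setting them afterwards is a no-op.

class FakeJvm:
    """Stands in for the gateway JVM started by spark-submit."""
    def __init__(self, launch_conf):
        # Launch-time keys are honored only here, at construction.
        self.resolved_packages = launch_conf.get("spark.jars.packages", "")

class FakeSessionBuilder:
    """Stands in for SparkSession.builder (hypothetical sketch)."""
    def __init__(self, jvm=None):
        self._options = {}
        self._jvm = jvm  # non-None models an already-running gateway

    def config(self, key=None, value=None, conf=None):
        if conf is not None:
            self._options.update(conf)  # the SparkConf path
        else:
            self._options[key] = value  # the key-value path
        return self

    def get_or_create(self):
        if self._jvm is None:
            # JVM not started yet: launch-time keys still take effect.
            self._jvm = FakeJvm(self._options)
        # If the JVM already exists, keys set via config() above never
        # reach it -- mirroring the reported silent ignore.
        return self._jvm

# Key set before launch: the package is resolved.
fresh = FakeSessionBuilder().config(
    "spark.jars.packages",
    "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0").get_or_create()

# Key set after the JVM is already up: silently ignored.
late = FakeSessionBuilder(jvm=FakeJvm({})).config(
    "spark.jars.packages", "anything").get_or_create()
```

Under this (assumed) model, any mechanism that delivers the key before the JVM starts ({{spark-defaults.conf}}, {{--packages}} via {{PYSPARK_SUBMIT_ARGS}}) works, which is consistent with the observations in the report.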
[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes
[ https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936421#comment-15936421 ] Jakub Nowacki commented on SPARK-20049: --- I did a bit more digging and, as it turned out, the writing and reading performance was low most likely due to the number of files per partition. Namely, every folder contained a number of files corresponding to the number of partitions of the saved DataFrame, which was just over 3000 in my case. Repartitioning like:
{code}
# there is column 'date' in df
df.repartition("date").write.partitionBy("date").parquet("dest_dir")
{code}
fixes the issue, though it creates one file per partition, which is a bit too much in my case, but this can be fixed e.g.:
{code}
# there is column 'date' in df
df.repartition("date", hour("createdAt")).write.partitionBy("date").parquet("dest_dir")
{code}
which works similarly, but the files in the partition folders are smaller.
So IMO there are 4 issues to address:
# for some reason writing the files on HDFS takes a long time, which is not indicated anywhere and takes much longer than a normal write (in my case 5 minutes vs 1.5 hours)
# some form of additional progress indicator should be included somewhere in the UI, logs and/or shell output
# the suggestion to repartition before using {{partitionBy}} should be highlighted in the documentation
# maybe automatic repartitioning before saving should be considered, though this can be controversial
> Writing data to Parquet with partitions takes very long after the job finishes
> -
>
> Key: SPARK-20049
> URL: https://issues.apache.org/jira/browse/SPARK-20049
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, PySpark, SQL
> Affects Versions: 2.1.0
> Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian GNU/Linux 8.7 (jessie)
> Reporter: Jakub Nowacki
>
> I was testing writing a DataFrame to partitioned Parquet files. The command is quite straightforward and the data set is really a sample from a larger data
set in Parquet; the job is done in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job had been marked as finished in PySpark and the UI, the Python interpreter was still showing it as busy. Indeed, when I checked the HDFS folder I noticed that the files were still being transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} folders.
> First of all, it takes much longer than saving the same set without partitioning. Second, it is done in the background, without visible progress of any kind.
[jira] [Created] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes
Jakub Nowacki created SPARK-20049: --- Summary: Writing data to Parquet with partitions takes very long after the job finishes Key: SPARK-20049 URL: https://issues.apache.org/jira/browse/SPARK-20049 Project: Spark Issue Type: Bug Components: Input/Output, PySpark, SQL Affects Versions: 2.1.0 Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian GNU/Linux 8.7 (jessie) Reporter: Jakub Nowacki
I was testing writing a DataFrame to partitioned Parquet files. The command is quite straightforward and the data set is really a sample from a larger data set in Parquet; the job is done in PySpark on YARN and written to HDFS:
{code}
# there is column 'date' in df
df.write.partitionBy("date").parquet("dest_dir")
{code}
The reading part took as long as usual, but after the job had been marked as finished in PySpark and the UI, the Python interpreter was still showing it as busy. Indeed, when I checked the HDFS folder I noticed that the files were still being transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} folders.
First of all, it takes much longer than saving the same set without partitioning. Second, it is done in the background, without visible progress of any kind.
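To put rough numbers on the file explosion discussed in the comments: with {{partitionBy}} each upstream task partition writes its own file into every date folder it has rows for, so the output can approach (number of partitions × number of dates) files, while repartitioning by the partition column collapses this to one file per date folder. A small pure-Python counting model follows (not Spark code; the 3000-partition figure is taken from the comment above, and the 30-date data set is an assumed illustration):

```python
import random

def files_written(dates, partition_of):
    # The writer emits one file per non-empty (task partition, date) pair.
    return len({(partition_of(i, d), d) for i, d in enumerate(dates)})

n_partitions = 3000  # figure from the comment above
dates = [f"2017-03-{d:02d}" for d in range(1, 31)] * 1000  # 30 dates, assumed
random.seed(0)
random.shuffle(dates)

# Arbitrary upstream partitioning: most partitions hold rows for many dates,
# so the total file count blows up toward n_partitions * n_dates.
arbitrary = files_written(dates, lambda i, d: i % n_partitions)

# After df.repartition("date"): all rows of one date share one partition,
# so partitionBy writes exactly one file per date folder.
repartitioned = files_written(dates, lambda i, d: hash(d) % n_partitions)

print(arbitrary, repartitioned)
```

The model also shows why the slow phase appears after the job "finishes": committing tens of thousands of small files out of {{_temporary}} is driver-side filesystem work, not task work, so it gets no progress bar.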
[jira] [Comment Edited] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed
[ https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720710#comment-15720710 ] Jakub Nowacki edited comment on SPARK-18699 at 12/4/16 10:26 PM: --- While I don't argue that some other packages have similar behaviour, I think the PERMISSIVE mode should be, well, as permissive as possible, since CSVs have very few standards and no types. In my case I had just one odd value in an almost 1 TB set and the job crashed at the very end, after about an hour. To work around the issue one needs to manually parse each line, which is not the end of the world, but I wanted to use the CSV reader exactly for the confidence of not writing extra code. IMO the mode for error detection should be FAILFAST. Moreover, if I really need to check the data, I read it differently anyway. BTW thanks for looking into this.
> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Jakub Nowacki
>
> If a CSV is read and the schema contains any type other than String, an exception is thrown when a string value in the CSV is malformed; e.g. 
if the timestamp > does not match the defined one, an exception is thrown: > {code} > Caused by: java.lang.IllegalArgumentException > at java.sql.Date.valueOf(Date.java:143) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {code} > It behaves similarly with Integer and Long types, from what I've seen. > To my understanding, modes PERMISSIVE and DROPMALFORMED should just null the value or drop the line, but instead they kill the job.
[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed
[ https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720710#comment-15720710 ] Jakub Nowacki commented on SPARK-18699: --- While I don't argue that some other packages have similar behaviour, I think the PERMISSIVE mode should be, well, as permissive as possible, since CSVs have very few standards and no types. In my case I had just one odd value in an almost 1 TB set and the job crashed at the very end, after about an hour. To work around the issue one needs to manually parse each line, which is not the end of the world, but I wanted to use the CSV reader exactly for the confidence of not writing extra code. IMO the mode for error detection should be FAILFAST. Moreover, if I really need to check the data, I read it differently anyway.
> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Jakub Nowacki
>
> If a CSV is read and the schema contains any type other than String, an exception is thrown when a string value in the CSV is malformed; e.g. 
if the timestamp > does not match the defined one, an exception is thrown: > {code} > Caused by: java.lang.IllegalArgumentException > at java.sql.Date.valueOf(Date.java:143) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {code} > It behaves similarly with Integer and Long types, from what I've seen. > To my understanding, modes PERMISSIVE and DROPMALFORMED should just null the value or drop the line, but instead they kill the job.
[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed
[ https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719887#comment-15719887 ] Jakub Nowacki commented on SPARK-18699: --- Yes, my understanding was that it should nullify the value if it fails to parse it in PERMISSIVE mode, or drop the whole row (line) in DROPMALFORMED, as described in the docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader, i.e.:
* mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
** PERMISSIVE: sets other fields to null when it meets a corrupted record. When a schema is set by the user, it sets null for extra fields.
** DROPMALFORMED: ignores the whole corrupted records.
** FAILFAST: throws an exception when it meets corrupted records.
> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Jakub Nowacki
>
> If a CSV is read and the schema contains any type other than String, an exception is thrown when a string value in the CSV is malformed; e.g. 
if the timestamp > does not match the defined one, an exception is thrown: > {code} > Caused by: java.lang.IllegalArgumentException > at java.sql.Date.valueOf(Date.java:143) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272) > at scala.util.Try.getOrElse(Try.scala:79) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > {code} > It behaves similarly with Integer and Long types, from what I've seen. > To my understanding, modes PERMISSIVE and DROPMALFORMED should just null the value or drop the line, but instead they kill the job.
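The documented mode semantics quoted above can be summarized in a small standalone sketch. This is plain Python illustrating what the reporter expects the three modes to do with one typed column; it is not Spark's CSV parser, and {{parse_column}} is a hypothetical helper:

```python
def parse_column(values, cast, mode="PERMISSIVE"):
    """Apply the documented CSV mode semantics to one typed column (sketch)."""
    out = []
    for raw in values:
        try:
            out.append(cast(raw))
        except ValueError:
            if mode == "PERMISSIVE":
                out.append(None)      # null the malformed value
            elif mode == "DROPMALFORMED":
                continue              # drop the whole record
            else:                     # FAILFAST
                raise                 # the only mode meant to kill the job
    return out

values = ["1", "2", "oops", "4"]
print(parse_column(values, int))                   # [1, 2, None, 4]
print(parse_column(values, int, "DROPMALFORMED"))  # [1, 2, 4]
```

The bug report is precisely that, for non-String schema types, the actual reader made PERMISSIVE and DROPMALFORMED behave like the FAILFAST branch.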
[jira] [Created] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed
Jakub Nowacki created SPARK-18699: --- Summary: Spark CSV parsing types other than String throws exception when malformed Key: SPARK-18699 URL: https://issues.apache.org/jira/browse/SPARK-18699 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Jakub Nowacki
If a CSV is read and the schema contains any type other than String, an exception is thrown when a string value in the CSV is malformed; e.g. if the timestamp does not match the defined one, an exception is thrown:
{code}
Caused by: java.lang.IllegalArgumentException
        at java.sql.Date.valueOf(Date.java:143)
        at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
        at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
        at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
        at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
        at scala.util.Try.getOrElse(Try.scala:79)
        at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
        at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
        at org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
        ... 8 more
{code}
It behaves similarly with Integer and Long types, from what I've seen.
To my understanding, modes PERMISSIVE and DROPMALFORMED should just null the value or drop the line, but instead they kill the job.