[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208883#comment-16208883 ] Soumitra Sulav commented on SPARK-2984: --- [~ste...@apache.org] Multiple batches/jobs run at the same time, as we have set concurrentJobs to 4 and, in cases where load increases suddenly, jobs run in parallel. The directory tree comprises the partitions, with the current dest folder being the current hour and hence the same dir. We can add the batch time for each job to separate the parent folder, but we may end up with a lot of directories, and we don't want that for the current scenario. Is there a simpler way to not delete the _temporary folder while any write is still in progress? > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on an executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1, so it moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist. 
> at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > -- Chen Song at > http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html > {noformat} > I am running a Spark Streaming job that uses saveAsTextFiles to save results > into hdfs files. However, it has an exception after 20 batches > result-140631234/_temporary/0/task_201407251119__m_0
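The four-step race described above can be reproduced with plain local files, no Spark required. This is a standalone Python sketch (hypothetical helper names, not Spark's committer code) that mimics two speculative attempts of the same task sharing one _temporary directory:

```python
import os
import shutil
import tempfile

# Standalone sketch of the race in steps 1-4 above (no Spark involved):
# two speculative attempts of one task share a single _temporary directory;
# the first to finish commits and deletes _temporary wholesale, so the
# second attempt's files vanish out from under it.

def attempt_write(tmp_root, attempt_id, data):
    """Each attempt writes its output under the shared _temporary directory."""
    attempt_dir = os.path.join(tmp_root, "_temporary", attempt_id)
    os.makedirs(attempt_dir)
    with open(os.path.join(attempt_dir, "part-00000"), "w") as f:
        f.write(data)
    return attempt_dir

def commit(tmp_root, attempt_dir):
    """Move the attempt's files to the final destination, then delete the
    whole _temporary directory -- this is the destructive cleanup step."""
    for name in os.listdir(attempt_dir):
        shutil.move(os.path.join(attempt_dir, name),
                    os.path.join(tmp_root, name))
    shutil.rmtree(os.path.join(tmp_root, "_temporary"))

root = tempfile.mkdtemp()
dir_t1 = attempt_write(root, "attempt_1", "result")   # task T on executor E1
dir_t2 = attempt_write(root, "attempt_2", "result")   # speculative T' on E2

commit(root, dir_t1)                                  # T finishes first
committed_ok = os.path.exists(os.path.join(root, "part-00000"))

try:
    commit(root, dir_t2)      # T' now fails: _temporary is already gone
    raced = False
except FileNotFoundError:
    raced = True

shutil.rmtree(root)
```

The first commit wins and removes _temporary wholesale, so the loser's commit has nothing to list or move, which is the shape of the FileNotFoundException reported above.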
[jira] [Commented] (SPARK-22303) [SQL] Getting java.sql.SQLException: Unsupported type 101 for BINARY_DOUBLE
[ https://issues.apache.org/jira/browse/SPARK-22303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208873#comment-16208873 ] Sean Owen commented on SPARK-22303: --- If it's Oracle specific I don't think it's reasonable to expect support. > [SQL] Getting java.sql.SQLException: Unsupported type 101 for BINARY_DOUBLE > --- > > Key: SPARK-22303 > URL: https://issues.apache.org/jira/browse/SPARK-22303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kohki Nishio >Priority: Minor > > When a table contains columns such as BINARY_DOUBLE or BINARY_FLOAT, this > JDBC connector throws SQL exception > {code} > java.sql.SQLException: Unsupported type 101 > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:235) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:291) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:64) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146) > {code} > these types are Oracle specific ones, described here > https://docs.oracle.com/cd/E11882_01/timesten.112/e21642/types.htm#TTSQL148 -- This message was sent by Atlassian JIRA 
(v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22303) [SQL] Getting java.sql.SQLException: Unsupported type 101 for BINARY_DOUBLE
Kohki Nishio created SPARK-22303: Summary: [SQL] Getting java.sql.SQLException: Unsupported type 101 for BINARY_DOUBLE Key: SPARK-22303 URL: https://issues.apache.org/jira/browse/SPARK-22303 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kohki Nishio Priority: Minor When a table contains columns such as BINARY_DOUBLE or BINARY_FLOAT, this JDBC connector throws SQL exception {code} java.sql.SQLException: Unsupported type 101 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:235) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:292) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:291) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:64) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:47) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:306) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146) {code} these types are Oracle specific ones, described here https://docs.oracle.com/cd/E11882_01/timesten.112/e21642/types.htm#TTSQL148 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
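The failure mode can be illustrated without a database: the type lookup only knows the standard java.sql.Types codes, so Oracle's vendor-specific codes (OracleTypes.BINARY_FLOAT = 100, OracleTypes.BINARY_DOUBLE = 101) fall through to the error branch. A rough Python model of that dispatch follows; the mapping table and function names here are illustrative, not Spark's actual code:

```python
# Rough model of the getCatalystType dispatch in the stack trace above:
# only standard java.sql.Types codes are mapped, and anything else raises
# "Unsupported type".  The table below is illustrative, not Spark's code.

JDBC_TO_CATALYST = {
    4: "IntegerType",    # java.sql.Types.INTEGER
    6: "FloatType",      # java.sql.Types.FLOAT
    8: "DoubleType",     # java.sql.Types.DOUBLE
    12: "StringType",    # java.sql.Types.VARCHAR
}

def get_catalyst_type(sql_type, dialect_overrides=None):
    # In Spark, a JdbcDialect gets the first chance to map a column type
    # before the generic table is consulted; modeled here as a dict.
    if dialect_overrides and sql_type in dialect_overrides:
        return dialect_overrides[sql_type]
    try:
        return JDBC_TO_CATALYST[sql_type]
    except KeyError:
        raise Exception("Unsupported type %d" % sql_type)

# Oracle's driver reports BINARY_DOUBLE as code 101, so without a dialect
# mapping the lookup fails exactly as reported:
try:
    get_catalyst_type(101)
    error = None
except Exception as e:
    error = str(e)

# A dialect-level override could map the Oracle codes explicitly
# (hypothetical mapping, mirroring what a custom JdbcDialect might return):
oracle_overrides = {100: "FloatType", 101: "DoubleType"}
mapped = get_catalyst_type(101, oracle_overrides)
```

In real Spark, the analogous user-side workaround is a custom `org.apache.spark.sql.jdbc.JdbcDialect` whose `getCatalystType` handles codes 100/101, registered via `JdbcDialects.registerDialect`; the Scala details are left out here.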
[jira] [Resolved] (SPARK-22278) Expose current event time watermark and current processing time in GroupState
[ https://issues.apache.org/jira/browse/SPARK-22278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-22278. --- Resolution: Fixed Fix Version/s: (was: 2.2.0) 3.0.0 Issue resolved by pull request 19495 [https://github.com/apache/spark/pull/19495] > Expose current event time watermark and current processing time in GroupState > - > > Key: SPARK-22278 > URL: https://issues.apache.org/jira/browse/SPARK-22278 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 3.0.0 > > > Complex state-updating and/or timeout-handling logic in mapGroupsWithState > functions may require taking decisions based on the current event-time > watermark and/or processing-time. Currently, you can use the sql function > `current_timestamp` to get the current processing time, but it needs to be > inserted in every row with a select, and then passed through the > encoder, which isn't efficient. Furthermore, there is no way to get the > current watermark. > This JIRA is to expose them through the GroupState API. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
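A toy illustration (plain Python, hypothetical names, not the GroupState API) of the kind of decision such state-updating logic makes once both the event-time watermark and the current processing time are available to it:

```python
# Toy illustration of a timeout decision that needs both values this
# ticket exposes: the event-time watermark and the current processing
# time.  Hypothetical helper, not the GroupState API.

def should_expire_state(last_event_ts, watermark_ts,
                        last_update_ts, now_ts, idle_ms):
    """Expire a group's state if its newest event has fallen behind the
    event-time watermark, or if the group has been idle for too long
    in processing time."""
    behind_watermark = last_event_ts < watermark_ts
    idle_too_long = (now_ts - last_update_ts) > idle_ms
    return behind_watermark or idle_too_long

# A group whose newest data is older than the watermark is dropped:
expired = should_expire_state(1000, 2000, 900, 950, 500)

# A fresh group behind neither threshold is kept:
kept = should_expire_state(3000, 2000, 900, 1000, 500)
```

Before this change, the watermark side of such a check was simply unavailable inside the function, and the processing-time side had to be smuggled in per row via `current_timestamp`, as the description notes.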
[jira] [Commented] (SPARK-22302) Remove manual backports for subprocess.check_output and check_call
[ https://issues.apache.org/jira/browse/SPARK-22302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208722#comment-16208722 ] Apache Spark commented on SPARK-22302: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19524 > Remove manual backports for subprocess.check_output and check_call > -- > > Key: SPARK-22302 > URL: https://issues.apache.org/jira/browse/SPARK-22302 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > This JIRA is loosely related to SPARK-21573. Python 2.6 may still be used in > Jenkins, judging from past cases and investigations to my knowledge, and it > fails to execute some of the scripts. > In this particular case, it was: > {code} > cd dev && python2.6 > {code} > {code} > >>> from sparktestsupport import shellutils > >>> shellutils.subprocess_check_call("ls") > Traceback (most recent call last): > File "", line 1, in > File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call > retcode = call(*popenargs, **kwargs) > NameError: global name 'call' is not defined > {code} > Please see > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console > Since we dropped Python 2.6.x support, it seems better to remove those > workarounds and print explicit error messages, to avoid duplicating the > effort of finding the root causes of such cases. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22302) Remove manual backports for subprocess.check_output and check_call
[ https://issues.apache.org/jira/browse/SPARK-22302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22302: Assignee: (was: Apache Spark) > Remove manual backports for subprocess.check_output and check_call > -- > > Key: SPARK-22302 > URL: https://issues.apache.org/jira/browse/SPARK-22302 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > This JIRA is loosely related to SPARK-21573. Python 2.6 may still be used in > Jenkins, judging from past cases and investigations to my knowledge, and it > fails to execute some of the scripts. > In this particular case, it was: > {code} > cd dev && python2.6 > {code} > {code} > >>> from sparktestsupport import shellutils > >>> shellutils.subprocess_check_call("ls") > Traceback (most recent call last): > File "", line 1, in > File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call > retcode = call(*popenargs, **kwargs) > NameError: global name 'call' is not defined > {code} > Please see > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console > Since we dropped Python 2.6.x support, it seems better to remove those > workarounds and print explicit error messages, to avoid duplicating the > effort of finding the root causes of such cases. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22302) Remove manual backports for subprocess.check_output and check_call
[ https://issues.apache.org/jira/browse/SPARK-22302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22302: Assignee: Apache Spark > Remove manual backports for subprocess.check_output and check_call > -- > > Key: SPARK-22302 > URL: https://issues.apache.org/jira/browse/SPARK-22302 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > This JIRA is loosely related to SPARK-21573. Python 2.6 may still be used in > Jenkins, judging from past cases and investigations to my knowledge, and it > fails to execute some of the scripts. > In this particular case, it was: > {code} > cd dev && python2.6 > {code} > {code} > >>> from sparktestsupport import shellutils > >>> shellutils.subprocess_check_call("ls") > Traceback (most recent call last): > File "", line 1, in > File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call > retcode = call(*popenargs, **kwargs) > NameError: global name 'call' is not defined > {code} > Please see > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console > Since we dropped Python 2.6.x support, it seems better to remove those > workarounds and print explicit error messages, to avoid duplicating the > effort of finding the root causes of such cases. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22302) Remove manual backports for subprocess.check_output and check_call
Hyukjin Kwon created SPARK-22302: Summary: Remove manual backports for subprocess.check_output and check_call Key: SPARK-22302 URL: https://issues.apache.org/jira/browse/SPARK-22302 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Priority: Trivial This JIRA is loosely related to SPARK-21573. Python 2.6 may still be used in Jenkins, judging from past cases and investigations to my knowledge, and it fails to execute some of the scripts. In this particular case, it was: {code} cd dev && python2.6 {code} {code} >>> from sparktestsupport import shellutils >>> shellutils.subprocess_check_call("ls") Traceback (most recent call last): File "", line 1, in File "sparktestsupport/shellutils.py", line 46, in subprocess_check_call retcode = call(*popenargs, **kwargs) NameError: global name 'call' is not defined {code} Please see https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3950/console Since we dropped Python 2.6.x support, it seems better to remove those workarounds and print explicit error messages, to avoid duplicating the effort of finding the root causes of such cases. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
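With Python 2.6 out of the picture, the manual backports in sparktestsupport/shellutils.py can be replaced by the stdlib functions directly; both have been available since Python 2.7. A minimal sketch of the direct calls:

```python
import subprocess
import sys

# With Python 2.6 support dropped, the stdlib versions (available since
# Python 2.7) can be called directly instead of the manual backports in
# sparktestsupport/shellutils.py.

# check_call: runs the command, raising CalledProcessError on a non-zero
# exit code.
subprocess.check_call([sys.executable, "-c", "print('hello')"])

# check_output: same semantics, but also captures and returns stdout.
out = subprocess.check_output([sys.executable, "-c", "print('hello')"])

# A failing command surfaces an explicit, catchable error -- rather than
# the confusing NameError shown in the traceback above:
try:
    subprocess.check_call([sys.executable, "-c", "raise SystemExit(2)"])
except subprocess.CalledProcessError as e:
    code = e.returncode
```

`sys.executable` is used here so the snippet is self-contained; the dev scripts would invoke their own commands.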
[jira] [Commented] (SPARK-22295) Chi Square selector not recognizing field in Data frame
[ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208708#comment-16208708 ] Peng Meng commented on SPARK-22295: --- Hi [~cheburakshu], thanks for reporting this bug and for the helpful code. This is caused by a similar problem to, but not the same thing as, SPARK-22277. The reason is that when transforming a dataframe, the field/attribute is not set correctly. There may be other similar bugs in the code; we can solve them separately or together. [~yanboliang] [~mlnick] [~srowen] > Chi Square selector not recognizing field in Data frame > --- > > Key: SPARK-22295 > URL: https://issues.apache.org/jira/browse/SPARK-22295 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > ChiSquare selector is not recognizing the field 'class' which is present in > the data frame while fitting the model. I am using the PIMA Indians diabetes > dataset of UCI. The complete code and output are below for reference. But, > when some rows of the input file are created as a dataframe manually, it will > work. Couldn't understand the pattern here. > Kindly help. 
> {code:python} > from pyspark.ml.feature import VectorAssembler, ChiSqSelector > import sys > file_name='data/pima-indians-diabetes.data' > df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache() > df.show(1) > assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' > test', ' mass', ' pedi', ' age'],outputCol="features") > df=assembler.transform(df) > df.show(1) > try: > css=ChiSqSelector(numTopFeatures=5, featuresCol="features", > outputCol="selected", labelCol='class').fit(df) > except: > print(sys.exc_info()) > {code} > Output: > ++-+-+-+-+-+-++--+ > |preg| plas| pres| skin| test| mass| pedi| age| class| > ++-+-+-+-+-+-++--+ > | 6| 148| 72| 35|0| 33.6|0.627| 50| 1| > ++-+-+-+-+-+-++--+ > only showing top 1 row > ++-+-+-+-+-+-++--++ > |preg| plas| pres| skin| test| mass| pedi| age| class|features| > ++-+-+-+-+-+-++--++ > | 6| 148| 72| 35|0| 33.6|0.627| 50| 1|[6.0,148.0,72.0,3...| > ++-+-+-+-+-+-++--++ > only showing top 1 row > (, > IllegalArgumentException('Field "class" does not exist.', > 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t > at > org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at > scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at > org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at > org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t > at > org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t > at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t > at java.lang.reflect.Method.invoke(Method.java:498)\n\t at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at > py4j.Gateway.invoke(Gateway.java:280)\n\t at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at > py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at > py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at > java.lang.Thread.run(Thread.java:745)'), ) > *The below code works fine: > * > {code:python} > from pyspark.ml.feature import VectorAssembler, ChiSqSelector > import sys > file_name='data/pima-indians-diabetes.data' > #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache() > # Just pasted a few rows from the input file and created a data frome. This > will work, but not the frame picked up from the file > df = spark.createDataFrame([ > [6,148,72,35,0,33.6,0.627,50,1], > [1,85,66,29,0,26.6,0.351,31,0], > [8,183,64,0,0,23.3,0.672,32,1], > ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', > "class"]) > df.show(1) > assembler = VectorAssembler(
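One thing worth noting in the failing example: the CSV header carries leading spaces (' plas', ' pres', ' age', ...), while labelCol is given as 'class' with no space. Whether or not that is the root cause here (the comment above points at transform-time field/attribute metadata instead), normalizing column names is a cheap user-side sanity check. A plain-Python sketch of the normalization a caller could apply, e.g. via `df.toDF(*clean)`; the header list below is an assumption reconstructed from the code in the report:

```python
# Hypothetical reconstruction of the CSV header from the failing example
# above -- the leading spaces on most names are taken from the
# VectorAssembler call; whether 'class' also carries one is an assumption.
raw_columns = ['preg', ' plas', ' pres', ' skin', ' test',
               ' mass', ' pedi', ' age', ' class']

def normalize(columns):
    """Strip surrounding whitespace so lookups like labelCol='class'
    match the actual schema field names."""
    return [c.strip() for c in columns]

clean = normalize(raw_columns)
# After normalization, a plain 'class' lookup can succeed.
```

This does not fix the underlying Spark bug the comment describes; it only removes one common source of "Field does not exist" confusion with headers inferred from CSV files.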
[jira] [Commented] (SPARK-17902) collect() ignores stringsAsFactors
[ https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208685#comment-16208685 ] Hyukjin Kwon commented on SPARK-17902: -- Hi [~falaki] and [~shivaram], I was thinking a just simple way such as : {quote} if (stringsAsFactors) { df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor) } {quote} Would it make sense? > collect() ignores stringsAsFactors > -- > > Key: SPARK-17902 > URL: https://issues.apache.org/jira/browse/SPARK-17902 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > `collect()` function signature includes an optional flag named > `stringsAsFactors`. It seems it is completely ignored. > {code} > str(collect(createDataFrame(iris), stringsAsFactors = TRUE))) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17902) collect() ignores stringsAsFactors
[ https://issues.apache.org/jira/browse/SPARK-17902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208685#comment-16208685 ] Hyukjin Kwon edited comment on SPARK-17902 at 10/18/17 1:29 AM: Hi [~falaki] and [~shivaram], I was thinking a just simple way such as : {code} if (stringsAsFactors) { df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor) } {code} Would it make sense? was (Author: hyukjin.kwon): Hi [~falaki] and [~shivaram], I was thinking a just simple way such as : {quote} if (stringsAsFactors) { df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor) } {quote} Would it make sense? > collect() ignores stringsAsFactors > -- > > Key: SPARK-17902 > URL: https://issues.apache.org/jira/browse/SPARK-17902 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > `collect()` function signature includes an optional flag named > `stringsAsFactors`. It seems it is completely ignored. > {code} > str(collect(createDataFrame(iris), stringsAsFactors = TRUE))) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22204) Explain output for SQL with commands shows no optimization
[ https://issues.apache.org/jira/browse/SPARK-22204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208676#comment-16208676 ] Andrew Ash commented on SPARK-22204: One way to work around this issue could be by getting the child of the command node and running explain on that. This does do the query planning twice though. See also discussion at https://github.com/apache/spark/pull/19269#discussion_r139841435 > Explain output for SQL with commands shows no optimization > -- > > Key: SPARK-22204 > URL: https://issues.apache.org/jira/browse/SPARK-22204 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Andrew Ash > > When displaying the explain output for a basic SELECT query, the query plan > changes as expected from analyzed -> optimized stages. But when putting that > same query into a command, for example {{CREATE TABLE}} it appears that the > optimization doesn't take place. > In Spark shell: > Explain output for a {{SELECT}} statement shows optimization: > {noformat} > scala> spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) > AS b) AS c) AS d").explain(true) > == Parsed Logical Plan == > 'Project ['a] > +- 'SubqueryAlias d >+- 'Project ['a] > +- 'SubqueryAlias c > +- 'Project ['a] > +- SubqueryAlias b >+- Project [1 AS a#29] > +- OneRowRelation > == Analyzed Logical Plan == > a: int > Project [a#29] > +- SubqueryAlias d >+- Project [a#29] > +- SubqueryAlias c > +- Project [a#29] > +- SubqueryAlias b >+- Project [1 AS a#29] > +- OneRowRelation > == Optimized Logical Plan == > Project [1 AS a#29] > +- OneRowRelation > == Physical Plan == > *Project [1 AS a#29] > +- Scan OneRowRelation[] > scala> > {noformat} > But the same command run inside {{CREATE TABLE}} does not: > {noformat} > scala> spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM > (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true) > == Parsed Logical Plan == > 
'CreateTable `tmptable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > Ignore > +- 'Project ['a] >+- 'SubqueryAlias d > +- 'Project ['a] > +- 'SubqueryAlias c > +- 'Project ['a] >+- SubqueryAlias b > +- Project [1 AS a#33] > +- OneRowRelation > == Analyzed Logical Plan == > CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, > InsertIntoHiveTable] >+- Project [a#33] > +- SubqueryAlias d > +- Project [a#33] > +- SubqueryAlias c >+- Project [a#33] > +- SubqueryAlias b > +- Project [1 AS a#33] > +- OneRowRelation > == Optimized Logical Plan == > CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, > InsertIntoHiveTable] >+- Project [a#33] > +- SubqueryAlias d > +- Project [a#33] > +- SubqueryAlias c >+- Project [a#33] > +- SubqueryAlias b > +- Project [1 AS a#33] > +- OneRowRelation > == Physical Plan == > CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand > [Database:default}, TableName: tmptable, InsertIntoHiveTable] >+- Project [a#33] > +- SubqueryAlias d > +- Project [a#33] > +- SubqueryAlias c >+- Project [a#33] > +- SubqueryAlias b > +- Project [1 AS a#33] > +- OneRowRelation > scala> > {noformat} > Note that there is no change between the analyzed and optimized plans when > run in a command. > This is misleading my users, as they think that there is no optimization > happening in the query! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
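The suggested workaround, explaining the child of the command node, can be modeled with a toy plan tree. The classes and the collapse rule below are hypothetical stand-ins, not Catalyst:

```python
# Toy model of the workaround above: the optimizer's project-collapsing
# work is visible when run on the command node's child, while the command
# wrapper itself shows identical analyzed and optimized plans.
# Hypothetical classes, not Catalyst.

class Project:
    def __init__(self, expr, child):
        self.expr, self.child = expr, child

class OneRowRelation:
    child = None

class CreateTableCommand:
    """Command node wrapping the query, as in the CREATE TABLE case."""
    def __init__(self, child):
        self.child = child

def optimize(plan):
    """Collapse chains of Project-over-Project down to a single Project."""
    if isinstance(plan, Project) and isinstance(plan.child, Project):
        return optimize(Project(plan.expr, plan.child.child))
    return plan

def explain_depth(plan):
    """Stand-in for explain(): report how deep the plan tree is."""
    depth = 0
    while plan is not None:
        depth, plan = depth + 1, plan.child
    return depth

# SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a)))
query = Project("a", Project("a", Project("a", OneRowRelation())))
command = CreateTableCommand(query)

# The child as embedded in the command still has the nested projects...
unoptimized_depth = explain_depth(command.child)
# ...but explicitly optimizing the child collapses them, which is what
# "explain the child of the command node" surfaces:
optimized_depth = explain_depth(optimize(command.child))
```

As the comment notes, doing this in real Spark means planning the query twice, so it is a diagnostic aid rather than a fix.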
[jira] [Commented] (SPARK-22301) Add rule to Optimizer for In with empty list of values
[ https://issues.apache.org/jira/browse/SPARK-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208640#comment-16208640 ] Apache Spark commented on SPARK-22301: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19523 > Add rule to Optimizer for In with empty list of values > -- > > Key: SPARK-22301 > URL: https://issues.apache.org/jira/browse/SPARK-22301 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marco Gaido > > For performance reasons, we should resolve an In operation on an empty list as > false in the optimization phase. > For further reference, please see the discussion on PRs: > https://github.com/apache/spark/pull/19522 and > https://github.com/apache/spark/pull/19494. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22301) Add rule to Optimizer for In with empty list of values
[ https://issues.apache.org/jira/browse/SPARK-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22301: Assignee: (was: Apache Spark) > Add rule to Optimizer for In with empty list of values > -- > > Key: SPARK-22301 > URL: https://issues.apache.org/jira/browse/SPARK-22301 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marco Gaido > > For performance reasons, we should resolve an In operation on an empty list as > false in the optimization phase. > For further reference, please see the discussion on PRs: > https://github.com/apache/spark/pull/19522 and > https://github.com/apache/spark/pull/19494. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22301) Add rule to Optimizer for In with empty list of values
[ https://issues.apache.org/jira/browse/SPARK-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22301: Assignee: Apache Spark > Add rule to Optimizer for In with empty list of values > -- > > Key: SPARK-22301 > URL: https://issues.apache.org/jira/browse/SPARK-22301 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marco Gaido >Assignee: Apache Spark > > For performance reasons, we should resolve an In operation on an empty list as > false in the optimization phase. > For further reference, please see the discussion on PRs: > https://github.com/apache/spark/pull/19522 and > https://github.com/apache/spark/pull/19494. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22301) Add rule to Optimizer for In with empty list of values
Marco Gaido created SPARK-22301: --- Summary: Add rule to Optimizer for In with empty list of values Key: SPARK-22301 URL: https://issues.apache.org/jira/browse/SPARK-22301 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Marco Gaido For performance reasons, we should resolve an In operation on an empty list as false in the optimization phase. For further reference, please see the discussion on PRs: https://github.com/apache/spark/pull/19522 and https://github.com/apache/spark/pull/19494. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
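A toy version of the proposed rule, an expression rewrite that replaces In-with-empty-values by a false literal before execution. This sketch deliberately ignores SQL null semantics (what `NULL IN ()` should evaluate to is a subtlety a real optimizer rule has to consider), so it illustrates the idea rather than Spark's eventual implementation:

```python
# Toy optimizer rule: rewrite In(value, []) to a false literal at
# optimization time, so the predicate never reaches execution.
# SQL null semantics are deliberately ignored here -- this is a sketch,
# not Spark's implementation.

class In:
    def __init__(self, value, values):
        self.value, self.values = value, values

class Literal:
    def __init__(self, value):
        self.value = value

def optimize_in(expr):
    """Apply the empty-list rewrite; leave other expressions untouched."""
    if isinstance(expr, In) and len(expr.values) == 0:
        return Literal(False)   # nothing is a member of the empty list
    return expr

rewritten = optimize_in(In("x", []))       # collapses to a constant
untouched = optimize_in(In("x", [1, 2]))   # non-empty list is left alone
```

The performance motivation is that the constant can then short-circuit whole filter branches, instead of evaluating a membership test per row.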
[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208614#comment-16208614 ] yuhao yang commented on SPARK-22289: Thanks for the reply. I'll start composing a PR. > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Nic Eggert > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter. > Model is set up like this: > {code} > val calibrator = new LogisticRegression() > .setFeaturesCol("uncalibrated_probability") > .setLabelCol("label") > .setWeightCol("weight") > .setStandardization(false) > .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0))) > .setFamily("binomial") > .setProbabilityCol("probability") > .setPredictionCol("logistic_prediction") > .setRawPredictionCol("logistic_raw_prediction") > {code} > {code} > 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. 
> at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277) > at > org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253) > at > org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > -snip- > {code} -- This message was sent by 
Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
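The {{scala.NotImplementedError}} above says the default {{Param.jsonEncode}} handles only strings and vectors. A hedged sketch of the kind of encoding an override could produce for a dense matrix, in standalone Python rather than Spark's ML code (the field names here are assumptions, not Spark's eventual wire format): serialize the dimensions plus the flat value array.

```python
import json


def json_encode_matrix(num_rows, num_cols, values):
    """Encode a dense matrix as its dimensions plus the flat value array."""
    return json.dumps(
        {"numRows": num_rows, "numCols": num_cols, "values": list(values)},
        separators=(",", ":"),  # compact output, no spaces
    )
```

For the {{DenseMatrix(1, 1, Array(0.0))}} bound in the report, this would yield a small self-describing JSON object that a matching decoder can round-trip.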
[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208607#comment-16208607 ] Joseph K. Bradley commented on SPARK-13030: --- Does multi-column support need to be put in this same task? Isn't that an orthogonal issue? Or is the proposal to use this chance to break the API to add simpler multi-column support? > Change OneHotEncoder to Estimator > - > > Key: SPARK-13030 > URL: https://issues.apache.org/jira/browse/SPARK-13030 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk > > OneHotEncoder should be an Estimator, just like in scikit-learn > (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). > In its current form, it is impossible to use when number of categories is > different between training dataset and test dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
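Why the Estimator split matters can be shown without Spark: fit() learns the category set once on the training data, and the resulting model applies that fixed width to any later dataset, even one where some categories never occur. A plain-Python sketch (illustrative, not Spark's or scikit-learn's API):

```python
class OneHotModel:
    """Model produced by fitting: remembers the training-time categories."""

    def __init__(self, categories):
        self.categories = categories

    def transform(self, value):
        # Output width is fixed by the fitted categories, not by the
        # dataset being transformed.
        return [1 if c == value else 0 for c in self.categories]


def fit(train_values):
    """Estimator step: learn the category set from the training data."""
    return OneHotModel(sorted(set(train_values)))
```

A model fitted on ["a", "b", "c"] always emits width-3 vectors; a pure Transformer that recomputes categories per dataset would emit width-2 vectors on a test set lacking "c", which is exactly the train/test mismatch the issue describes.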
[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208601#comment-16208601 ] Yanbo Liang commented on SPARK-22289: - +1 for option 2. Please feel free to send a PR. Thanks. > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Nic Eggert > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter. > Model is set up like this: > {code} > val calibrator = new LogisticRegression() > .setFeaturesCol("uncalibrated_probability") > .setLabelCol("label") > .setWeightCol("weight") > .setStandardization(false) > .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0))) > .setFamily("binomial") > .setProbabilityCol("probability") > .setPredictionCol("logistic_prediction") > .setRawPredictionCol("logistic_raw_prediction") > {code} > {code} > 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. 
> at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277) > at > org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253) > at > org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > -snip- > {code} -- This message was sent by 
Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22283) withColumn should replace multiple instances with a single one
[ https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208585#comment-16208585 ] Liang-Chi Hsieh edited comment on SPARK-22283 at 10/17/17 11:43 PM: [~kitbellew] {{withColumn}} adds/replaces an existing column that has the same name. In the ambiguous columns case, it sounds reasonable that it replaces the columns with the same name. We should let an API do one thing: {{withColumn}} adds/replaces a column with the same name. It sounds weird to implicitly drop one column when there are ambiguous columns. For the use case {{a.join(b, ..., "left").withColumn("c", coalesce(a("c"), b("c")).select(..., "c")}}, if you just want to get one of the ambiguous columns, a simple workaround is to select the column explicitly, like {{a.join(b, ..., "left").select(..., a("c"))}}. was (Author: viirya): [~kitbellew] I didn't mean you're doing select. I meant you can't select the ambiguous columns by name, so isn't it reasonable that you can't use withColumn on the ambiguous columns by name either? They follow the same behavior. For the use case {{a.join(b, ..., "left").withColumn("c", coalesce(a("c"), b("c")).select(..., "c")}}, if you just want to get one of the ambiguous columns, a simple workaround is to select the column explicitly, like {{a.join(b, ..., "left").select(..., a("c"))}}. 
> withColumn should replace multiple instances with a single one > -- > > Key: SPARK-22283 > URL: https://issues.apache.org/jira/browse/SPARK-22283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Albert Meltzer > > Currently, {{withColumn}} claims to do the following: _"adding a column or > replacing the existing column that has the same name."_ > Unfortunately, if multiple existing columns have the same name (which is a > normal occurrence after a join), this results in multiple replaced -- and > retained -- > columns (with the same value), and messages about an ambiguous column. > The current implementation of {{withColumn}} contains this: > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val shouldReplace = output.exists(f => resolver(f.name, colName)) > if (shouldReplace) { > val columns = output.map { field => > if (resolver(field.name, colName)) { > col.as(colName) > } else { > Column(field) > } > } > select(columns : _*) > } else { > select(Column("*"), col.as(colName)) > } > } > {noformat} > Instead, suggest something like this (which replaces all matching fields with > a single instance of the new one): > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val existing = output.filterNot(f => resolver(f.name, colName)).map(new > Column(_)) > select(existing :+ col.as(colName): _*) > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one
[ https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208585#comment-16208585 ] Liang-Chi Hsieh commented on SPARK-22283: - [~kitbellew] I didn't mean you're doing select. I meant you can't select the ambiguous columns by name, so isn't it reasonable that you can't also withColumn the ambiguous columns by name? They are following the same behavior. For the use case {{a.join(b, ..., "left").withColumn("c", coalesce(a("c"), b("c")).select(..., "c")}}, if you just want to get one of the ambiguous columns, the simple workaround can be simply selecting the column like {{a.join(b, ..., "left").select(..., a("c"))}}. > withColumn should replace multiple instances with a single one > -- > > Key: SPARK-22283 > URL: https://issues.apache.org/jira/browse/SPARK-22283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Albert Meltzer > > Currently, {{withColumn}} claims to do the following: _"adding a column or > replacing the existing column that has the same name."_ > Unfortunately, if multiple existing columns have the same name (which is a > normal occurrence after a join), this results in multiple replaced -- and > retained -- > columns (with the same value), and messages about an ambiguous column. 
> The current implementation of {{withColumn}} contains this: > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val shouldReplace = output.exists(f => resolver(f.name, colName)) > if (shouldReplace) { > val columns = output.map { field => > if (resolver(field.name, colName)) { > col.as(colName) > } else { > Column(field) > } > } > select(columns : _*) > } else { > select(Column("*"), col.as(colName)) > } > } > {noformat} > Instead, suggest something like this (which replaces all matching fields with > a single instance of the new one): > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val existing = output.filterNot(f => resolver(f.name, colName)).map(new > Column(_)) > select(existing :+ col.as(colName): _*) > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
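The schema-level effect of the suggested implementation can be sketched over column names alone (an illustrative helper with a simple case-sensitive match, not the DataFrame API): every column matching the name is dropped and a single new instance is appended, so no ambiguity survives.

```python
def with_column_names(output, col_name):
    """All columns matching col_name are replaced by a single trailing one."""
    return [c for c in output if c != col_name] + [col_name]
```

A post-join schema ["a", "c", "c"] becomes ["a", "c"] instead of retaining two ambiguous "c" columns, which is the behavior change the suggested `filterNot` implementation produces.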
[jira] [Commented] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
[ https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208462#comment-16208462 ] Apache Spark commented on SPARK-22249: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19522 > UnsupportedOperationException: empty.reduceLeft when caching a dataframe > > > Key: SPARK-22249 > URL: https://issues.apache.org/jira/browse/SPARK-22249 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: $ uname -a > Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 > 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.2.0 > /_/ > > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92 > Branch > Compiled by user jenkins on 2017-06-30T22:58:04Z > Revision > Url >Reporter: Andreas Maier >Assignee: Marco Gaido > Fix For: 2.2.1, 2.3.0 > > > It seems that the {{isin()}} method with an empty list as argument only > works, if the dataframe is not cached. If it is cached, it results in an > exception. 
To reproduce > {code:java} > $ pyspark > >>> df = spark.createDataFrame([pyspark.Row(KEY="value")]) > >>> df.where(df["KEY"].isin([])).show() > +---+ > |KEY| > +---+ > +---+ > >>> df.cache() > DataFrame[KEY: string] > >>> df.where(df["KEY"].isin([])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", > line 336, in show > print(self._jdf.showString(n, 20)) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString. > : java.lang.UnsupportedOperationException: empty.reduceLeft > at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) > at > scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) > at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) > at scala.collection.AbstractTraversable.reduce(Traversable.scala:104) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) > at > 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$ano
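The root failure is a generic empty-reduce, reproducible outside Spark: reducing with no initial value throws on an empty collection, which is what {{InMemoryTableScanExec}} hits when the {{isin([])}} filter contributes no partition-filter predicates. Python's {{functools.reduce}} shows the same shape, and the usual guard is an initial value or an explicit emptiness check:

```python
from functools import reduce
from operator import and_

preds = []  # no predicates, as with isin([]) on a cached dataframe

# Reducing an empty list with no initial value raises, like Scala's
# empty.reduceLeft in the stack trace above.
try:
    reduce(and_, preds)
    crashed = False
except TypeError:
    crashed = True

# Guarded variant: supply an identity, so the empty case yields it unchanged.
guarded = reduce(and_, preds, True)
```

The Spark-side fix is the same idea applied where the scan builds its filter: don't reduce an empty predicate list.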
[jira] [Resolved] (SPARK-22050) Allow BlockUpdated events to be optionally logged to the event log
[ https://issues.apache.org/jira/browse/SPARK-22050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22050. Resolution: Fixed Assignee: Michael Mior Fix Version/s: 2.3.0 > Allow BlockUpdated events to be optionally logged to the event log > -- > > Key: SPARK-22050 > URL: https://issues.apache.org/jira/browse/SPARK-22050 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Michael Mior >Assignee: Michael Mior >Priority: Minor > Fix For: 2.3.0 > > > I see that block updates are not logged to the event log. > This makes sense as a default for performance reasons. > However, I find it helpful when trying to get a better understanding of > caching for a job to be able to log these updates. > This PR adds a configuration setting {{spark.eventLog.blockUpdates}} > (defaulting to false) which allows block updates to be recorded in the log. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
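Assuming the setting name given in the description (the merged patch may spell it differently), enabling it alongside the event log would look like:

```shell
# Event logging on, plus the new opt-in block-update logging
# (defaults to false for performance).
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.blockUpdates=true \
  my_job.jar
```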
[jira] [Updated] (SPARK-22298) SparkUI executor URL encode appID
[ https://issues.apache.org/jira/browse/SPARK-22298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22298: -- Issue Type: Improvement (was: Bug) > SparkUI executor URL encode appID > - > > Key: SPARK-22298 > URL: https://issues.apache.org/jira/browse/SPARK-22298 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0, 2.2.0 >Reporter: Alexander Naspo >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > Spark Executor Page will return a blank list when the application id contains > a forward slash. You can see the /allexecutors api failing with a 404. This > can be fixed trivially by url encoding the appId before making the call to > `/api/v1/applications//allexecutors` in executorspage.js. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
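The fix is plain percent-encoding before splicing the id into the REST path; the actual patch is in {{executorspage.js}}, where JavaScript's {{encodeURIComponent}} plays this role, but the idea is the same in any language. A Python sketch with an assumed helper name:

```python
from urllib.parse import quote


def all_executors_url(app_id):
    """Percent-encode the id so an embedded '/' cannot break the path."""
    # safe="" forces '/' to be encoded as %2F instead of passed through.
    return "/api/v1/applications/%s/allexecutors" % quote(app_id, safe="")
```

An application id like "app/1" then reaches the server as one path segment rather than producing a 404 on a mangled route.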
[jira] [Resolved] (SPARK-22271) Describe results in "null" for the value of "mean" of a numeric variable
[ https://issues.apache.org/jira/browse/SPARK-22271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22271. - Resolution: Fixed Assignee: Huaxin Gao Fix Version/s: 2.3.0 2.2.1 > Describe results in "null" for the value of "mean" of a numeric variable > > > Key: SPARK-22271 > URL: https://issues.apache.org/jira/browse/SPARK-22271 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: >Reporter: Shafique Jamal >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > Attachments: decimalNumbers.zip > > > Please excuse me if this issue was addressed already - I was unable to find > it. > Calling .describe().show() on my dataframe results in a value of null for the > row "mean": > {noformat} > val foo = spark.read.parquet("decimalNumbers.parquet") > foo.select(col("numericvariable")).describe().show() > foo: org.apache.spark.sql.DataFrame = [numericvariable: decimal(38,32)] > +---++ > |summary| numericvariable| > +---++ > | count| 299| > | mean|null| > | stddev| 0.2376438793946738| > |min|0.037815489727642...| > |max|2.138189366554511...| > {noformat} > But all of the rows for this seem ok (I can attach a parquet file). When I > round the column, however, all is fine: > {noformat} > foo.select(bround(col("numericvariable"), 31)).describe().show() > +---+---+ > |summary|bround(numericvariable, 31)| > +---+---+ > | count|299| > | mean| 0.139522503183236...| > | stddev| 0.2376438793946738| > |min| 0.037815489727642...| > |max| 2.138189366554511...| > +---+---+ > {noformat} > Rounding using 32 gives null also though. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
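A plausible mechanism, hedged since the fix commit is the authority: the column is {{decimal(38,32)}}, and representing the mean at that scale can need more coefficient digits than the 38-digit precision cap allows, with overflow surfacing as null rather than an error; rounding to 31 fractional digits frees a digit, while 32 does not. Python's {{decimal}} module shows the same digit arithmetic with an illustrative helper:

```python
from decimal import Context, Decimal, InvalidOperation


def fits_decimal(value, precision, scale):
    """Return the value at the given scale, or None (Spark's null) when the
    coefficient would need more digits than the precision allows."""
    ctx = Context(prec=precision)
    try:
        # quantize raises InvalidOperation if the result needs > prec digits.
        return ctx.quantize(Decimal(value), Decimal(1).scaleb(-scale))
    except InvalidOperation:
        return None
```

A value with seven integer digits fits at scale 31 (7 + 31 = 38 digits) but not at scale 32 (39 digits), mirroring why {{bround(..., 31)}} produced a mean while 32 did not.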
[jira] [Updated] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs.
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Randy Tidd updated SPARK-22296: --- Summary: CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. (was: CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java') > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in the constructor of a case class is probably poor form, but > Scala allows it. 
When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq > argValue84; > build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq > argValue85; > and > build 17-Oct-2017 05:30:54/* 8217 */ final > com.xyz.xyz.xyz.domain.Account value1 = false ? null : new > com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, > argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, > argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, > argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, > argValue24, argValue25, argValue26, argValue27, argValue28,
[jira] [Assigned] (SPARK-22300) Update ORC to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-22300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22300: Assignee: (was: Apache Spark) > Update ORC to 1.4.1 > --- > > Key: SPARK-22300 > URL: https://issues.apache.org/jira/browse/SPARK-22300 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > Apache ORC 1.4.1 is released yesterday. > - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/ > Like ORC-233 (Allow `orc.include.columns` to be empty), there are several > important fixes. > We had better use the latest one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22300) Update ORC to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-22300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208180#comment-16208180 ] Apache Spark commented on SPARK-22300: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19521 > Update ORC to 1.4.1 > --- > > Key: SPARK-22300 > URL: https://issues.apache.org/jira/browse/SPARK-22300 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > Apache ORC 1.4.1 is released yesterday. > - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/ > Like ORC-233 (Allow `orc.include.columns` to be empty), there are several > important fixes. > We had better use the latest one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22300) Update ORC to 1.4.1
[ https://issues.apache.org/jira/browse/SPARK-22300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22300: Assignee: Apache Spark > Update ORC to 1.4.1 > --- > > Key: SPARK-22300 > URL: https://issues.apache.org/jira/browse/SPARK-22300 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > Apache ORC 1.4.1 is released yesterday. > - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/ > Like ORC-233 (Allow `orc.include.columns` to be empty), there are several > important fixes. > We had better use the latest one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22296) CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Randy Tidd updated SPARK-22296: --- Description: This is with Scala 2.11. We have a case class that has a constructor with 85 args, the last two of which are: var chargesInst : scala.collection.mutable.Seq[ChargeInstitutional] = scala.collection.mutable.Seq.empty[ChargeInstitutional], var chargesProf : scala.collection.mutable.Seq[ChargeProfessional] = scala.collection.mutable.Seq.empty[ChargeProfessional] A mutable Seq in a the constructor of a case class is probably poor form but Scala allows it. When we run this job we get this error: build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 8217, Column 70: No applicable constructor/method found for actual parameters "java.lang.String, java.lang.String, long, java.lang.String, long, long, long, java.lang.String, long, long, double, scala.Option, scala.Option, java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, long, long, long, long, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, int, double, double, java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, java.lang.String, long, 
java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, scala.collection.Seq, boolean, scala.collection.Seq, scala.collection.Seq"; candidates are: "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, java.lang.String, long, long, long, java.lang.String, long, long, double, scala.Option, scala.Option, java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, long, long, long, long, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, int, double, double, java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, java.lang.String, long, java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, scala.collection.Seq, boolean, scala.collection.mutable.Seq, scala.collection.mutable.Seq)" The relevant lines are: build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq argValue84; build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq argValue85; and build 17-Oct-2017 05:30:54/* 8217 */ final com.xyz.xyz.xyz.domain.Account value1 = false ? 
null : new com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, argValue24, argValue25, argValue26, argValue27, argValue28, argValue29, argValue30, argValue31, argValue32, argValue33, argValue34, argValue35, argValue36, argValue37, argValue38, argValue39, argValue40, argValue41, argValue42, argValue43, argValue44, argValue45, argValue46, argValue47, argValue48, argValue49, argValue50, argValue51, argValue52, argValue53, argValue54, argValue55, argValue56, argValue57, argValue58, argValue59, argValue60, argValue61, argValue62, argValue63, argValue64, argValue65, argValue66, argValue67, argValue68, argValue69, argValue70, argValue71, argValue72, argValue73, argValue74, argValue75, argValue76, argValue77, argValue78, argValue79, argValue80, argValue81, argValue82, argValue83, argValue84, argValue85); In short, Spark uses scala.collection.Seq in the generated code which is
[jira] [Updated] (SPARK-22296) CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq
[ https://issues.apache.org/jira/browse/SPARK-22296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Randy Tidd updated SPARK-22296: --- Summary: CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. scala.collection.Seq (was: CodeGenerator - failed to compile when constructor has scala.collection.mutable.Seq vs. ) > CodeGenerator - failed to compile when constructor has > scala.collection.mutable.Seq vs. scala.collection.Seq > > > Key: SPARK-22296 > URL: https://issues.apache.org/jira/browse/SPARK-22296 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Randy Tidd > > This is with Scala 2.11. > We have a case class that has a constructor with 85 args, the last two of > which are: > var chargesInst : > scala.collection.mutable.Seq[ChargeInstitutional] = > scala.collection.mutable.Seq.empty[ChargeInstitutional], > var chargesProf : > scala.collection.mutable.Seq[ChargeProfessional] = > scala.collection.mutable.Seq.empty[ChargeProfessional] > A mutable Seq in the constructor of a case class is probably poor form but > Scala allows it. 
When we run this job we get this error: > build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch > worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 8217, Column 70: No applicable constructor/method found for actual parameters > "java.lang.String, java.lang.String, long, java.lang.String, long, long, > long, java.lang.String, long, long, double, scala.Option, scala.Option, > java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, long, long, long, long, > scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, long, java.lang.String, int, double, > double, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, > com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, > java.lang.String, long, java.lang.String, int, int, boolean, boolean, > scala.collection.Seq, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, scala.collection.Seq"; candidates are: > "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, > java.lang.String, long, long, long, java.lang.String, long, long, double, > scala.Option, scala.Option, java.lang.String, java.lang.String, long, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, 
java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > long, long, long, long, scala.Option, scala.Option, scala.Option, > scala.Option, scala.Option, java.lang.String, java.lang.String, > java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, > java.lang.String, int, double, double, java.lang.String, java.lang.String, > java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, > java.lang.String, long, long, long, long, java.lang.String, > com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, > scala.collection.Seq, scala.collection.Seq, java.lang.String, long, > java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, > scala.collection.Seq, boolean, scala.collection.mutable.Seq, > scala.collection.mutable.Seq)" > The relevant lines are: > build 17-Oct-2017 05:30:50/* 093 */ private scala.collection.Seq > argValue84; > build 17-Oct-2017 05:30:50/* 094 */ private scala.collection.Seq > argValue85; > and > build 17-Oct-2017 05:30:54/* 8217 */ final > com.xyz.xyz.xyz.domain.Account value1 = false ? null : new > com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, > argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, > argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, > argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, > argValue24, argV
[jira] [Created] (SPARK-22300) Update ORC to 1.4.1
Dongjoon Hyun created SPARK-22300: - Summary: Update ORC to 1.4.1 Key: SPARK-22300 URL: https://issues.apache.org/jira/browse/SPARK-22300 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.3.0 Reporter: Dongjoon Hyun Apache ORC 1.4.1 was released yesterday. - https://orc.apache.org/news/2017/10/16/ORC-1.4.1/ Like ORC-233 (Allow `orc.include.columns` to be empty), there are several important fixes. We had better use the latest one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21840) Allow multiple SparkSubmit invocations in same JVM without polluting system properties
[ https://issues.apache.org/jira/browse/SPARK-21840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21840: Assignee: (was: Apache Spark) > Allow multiple SparkSubmit invocations in same JVM without polluting system > properties > -- > > Key: SPARK-21840 > URL: https://issues.apache.org/jira/browse/SPARK-21840 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Filing this as a sub-task of SPARK-11035; this feature was discussed as part > of the PR currently attached to that bug. > Basically, to allow the launcher library to run applications in-process, the > easiest way is for it to run the {{SparkSubmit}} class. But that class > currently propagates configuration to applications by modifying system > properties. > That means that when launching multiple applications in that manner in the > same JVM, the configuration of the first application may leak into the second > application (or to any other invocation of `new SparkConf()` for that matter). > This feature is about breaking out the fix for this particular issue from the > PR linked to SPARK-11035. With the changes in SPARK-21728, the implementation > can even be further enhanced by providing an actual {{SparkConf}} instance to > the application, instead of opaque maps.
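The configuration leak SPARK-21840 describes can be sketched in plain Python (all names here are hypothetical stand-ins; the real mechanism is JVM system properties read by Scala's {{SparkConf}}): a launcher that publishes config through mutable shared state contaminates every later launch in the same process, while passing an explicit config object does not.

```python
# Hypothetical sketch of the SPARK-21840 leak: a SparkSubmit-style launcher
# that publishes config via mutable global state (standing in for JVM system
# properties) pollutes later launches in the same process.

GLOBAL_PROPS = {}  # stands in for java.lang.System properties


class Conf:
    """Mimics `new SparkConf()`, which reads spark.* system properties."""

    def __init__(self, explicit=None):
        # An explicit config wins; otherwise fall back to the shared globals.
        self.settings = dict(GLOBAL_PROPS) if explicit is None else dict(explicit)


def submit_via_globals(app_conf):
    GLOBAL_PROPS.update(app_conf)   # mutates process-wide state
    return Conf()


def submit_isolated(app_conf):
    return Conf(explicit=app_conf)  # no shared state touched


first = submit_via_globals({"spark.app.name": "app1",
                            "spark.executor.memory": "4g"})
second = submit_via_globals({"spark.app.name": "app2"})
# app2 silently inherits app1's executor memory setting -- the leak:
assert second.settings["spark.executor.memory"] == "4g"

third = submit_isolated({"spark.app.name": "app3"})
assert "spark.executor.memory" not in third.settings
```

This is the design point in the last paragraph of the ticket: handing the application an actual config instance, instead of scraping mutable global state, is what makes concurrent in-process submissions safe.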
[jira] [Commented] (SPARK-22298) SparkUI executor URL encode appID
[ https://issues.apache.org/jira/browse/SPARK-22298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208172#comment-16208172 ] Apache Spark commented on SPARK-22298: -- User 'alexnaspo' has created a pull request for this issue: https://github.com/apache/spark/pull/19520 > SparkUI executor URL encode appID > - > > Key: SPARK-22298 > URL: https://issues.apache.org/jira/browse/SPARK-22298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.2.0 >Reporter: Alexander Naspo >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > Spark Executor Page will return a blank list when the application id contains > a forward slash. You can see the /allexecutors api failing with a 404. This > can be fixed trivially by url encoding the appId before making the call to > `/api/v1/applications//allexecutors` in executorspage.js.
[jira] [Commented] (SPARK-21840) Allow multiple SparkSubmit invocations in same JVM without polluting system properties
[ https://issues.apache.org/jira/browse/SPARK-21840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208171#comment-16208171 ] Apache Spark commented on SPARK-21840: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19519 > Allow multiple SparkSubmit invocations in same JVM without polluting system > properties > -- > > Key: SPARK-21840 > URL: https://issues.apache.org/jira/browse/SPARK-21840 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Filing this as a sub-task of SPARK-11035; this feature was discussed as part > of the PR currently attached to that bug. > Basically, to allow the launcher library to run applications in-process, the > easiest way is for it to run the {{SparkSubmit}} class. But that class > currently propagates configuration to applications by modifying system > properties. > That means that when launching multiple applications in that manner in the > same JVM, the configuration of the first application may leak into the second > application (or to any other invocation of `new SparkConf()` for that matter). > This feature is about breaking out the fix for this particular issue from the > PR linked to SPARK-11035. With the changes in SPARK-21728, the implementation > can even be further enhanced by providing an actual {{SparkConf}} instance to > the application, instead of opaque maps.
[jira] [Assigned] (SPARK-22298) SparkUI executor URL encode appID
[ https://issues.apache.org/jira/browse/SPARK-22298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22298: Assignee: Apache Spark > SparkUI executor URL encode appID > - > > Key: SPARK-22298 > URL: https://issues.apache.org/jira/browse/SPARK-22298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.2.0 >Reporter: Alexander Naspo >Assignee: Apache Spark >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > Spark Executor Page will return a blank list when the application id contains > a forward slash. You can see the /allexecutors api failing with a 404. This > can be fixed trivially by url encoding the appId before making the call to > `/api/v1/applications//allexecutors` in executorspage.js.
[jira] [Assigned] (SPARK-21840) Allow multiple SparkSubmit invocations in same JVM without polluting system properties
[ https://issues.apache.org/jira/browse/SPARK-21840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21840: Assignee: Apache Spark > Allow multiple SparkSubmit invocations in same JVM without polluting system > properties > -- > > Key: SPARK-21840 > URL: https://issues.apache.org/jira/browse/SPARK-21840 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > Filing this as a sub-task of SPARK-11035; this feature was discussed as part > of the PR currently attached to that bug. > Basically, to allow the launcher library to run applications in-process, the > easiest way is for it to run the {{SparkSubmit}} class. But that class > currently propagates configuration to applications by modifying system > properties. > That means that when launching multiple applications in that manner in the > same JVM, the configuration of the first application may leak into the second > application (or to any other invocation of `new SparkConf()` for that matter). > This feature is about breaking out the fix for this particular issue from the > PR linked to SPARK-11035. With the changes in SPARK-21728, the implementation > can even be further enhanced by providing an actual {{SparkConf}} instance to > the application, instead of opaque maps.
[jira] [Assigned] (SPARK-22298) SparkUI executor URL encode appID
[ https://issues.apache.org/jira/browse/SPARK-22298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22298: Assignee: (was: Apache Spark) > SparkUI executor URL encode appID > - > > Key: SPARK-22298 > URL: https://issues.apache.org/jira/browse/SPARK-22298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.2.0 >Reporter: Alexander Naspo >Priority: Trivial > Original Estimate: 1h > Remaining Estimate: 1h > > Spark Executor Page will return a blank list when the application id contains > a forward slash. You can see the /allexecutors api failing with a 404. This > can be fixed trivially by url encoding the appId before making the call to > `/api/v1/applications//allexecutors` in executorspage.js.
[jira] [Created] (SPARK-22299) Use OFFSET and LIMIT for JDBC DataFrameReader striping
Zack Behringer created SPARK-22299: -- Summary: Use OFFSET and LIMIT for JDBC DataFrameReader striping Key: SPARK-22299 URL: https://issues.apache.org/jira/browse/SPARK-22299 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0, 1.5.0, 1.4.0 Reporter: Zack Behringer Priority: Minor Loading a large table (300M rows) from JDBC can be partitioned into tasks using the column, numPartitions, lowerBound and upperBound parameters on DataFrameReader.jdbc(), but that becomes troublesome if the column is skewed/fragmented (as in somebody used a global sequence for the partition column instead of a sequence specific to the table, or if the table becomes fragmented by deletes, etc.). This can be worked around by using a modulus operation on the column, but that will be slow unless there is already an index using the modulus expression with the exact numPartitions value, so that doesn't scale well if you want to change the number of partitions. Another way would be to use an expression index on a hash of the partition column, but I'm not sure if JDBC striping is smart enough to create hash ranges for each stripe using hashes of the lower and upper bound parameters. If it is, that is great, but still that requires a very large index just for this use case. A less invasive approach would be to use the table's physical ordering along with OFFSET and LIMIT so that only the total number of records to read would need to be known beforehand in order to evenly distribute, no indexes needed. I realize that OFFSET and LIMIT are not standard SQL keywords. I also see that a list of custom predicates can be defined. I haven't tried that to see if I can embed numPartitions specific predicates each with their own OFFSET and LIMIT range. Some relational databases take quite a long time to count the number of records in order to determine the stripe size, though, so this can also be troublesome. 
Could a feature similar to "spark.sql.files.maxRecordsPerFile" be used in conjunction with the number of executors to read manageable batches (internally using OFFSET and LIMIT) until there are no more available results?
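The even-striping arithmetic the ticket proposes can be sketched in a few lines of Python. This is only an illustration of computing contiguous OFFSET/LIMIT ranges from a known row count; whether such ranges can actually be pushed through DataFrameReader.jdbc()'s predicates parameter is exactly the open question in the report, and the clause syntax shown is PostgreSQL-style since, as the report notes, OFFSET and LIMIT are not standard SQL.

```python
def offset_limit_stripes(total_rows, num_partitions):
    """Split total_rows into num_partitions contiguous (offset, limit) ranges,
    spreading the remainder so stripe sizes differ by at most one row."""
    base, extra = divmod(total_rows, num_partitions)
    stripes, offset = [], 0
    for i in range(num_partitions):
        limit = base + (1 if i < extra else 0)
        stripes.append((offset, limit))
        offset += limit
    return stripes


# e.g. the ticket's 300M-row table across 8 stripes (hypothetical clause
# strings -- ordering must be pinned by the table's physical order for
# OFFSET to be deterministic)
clauses = ["LIMIT %d OFFSET %d" % (limit, offset)
           for offset, limit in offset_limit_stripes(300_000_000, 8)]
```

The stripes cover every row exactly once with no index required, which is the "less invasive approach" the report describes; the cost that remains is the up-front COUNT to learn total_rows.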
[jira] [Created] (SPARK-22298) SparkUI executor URL encode appID
Alexander Naspo created SPARK-22298: --- Summary: SparkUI executor URL encode appID Key: SPARK-22298 URL: https://issues.apache.org/jira/browse/SPARK-22298 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.2.0, 2.1.0 Reporter: Alexander Naspo Priority: Trivial Spark Executor Page will return a blank list when the application id contains a forward slash. You can see the /allexecutors api failing with a 404. This can be fixed trivially by url encoding the appId before making the call to `/api/v1/applications//allexecutors` in executorspage.js.
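The fix the reporter describes is plain percent-encoding of the path segment. A Python sketch of the same idea (the actual fix lives in executorspage.js, where the JavaScript equivalent of this encoding would be applied; the app id below is made up):

```python
from urllib.parse import quote


def all_executors_url(app_id):
    """Build the REST path with the app id percent-encoded.

    safe="" forces '/' inside the id to become %2F, so an id containing a
    forward slash no longer splits the path and 404s."""
    return "/api/v1/applications/%s/allexecutors" % quote(app_id, safe="")


# a hypothetical application id containing a forward slash
url = all_executors_url("app/2017-10-17")
```

Without the encoding, the slash in the id would be interpreted as a path separator and the `/allexecutors` request would fail with a 404, which is the blank-executor-list symptom in the report.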
[jira] [Created] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"
Marcelo Vanzin created SPARK-22297: -- Summary: Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf" Key: SPARK-22297 URL: https://issues.apache.org/jira/browse/SPARK-22297 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 2.3.0 Reporter: Marcelo Vanzin Priority: Minor Ran into this locally; the test code seems to use timeouts which generally end up in flakiness like this. {noformat} [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are working *** FAILED *** (1 second, 203 milliseconds) [info] "Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task." did not contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) [info] at org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370) [info] at org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) [info] at org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323) {noformat}
[jira] [Created] (SPARK-22296) CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
Randy Tidd created SPARK-22296: -- Summary: CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' Key: SPARK-22296 URL: https://issues.apache.org/jira/browse/SPARK-22296 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.1.0 Reporter: Randy Tidd We intermittently get this error when running Spark jobs. It is very intermittent, I would estimate one in every 50-100 runs. We can't always capture a full log file and our code is complex and 1000's of lines and the log file is 533MB so I can't post it all. When it occurs we just run the exact same job again and it runs fine, we do not have a repeatable case. I believe it only happens in client mode, and not cluster mode. Sorry I know this isn't a lot to go on but I wanted to report it since I see other similar bugs like SPARK-19984 and SPARK-17936. If there is any other way to collect more info or diagnostics please let me know. build 17-Oct-2017 05:30:502017-10-17 09:30:50 [Executor task launch worker-1] ERROR o.a.s.s.c.e.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 8217, Column 70: No applicable constructor/method found for actual parameters "java.lang.String, java.lang.String, long, java.lang.String, long, long, long, java.lang.String, long, long, double, scala.Option, scala.Option, java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, long, long, long, long, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, int, double, double, 
java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, java.lang.String, long, java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, scala.collection.Seq, boolean, scala.collection.Seq, scala.collection.Seq"; candidates are: "com.xyz.xyz.xyz.domain.Account(java.lang.String, java.lang.String, long, java.lang.String, long, long, long, java.lang.String, long, long, double, scala.Option, scala.Option, java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, long, long, long, long, scala.Option, scala.Option, scala.Option, scala.Option, scala.Option, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, int, double, double, java.lang.String, java.lang.String, java.lang.String, long, java.lang.String, java.lang.String, java.lang.String, java.lang.String, long, long, long, long, java.lang.String, com.xyz.xyz.xyz.domain.Patient, com.xyz.xyz.xyz.domain.Physician, scala.collection.Seq, scala.collection.Seq, java.lang.String, long, java.lang.String, int, int, boolean, boolean, scala.collection.Seq, boolean, scala.collection.Seq, boolean, scala.collection.mutable.Seq, scala.collection.mutable.Seq)" The relevant line is: build 17-Oct-2017 05:30:54/* 8217 */ final com.xyz.xyz.xyz.domain.Account value1 = false ? 
null : new com.xyz.xyz.xyz.domain.Account(argValue2, argValue3, argValue4, argValue5, argValue6, argValue7, argValue8, argValue9, argValue10, argValue11, argValue12, argValue13, argValue14, argValue15, argValue16, argValue17, argValue18, argValue19, argValue20, argValue21, argValue22, argValue23, argValue24, argValue25, argValue26, argValue27, argValue28, argValue29, argValue30, argValue31, argValue32, argValue33, argValue34, argValue35, argValue36, argValue37, argValue38, argValue39, argValue40, argValue41, argValue42, argValue43, argValue44, argValue45, argValue46, argValue47, argValue48, argValue49, argValue50, argValue51, argValue52, argValue53, argValue54, argValue55, argValue56, argValue57, argValue58, argValue59, argValue60, argValue61, argValue62, argValue63, argValue64, argValue65, argValue66, argValue67, argValue68, argValue69, argValue70, argValue71, argValue72, argValue73, argValue74, argValue75, argValue76, argValue77, argValue78, a
[jira] [Resolved] (SPARK-22295) Chi Square selector not recognizing field in Data frame
[ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheburakshu resolved SPARK-22295. - Resolution: Invalid > Chi Square selector not recognizing field in Data frame > --- > > Key: SPARK-22295 > URL: https://issues.apache.org/jira/browse/SPARK-22295 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > ChiSquare selector is not recognizing the field 'class' which is present in > the data frame while fitting the model. I am using PIMA Indians diabetes > dataset of UCI. The complete code and output is below for reference. But, > when some rows of the input file is created as a dataframe manually, it will > work. Couldn't understand the pattern here. > Kindly help. > {code:python} > from pyspark.ml.feature import VectorAssembler, ChiSqSelector > import sys > file_name='data/pima-indians-diabetes.data' > df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache() > df.show(1) > assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' > test', ' mass', ' pedi', ' age'],outputCol="features") > df=assembler.transform(df) > df.show(1) > try: > css=ChiSqSelector(numTopFeatures=5, featuresCol="features", > outputCol="selected", labelCol='class').fit(df) > except: > print(sys.exc_info()) > {code} > Output: > ++-+-+-+-+-+-++--+ > |preg| plas| pres| skin| test| mass| pedi| age| class| > ++-+-+-+-+-+-++--+ > | 6| 148| 72| 35|0| 33.6|0.627| 50| 1| > ++-+-+-+-+-+-++--+ > only showing top 1 row > ++-+-+-+-+-+-++--++ > |preg| plas| pres| skin| test| mass| pedi| age| class|features| > ++-+-+-+-+-+-++--++ > | 6| 148| 72| 35|0| 33.6|0.627| 50| 1|[6.0,148.0,72.0,3...| > ++-+-+-+-+-+-++--++ > only showing top 1 row > (, > IllegalArgumentException('Field "class" does not exist.', > 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t > at > 
org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at > scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at > org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at > org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t > at > org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t > at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t > at java.lang.reflect.Method.invoke(Method.java:498)\n\t at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at > py4j.Gateway.invoke(Gateway.java:280)\n\t at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at > py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at > py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at > java.lang.Thread.run(Thread.java:745)'), ) > *The below code works fine: > * > {code:python} > from pyspark.ml.feature import VectorAssembler, ChiSqSelector > import sys > file_name='data/pima-indians-diabetes.data' > #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache() > # Just pasted a few rows from the input file and created a data frome. 
This > will work, but not the frame picked up from the file > df = spark.createDataFrame([ > [6,148,72,35,0,33.6,0.627,50,1], > [1,85,66,29,0,26.6,0.351,31,0], > [8,183,64,0,0,23.3,0.672,32,1], > ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', > "class"]) > df.show(1) > assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin', ' > test', ' mass', ' pedi', ' age'],outputCol="features") > df=assembler.transform(df) > df.show(1) > try: > css=ChiSqSelector(numTopFeatures=5, featuresCol="features", > outputCol="selected", labelCol="class").fit(df) > except: > print(sys.exc_info()) > print(css.selectedFeatures) > {code} > Output: > ++-+-
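One plausible reading of the SPARK-22295 symptom (hedged: the ticket was resolved as Invalid and no root cause is recorded here): the CSV header appears to carry leading spaces, since the VectorAssembler columns are named ' plas', ' pres', and so on, so the label column read from the file would likewise be ' class' rather than 'class', while the hand-built DataFrame names it "class" exactly. If that reading is right, normalizing the header names would reconcile the two cases; a pure-Python sketch (in PySpark the equivalent would be something like df.toDF(*[c.strip() for c in df.columns])):

```python
def strip_column_names(columns):
    """Trim stray whitespace from CSV-derived column names so that lookups
    like labelCol='class' match the schema."""
    return [c.strip() for c in columns]


# column names as they appear in the ticket's VectorAssembler call,
# with the hypothesized ' class' label column
cols = ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi',
        ' age', ' class']
cleaned = strip_column_names(cols)
```

This would explain why the manually created DataFrame works: its column list spells "class" with no leading space, so the schema lookup in ChiSqSelector.transformSchema succeeds.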
[jira] [Updated] (SPARK-22295) Chi Square selector not recognizing field in Data frame
[ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheburakshu updated SPARK-22295:
--------------------------------

Description:

The ChiSquare selector is not recognizing the field 'class', which is present in the data frame, while fitting the model. I am using the PIMA Indians diabetes dataset from UCI. The complete code and output are below for reference. However, when a few rows of the input file are turned into a data frame manually, it works. I couldn't understand the pattern here. Kindly help.

{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name = 'data/pima-indians-diabetes.data'
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(file_name).cache()
df.show(1)

assembler = VectorAssembler(
    inputCols=['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'],
    outputCol="features")
df = assembler.transform(df)
df.show(1)

try:
    css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
                        outputCol="selected", labelCol='class').fit(df)
except:
    print(sys.exc_info())
{code}

Output:

{noformat}
+----+-----+-----+-----+-----+-----+-----+----+------+
|preg| plas| pres| skin| test| mass| pedi| age| class|
+----+-----+-----+-----+-----+-----+-----+----+------+
|   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
+----+-----+-----+-----+-----+-----+-----+----+------+
only showing top 1 row

+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
|preg| plas| pres| skin| test| mass| pedi| age| class|            features|
+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
|   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
only showing top 1 row

(, IllegalArgumentException('Field "class" does not exist.', 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t at org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), )
{noformat}

*The below code works fine:*

{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name = 'data/pima-indians-diabetes.data'
# df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(file_name).cache()
# Just pasted a few rows from the input file and created a data frame.
# This will work, but the frame picked up from the file will not.
df = spark.createDataFrame([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', "class"])
df.show(1)

assembler = VectorAssembler(
    inputCols=['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'],
    outputCol="features")
df = assembler.transform(df)
df.show(1)

try:
    css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
                        outputCol="selected", labelCol="class").fit(df)
except:
    print(sys.exc_info())
print(css.selectedFeatures)
{code}

Output:

{noformat}
+----+-----+-----+-----+-----+-----+-----+----+-----+
|preg| plas| pres| skin| test| mass| pedi| age|class|
+----+-----+-----+-----+-----+-----+-----+----+-----+
|   6|  148|   72|   35|    0| 33.6|0.627|  50|    1|
+----+-----+-----+-----+-----+-----+-----+----+-----+
only showing top 1 row

+----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
|preg| plas| pres| skin| test| mass| pedi| age|class|            features|
+----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
|   6|  148|   72|   35|    0| 33.6|0.627|  50|    1|[6.0,148.0,72.0,3...|
{noformat}
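A likely explanation, though not confirmed in this thread: the CSV header carries leading spaces (the reporter's own inputCols use ' plas', ' pres', ...), so the file-based frame most likely registered the label column as ' class', which labelCol='class' cannot resolve. A minimal plain-Python sketch of the mismatch and the fix (no Spark required; `raw_header` is a hypothetical stand-in for `df.columns`):

```python
# Hypothetical header as Spark would read it from the CSV: note the
# leading spaces on every column after the first, including ' class'.
raw_header = ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age', ' class']

# Stripping the names removes the mismatch that labelCol='class' trips over.
clean = [name.strip() for name in raw_header]

assert 'class' not in raw_header   # lookup by exact name fails on the raw header
assert 'class' in clean            # and succeeds after normalizing
```

In PySpark the equivalent one-liner would be something like `df = df.toDF(*[c.strip() for c in df.columns])` before fitting the selector.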
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208078#comment-16208078 ]

Apache Spark commented on SPARK-18016:
--------------------------------------

User 'bdrillard' has created a pull request for this issue:
https://github.com/apache/spark/pull/19518

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -----------------------------------------------------------------
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Aleksander Eskilson
> Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
> When attempting to encode collections of large Java objects to Datasets having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0xFFFF
>     at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>     at org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>     at org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>     at org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>     at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>     at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>     at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>     at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>     at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>     at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>     at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>     at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>     at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>     at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>     at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>     at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>     at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>     at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>     at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>     at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>     at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>     at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>     at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>     at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>     at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>     at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>     at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>     at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>     at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>     at org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>     at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>
[jira] [Updated] (SPARK-22295) Chi Square selector not recognizing field in Data frame
[ https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheburakshu updated SPARK-22295:
--------------------------------

Description:

The ChiSquare selector is not recognizing the field 'class', which is present in the data frame, while fitting the model. I am using the PIMA Indians diabetes dataset from UCI. The complete code and output are below for reference. Kindly help.

{code:python}
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
import sys

file_name = 'data/pima-indians-diabetes.data'
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(file_name).cache()
df.show(1)

assembler = VectorAssembler(
    inputCols=['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age'],
    outputCol="features")
df = assembler.transform(df)
df.show(1)

try:
    css = ChiSqSelector(numTopFeatures=5, featuresCol="features",
                        outputCol="selected", labelCol='class').fit(df)
except:
    print(sys.exc_info())
{code}

Output:

{noformat}
+----+-----+-----+-----+-----+-----+-----+----+------+
|preg| plas| pres| skin| test| mass| pedi| age| class|
+----+-----+-----+-----+-----+-----+-----+----+------+
|   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|
+----+-----+-----+-----+-----+-----+-----+----+------+
only showing top 1 row

+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
|preg| plas| pres| skin| test| mass| pedi| age| class|            features|
+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
|   6|  148|   72|   35|    0| 33.6|0.627|  50|     1|[6.0,148.0,72.0,3...|
+----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
only showing top 1 row

(, IllegalArgumentException('Field "class" does not exist.', 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t at org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), )
{noformat}

was:

There is a difference in behavior between using the ChiSquare selector output and using the features directly in a decision tree classifier. In the code below, I have used the chi-square selector as a pass-through, but the decision tree classifier is unable to process its output. It is able to process the features when they are used directly. The example is pulled directly from the Apache Spark Python documentation. Kindly help.

{code:python}
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
import sys

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])

# ChiSq selector will just be a pass-through. All four features in the input will be in the output also.
selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

try:
    # Fails
    dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
    model = dt.fit(result)
except:
    print(sys.exc_info())

# Works
dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
model = dt.fit(df)

# Make predictions. Using the same dataset, not splitting!!
predictions = model.transform(result)

# Select example rows to display.
predictions.select("prediction", "clicked", "features").show(5)

# Select (prediction, true label) and compute test error.
evaluator = MulticlassClassificationEvaluator(
    labelCol="clicked", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
{code}
[jira] [Created] (SPARK-22295) Chi Square selector not recognizing field in Data frame
Cheburakshu created SPARK-22295:
-----------------------------------

Summary: Chi Square selector not recognizing field in Data frame
Key: SPARK-22295
URL: https://issues.apache.org/jira/browse/SPARK-22295
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 2.1.1
Reporter: Cheburakshu

There is a difference in behavior between using the ChiSquare selector output and using the features directly in a decision tree classifier. In the code below, I have used the chi-square selector as a pass-through, but the decision tree classifier is unable to process its output. It is able to process the features when they are used directly. The example is pulled directly from the Apache Spark Python documentation. Kindly help.

{code:python}
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
import sys

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", "clicked"])

# ChiSq selector will just be a pass-through. All four features in the input will be in the output also.
selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected" % selector.getNumTopFeatures())

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

try:
    # Fails
    dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
    model = dt.fit(result)
except:
    print(sys.exc_info())

# Works
dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
model = dt.fit(df)

# Make predictions. Using the same dataset, not splitting!!
predictions = model.transform(result)

# Select example rows to display.
predictions.select("prediction", "clicked", "features").show(5)

# Select (prediction, true label) and compute test error.
evaluator = MulticlassClassificationEvaluator(
    labelCol="clicked", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 - accuracy))
{code}

Output:

{noformat}
ChiSqSelector output with top 4 features selected

(, IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but it does not have the number of values specified.', 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t at org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), )

+----------+-------+------------------+
|prediction|clicked|          features|
+----------+-------+------------------+
|       1.0|    1.0|[0.0,0.0,18.0,1.0]|
|       0.0|    0.0|[0.0,1.0,12.0,0.0]|
|       0.0|    0.0|[1.0,0.0,15.0,0.1]|
+----------+-------+------------------+

Test Error = 0
{noformat}

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
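The "Nominal (categorical), but it does not have the number of values specified" error above comes from the categorical-feature metadata check the tree trainer runs ({{MetadataUtils.getCategoricalFeatures}}): every feature marked nominal must declare how many distinct values it has, and the selector's output metadata evidently omits that count. A simplified plain-Python re-creation of the check (hypothetical dict structure; the real code operates on ML attribute metadata, not dicts):

```python
# Simplified stand-in for MetadataUtils.getCategoricalFeatures: collect
# {feature index -> number of values} for nominal features, failing when a
# nominal feature has no declared value count.
def get_categorical_features(attrs):
    categorical = {}
    for idx, attr in enumerate(attrs):
        if attr.get("type") == "nominal":
            if "num_values" not in attr:
                raise ValueError(
                    f"Feature {idx} is marked as Nominal (categorical), "
                    "but it does not have the number of values specified.")
            categorical[idx] = attr["num_values"]
    return categorical

# Metadata shaped like the selector's output in this report: nominal, no count.
selector_output = [{"type": "nominal"}, {"type": "numeric"}]
try:
    get_categorical_features(selector_output)
    failed = False
except ValueError:
    failed = True
assert failed  # mirrors the DecisionTreeClassifier failure on selectedFeatures
```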
[jira] [Updated] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
[ https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-22249:
----------------------------
Component/s: (was: PySpark)
SQL

> UnsupportedOperationException: empty.reduceLeft when caching a dataframe
> ------------------------------------------------------------------------
>
> Key: SPARK-22249
> URL: https://issues.apache.org/jira/browse/SPARK-22249
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1, 2.2.0
> Environment: $ uname -a
> Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
> $ pyspark --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
>       /_/
>
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92
> Branch
> Compiled by user jenkins on 2017-06-30T22:58:04Z
> Revision
> Url
> Reporter: Andreas Maier
> Assignee: Marco Gaido
> Fix For: 2.2.1, 2.3.0
>
> It seems that the {{isin()}} method with an empty list as argument only works if the dataframe is not cached. If it is cached, it results in an exception. To reproduce:
> {code:java}
> $ pyspark
> >>> df = spark.createDataFrame([pyspark.Row(KEY="value")])
> >>> df.where(df["KEY"].isin([])).show()
> +---+
> |KEY|
> +---+
> +---+
> >>> df.cache()
> DataFrame[KEY: string]
> >>> df.where(df["KEY"].isin([])).show()
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 336, in show
>     print(self._jdf.showString(n, 20))
>   File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
>   File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
>     return f(*a, **kw)
>   File "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString.
> : java.lang.UnsupportedOperationException: empty.reduceLeft
>     at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180)
>     at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48)
>     at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74)
>     at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48)
>     at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208)
>     at scala.collection.AbstractTraversable.reduce(Traversable.scala:104)
>     at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107)
>     at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71)
>     at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223)
>     at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219)
>     at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112)
>     at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111)
>     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at scala.collection.immutable.List.foreach(List.scala:381)
>     at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>     at scala.collection.immutable.List.flatMap(List.scala:344)
>     at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111)
>     at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>     at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307)
>     at org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99)
>     at org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303)
>     at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>     at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
>     at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>
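The {{empty.reduceLeft}} failure has a direct analogue in plain Python: reducing an empty sequence without an initial value raises, while supplying an explicit identity does not. A minimal sketch (illustrative only; the actual fix belongs in InMemoryTableScanExec's partition-filter construction, where {{isin([])}} yields zero predicates to combine):

```python
from functools import reduce
import operator

# isin([]) produces no per-partition predicates, so the list to combine is empty.
filters = []

# reduceLeft-style combine with no seed fails on an empty list, like
# Scala's empty.reduceLeft in the stack trace above.
try:
    combined = reduce(operator.and_, filters)
except TypeError:
    combined = None  # "reduce() of empty iterable with no initial value"

# Seeding the fold with an identity element sidesteps the error entirely.
safe = reduce(operator.and_, filters, True)
assert safe is True
```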
[jira] [Commented] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208027#comment-16208027 ] Ruslan Dautkhanov commented on SPARK-21213: --- Would the partition-level stats be compatible with Hive/Impala partition-level stats? Thanks. > Support collecting partition-level statistics: rowCount and sizeInBytes > --- > > Key: SPARK-21213 > URL: https://issues.apache.org/jira/browse/SPARK-21213 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maria >Assignee: Maria > Fix For: 2.3.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > Support ANALYZE TABLE table PARTITION (key=value,...) COMPUTE STATISTICS > [NOSCAN] SQL command to compute and store in Hive Metastore number of rows > and total size in bytes for individual partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one
[ https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208020#comment-16208020 ] Albert Meltzer commented on SPARK-22283: [~cjm] thank you for finding the new implementation, hopefully it addresses this problem; also, it would have been nice to expose {{withColumns(columns: Map[String, Column])}} as public and with a different signature. > withColumn should replace multiple instances with a single one > -- > > Key: SPARK-22283 > URL: https://issues.apache.org/jira/browse/SPARK-22283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Albert Meltzer > > Currently, {{withColumn}} claims to do the following: _"adding a column or > replacing the existing column that has the same name."_ > Unfortunately, if multiple existing columns have the same name (which is a > normal occurrence after a join), this results in multiple replaced -- and > retained -- > columns (with the same value), and messages about an ambiguous column. 
> The current implementation of {{withColumn}} contains this: > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val shouldReplace = output.exists(f => resolver(f.name, colName)) > if (shouldReplace) { > val columns = output.map { field => > if (resolver(field.name, colName)) { > col.as(colName) > } else { > Column(field) > } > } > select(columns : _*) > } else { > select(Column("*"), col.as(colName)) > } > } > {noformat} > Instead, suggest something like this (which replaces all matching fields with > a single instance of the new one): > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val existing = output.filterNot(f => resolver(f.name, colName)).map(new > Column(_)) > select(existing :+ col.as(colName): _*) > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
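The proposed replacement logic quoted above can be sketched in plain Python, modeling the DataFrame output as a list of (name, value) pairs (a hypothetical stand-in, not Spark API): drop every column whose name matches, then append the new column exactly once.

```python
# Sketch of the suggested withColumn fix: all matching columns are replaced
# by a single instance of the new column, instead of one copy per match.
def with_column(columns, col_name, new_col):
    """columns: list of (name, value) pairs standing in for the plan output."""
    existing = [(n, v) for n, v in columns if n != col_name]
    return existing + [(col_name, new_col)]

# After a join, 'c' appears twice; withColumn should leave exactly one 'c'.
joined = [("a", 1), ("c", 2), ("c", 3)]
result = with_column(joined, "c", 99)
assert [n for n, _ in result].count("c") == 1
assert result == [("a", 1), ("c", 99)]
```

The same shape appears in the suggested Scala: `filterNot` plays the role of the list comprehension, and `:+ col.as(colName)` appends the single replacement.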
[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one
[ https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208011#comment-16208011 ] Albert Meltzer commented on SPARK-22283: [~viirya] I'm not doing select, I'm trying to replace the column with another value. The use case is rather simple: {{a.join(b, ..., "left").withColumn("c", coalesce(a("c"), b("c")).select(..., "c")}}. So I'm explicitly differentiating between the two sides, but then I can't select the presumably unique column in the result since it's now there multiple times. > withColumn should replace multiple instances with a single one > -- > > Key: SPARK-22283 > URL: https://issues.apache.org/jira/browse/SPARK-22283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Albert Meltzer > > Currently, {{withColumn}} claims to do the following: _"adding a column or > replacing the existing column that has the same name."_ > Unfortunately, if multiple existing columns have the same name (which is a > normal occurrence after a join), this results in multiple replaced -- and > retained -- > columns (with the same value), and messages about an ambiguous column. 
> The current implementation of {{withColumn}} contains this: > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val shouldReplace = output.exists(f => resolver(f.name, colName)) > if (shouldReplace) { > val columns = output.map { field => > if (resolver(field.name, colName)) { > col.as(colName) > } else { > Column(field) > } > } > select(columns : _*) > } else { > select(Column("*"), col.as(colName)) > } > } > {noformat} > Instead, suggest something like this (which replaces all matching fields with > a single instance of the new one): > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val existing = output.filterNot(f => resolver(f.name, colName)).map(new > Column(_)) > select(existing :+ col.as(colName): _*) > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21459) Some aggregation functions change the case of nested field names
[ https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Allsopp updated SPARK-21459: -- Description: When working with DataFrames with nested schemas, the behavior of the aggregation functions is inconsistent with respect to preserving the case of the nested field names. For example, {{first()}} preserves the case of the field names, but {{collect_set()}} and {{collect_list()}} force the field names to lowercase. Expected behavior: Field name case is preserved (or is at least consistent and documented) Spark-shell session to reproduce: *Update*: After trying different versions, I discovered that this problem occurs in the version of Spark 1.6.0 shipped with Cloudera CDH, not plain Spark. The plain Spark 1.6.0 does not support structs in aggregation operations such as {{collect_set}} at all. {code:java} case class Inner(Key:String, Value:String) case class Outer(ID:Long, Pairs:Array[Inner]) val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar") val df = sqlContext.createDataFrame(rdd) scala> df ... = [ID: bigint, Pairs: array>] scala>df.groupBy("ID").agg(first("Pairs")) ... = [ID: bigint, first(Pairs)(): array>] // Note that Key and Value preserve their original case scala>df.groupBy("ID").agg(collect_set("Pairs")) ... = [ID: bigint, collect_set(Pairs): array>] // Note that key and value are now lowercased {code} Additionally, the column name (generated during aggregation) is inconsistent: {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses in the first name. was: When working with DataFrames with nested schemas, the behavior of the aggregation functions is inconsistent with respect to preserving the case of the nested field names. For example, {{first()}} preserves the case of the field names, but {{collect_set()}} and {{collect_list()}} force the field names to lowercase. 
Expected behavior: Field name case is preserved (or is at least consistent and documented) Spark-shell session to reproduce: {code:java} case class Inner(Key:String, Value:String) case class Outer(ID:Long, Pairs:Array[Inner]) val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar") val df = sqlContext.createDataFrame(rdd) scala> df ... = [ID: bigint, Pairs: array>] scala>df.groupBy("ID").agg(first("Pairs")) ... = [ID: bigint, first(Pairs)(): array>] // Note that Key and Value preserve their original case scala>df.groupBy("ID").agg(collect_set("Pairs")) ... = [ID: bigint, collect_set(Pairs): array>] // Note that key and value are now lowercased {code} Additionally, the column name (generated during aggregation) is inconsistent: {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses in the first name. > Some aggregation functions change the case of nested field names > > > Key: SPARK-21459 > URL: https://issues.apache.org/jira/browse/SPARK-21459 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: David Allsopp >Priority: Minor > > When working with DataFrames with nested schemas, the behavior of the > aggregation functions is inconsistent with respect to preserving the case of > the nested field names. > For example, {{first()}} preserves the case of the field names, but > {{collect_set()}} and {{collect_list()}} force the field names to lowercase. > Expected behavior: Field name case is preserved (or is at least consistent > and documented) > Spark-shell session to reproduce: > *Update*: After trying different versions, I discovered that this problem > occurs in the version of Spark 1.6.0 shipped with Cloudera CDH, not plain > Spark. > The plain Spark 1.6.0 does not support structs in aggregation operations such > as {{collect_set}} at all. 
> {code:java} > case class Inner(Key:String, Value:String) > case class Outer(ID:Long, Pairs:Array[Inner]) > val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar"))))) > val df = sqlContext.createDataFrame(rdd) > scala> df > ... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>] > scala> df.groupBy("ID").agg(first("Pairs")) > ... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>] > // Note that Key and Value preserve their original case > scala> df.groupBy("ID").agg(collect_set("Pairs")) > ... = [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>] > // Note that key and value are now lowercased > {code} > Additionally, the column name (generated during aggregation) is inconsistent: > {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses > in the first name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: i
[jira] [Comment Edited] (SPARK-21459) Some aggregation functions change the case of nested field names
[ https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207270#comment-16207270 ] David Allsopp edited comment on SPARK-21459 at 10/17/17 4:54 PM: - Just trying to see when this problem was resolved: * *Update*: It is present in the _Cloudera_ distribution 1.6.0 CDH 5.8+, not plain 1.6.0 * In 1.6.0 and 1.6.3 (currently the latest 1.6.x version), the {{collect_set}} aggregation operation fails with an {{org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectSet because: Only primitive type arguments are accepted but array<struct<Key:string,Value:string>> was passed as parameter 1..;}} * In 2.0.2 and 2.2.0 it works as expected was (Author: dallsoppuk): Just trying to see when this problem was resolved: * It is present in 1.6.0, as originally reported * In 1.6.3 (currently the latest 1.6.x version), the {{collect_set}} aggregation operation fails with an {{org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectSet because: Only primitive type arguments are accepted but array<struct<Key:string,Value:string>> was passed as parameter 1..;}} * In 2.0.2 and 2.2.0 the problem has gone. > Some aggregation functions change the case of nested field names > > > Key: SPARK-21459 > URL: https://issues.apache.org/jira/browse/SPARK-21459 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: David Allsopp >Priority: Minor > > When working with DataFrames with nested schemas, the behavior of the > aggregation functions is inconsistent with respect to preserving the case of > the nested field names. > For example, {{first()}} preserves the case of the field names, but > {{collect_set()}} and {{collect_list()}} force the field names to lowercase. 
> Expected behavior: Field name case is preserved (or is at least consistent > and documented) > Spark-shell session to reproduce: > {code:java} > case class Inner(Key:String, Value:String) > case class Outer(ID:Long, Pairs:Array[Inner]) > val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar"))))) > val df = sqlContext.createDataFrame(rdd) > scala> df > ... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>] > scala> df.groupBy("ID").agg(first("Pairs")) > ... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>] > // Note that Key and Value preserve their original case > scala> df.groupBy("ID").agg(collect_set("Pairs")) > ... = [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>] > // Note that key and value are now lowercased > {code} > Additionally, the column name (generated during aggregation) is inconsistent: > {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses > in the first name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
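The expected behavior described above (case-insensitive resolution that still echoes back the schema's original casing) can be illustrated with a small pure-Python sketch. This is not Spark's implementation; `resolve_fields` is a hypothetical helper standing in for the analyzer's name resolution over the {{Inner(Key, Value)}} struct fields.

```python
# Hypothetical sketch: resolve nested field names case-insensitively
# (as Spark's default resolver does) while preserving the schema's
# original casing in the output, instead of forcing lowercase as
# collect_set()/collect_list() reportedly do in the affected build.
def resolve_fields(schema_fields, requested):
    """Map requested names to the schema's original-cased names."""
    by_lower = {name.lower(): name for name in schema_fields}
    return [by_lower[name.lower()] for name in requested]

fields = ["Key", "Value"]  # original casing from case class Inner
print(resolve_fields(fields, ["key", "VALUE"]))  # ['Key', 'Value']
```

Under this behavior, both {{first}} and {{collect_set}} would report `array<struct<Key:string,Value:string>>`, making the aggregations consistent.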
[jira] [Updated] (SPARK-22181) ReplaceExceptWithFilter if one or both of the datasets are fully derived out of Filters from a same parent
[ https://issues.apache.org/jira/browse/SPARK-22181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sathiya Kumar updated SPARK-22181: -- Description: While applying Except operator between two datasets, if one or both of the datasets are purely transformed using filter operations, then instead of rewriting the Except operator using expensive join operation, we can rewrite it using filter operation by flipping the filter condition of the right node. Example: {code:sql} SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5 ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND (a1 is null OR a1 <> 5) {code} For more details please refer: [this post|https://github.com/sathiyapk/Blog-Posts/blob/master/SparkOptimizer.md] was: While applying Except operator between two datasets, if one or both of the datasets are purely transformed using filter operations, then instead of rewriting the Except operator using expensive join operation, we can rewrite it using filter operation by flipping the filter condition of the right node. 
Example: {code:sql} SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE a1 = 5 ==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND a1 <> 5 {code} For more details please refer: [this post|https://github.com/sathiyapk/Blog-Posts/blob/master/SparkOptimizer.md] > ReplaceExceptWithFilter if one or both of the datasets are fully derived out > of Filters from a same parent > -- > > Key: SPARK-22181 > URL: https://issues.apache.org/jira/browse/SPARK-22181 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Sathiya Kumar >Priority: Minor > > While applying Except operator between two datasets, if one or both of the > datasets are purely transformed using filter operations, then instead of > rewriting the Except operator using expensive join operation, we can rewrite > it using filter operation by flipping the filter condition of the right node. > Example: > {code:sql} >SELECT a1, a2 FROM Tab1 WHERE a2 = 12 EXCEPT SELECT a1, a2 FROM Tab1 WHERE > a1 = 5 >==> SELECT DISTINCT a1, a2 FROM Tab1 WHERE a2 = 12 AND (a1 is null OR a1 > <> 5) > {code} > For more details please refer: [this > post|https://github.com/sathiyapk/Blog-Posts/blob/master/SparkOptimizer.md] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
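The difference between the old and updated rewrites above comes down to SQL's three-valued logic: `a1 <> 5` evaluates to NULL (not true) when `a1` is NULL, so a plain flipped filter would incorrectly drop NULL rows that EXCEPT should keep. A minimal pure-Python simulation (not the actual optimizer rule, just an illustration of the semantics):

```python
# Simulate SQL three-valued logic to show why the flipped filter
# needs "a1 IS NULL OR a1 <> 5" rather than just "a1 <> 5".
def sql_neq(a, b):
    """SQL '<>': unknown (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a != b

rows = [(12, 1), (12, None), (12, 5)]  # (a2, a1); all satisfy a2 = 12

# Naive flip: WHERE a1 <> 5 -- a WHERE clause keeps only strictly-true
# predicates, so the NULL row is dropped, which EXCEPT would not do.
naive = [r for r in rows if sql_neq(r[1], 5) is True]

# Null-safe flip: WHERE a1 IS NULL OR a1 <> 5 -- keeps the NULL row.
null_safe = [r for r in rows if r[1] is None or sql_neq(r[1], 5) is True]

print(naive)      # [(12, 1)]
print(null_safe)  # [(12, 1), (12, None)]
```

This is why the updated description adds the `(a1 is null OR a1 <> 5)` guard that the original version lacked.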
[jira] [Commented] (SPARK-22250) Be less restrictive on type checking
[ https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207798#comment-16207798 ] Fernando Pereira commented on SPARK-22250: -- I did some tests and even though verifySchema=False might help in some cases it is still not enough for the case of handling numpy arrays. Using arrays from the array module works nicely (even without disabling verifySchema). I think it is because elements of array.array are automatically converted to their corresponding Python type. So I think the problem mentioned involves two issues: # Accept ints for float fields. Apparently it is just a matter of a schema verification issue, so _acceptable_types should be updated # Accept numpy arrays for an ArrayField. In this case, with verifySchema=False there's still an Exception: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct) I believe numpy array support, even in this simple case, would be extremely valuable for a lot of people. In our case we work with large hdf5 files where the data interface is numpy. > Be less restrictive on type checking > > > Key: SPARK-22250 > URL: https://issues.apache.org/jira/browse/SPARK-22250 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Fernando Pereira >Priority: Minor > > I find types.py._verify_type() often too restrictive. E.g. > {code} > TypeError: FloatType can not accept object 0 in type > {code} > I believe it would be globally acceptable to fill a float field with an int, > especially since in some formats (json) you don't have a way of inferring the > type correctly. > Another situation relates to other equivalent numerical types, like > array.array or numpy. A numpy scalar int is not accepted as an int, and these > arrays have always to be converted down to plain lists, which can be > prohibitively large and computationally expensive. 
> Any thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
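The first relaxation requested above (accept ints for float fields) can be sketched without Spark at all. This is a hypothetical stand-in for PySpark's field verifier, not the real `types.py._verify_type`; the `.item()` branch illustrates how numpy scalars could be widened, since numpy scalar types expose `.item()` to produce a plain Python number.

```python
# Hypothetical relaxed verifier for a FloatType field: widen ints
# (and duck-typed numpy scalars via .item()) instead of rejecting them.
def verify_float(value):
    """Return value as a float if numerically acceptable, else raise."""
    if isinstance(value, float):
        return value
    if isinstance(value, int) and not isinstance(value, bool):
        return float(value)  # proposed relaxation: int -> float
    item = getattr(value, "item", None)  # numpy scalars define .item()
    if callable(item):
        return verify_float(item())
    raise TypeError("FloatType can not accept object %r in type %s"
                    % (value, type(value)))

print(verify_float(0))    # 0.0 -- the case the report says is rejected today
print(verify_float(1.5))  # 1.5
```

The second issue (whole numpy arrays for ArrayType fields) is harder, since it surfaces in the Java-side pickling (the `net.razorvine.pickle.PickleException` quoted above), not in schema verification.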
[jira] [Commented] (SPARK-22283) withColumn should replace multiple instances with a single one
[ https://issues.apache.org/jira/browse/SPARK-22283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207780#comment-16207780 ] Liang-Chi Hsieh commented on SPARK-22283: - When a join result has duplicate column names, you can't select any of the ambiguous columns by name alone. Doesn't {{withColumn}}'s current behavior simply follow that? > withColumn should replace multiple instances with a single one > -- > > Key: SPARK-22283 > URL: https://issues.apache.org/jira/browse/SPARK-22283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Albert Meltzer > > Currently, {{withColumn}} claims to do the following: _"adding a column or > replacing the existing column that has the same name."_ > Unfortunately, if multiple existing columns have the same name (which is a > normal occurrence after a join), this results in multiple replaced -- and > retained -- > columns (with the same value), and messages about an ambiguous column. 
> The current implementation of {{withColumn}} contains this: > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val shouldReplace = output.exists(f => resolver(f.name, colName)) > if (shouldReplace) { > val columns = output.map { field => > if (resolver(field.name, colName)) { > col.as(colName) > } else { > Column(field) > } > } > select(columns : _*) > } else { > select(Column("*"), col.as(colName)) > } > } > {noformat} > Instead, suggest something like this (which replaces all matching fields with > a single instance of the new one): > {noformat} > def withColumn(colName: String, col: Column): DataFrame = { > val resolver = sparkSession.sessionState.analyzer.resolver > val output = queryExecution.analyzed.output > val existing = output.filterNot(f => resolver(f.name, colName)).map(new > Column(_)) > select(existing :+ col.as(colName): _*) > } > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
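The suggested fix above can be illustrated with a small pure-Python model of a schema as a list of `(name, expression)` pairs. This is only a sketch of the dedup logic, not Spark's implementation; `resolver` stands in for the analyzer's (case-insensitive by default) name resolver, and the column expressions are placeholder strings.

```python
# Sketch of the proposed withColumn fix: drop ALL existing columns whose
# name matches per the resolver, then append exactly one new column.
def with_column(columns, col_name, new_col,
                resolver=lambda a, b: a.lower() == b.lower()):
    existing = [(n, c) for n, c in columns if not resolver(n, col_name)]
    return existing + [(col_name, new_col)]

# After a join, two columns may share a name ("key" from each side);
# the current behavior would replace -- and retain -- both copies,
# while this version keeps a single instance.
joined = [("id", "left.id"), ("key", "left.key"), ("key", "right.key")]
result = with_column(joined, "key", "coalesce(left.key, right.key)")
print(result)
# [('id', 'left.id'), ('key', 'coalesce(left.key, right.key)')]
```

The key difference from the current Scala implementation quoted above is that matching columns are filtered out once rather than each being mapped to its own copy of the replacement.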
[jira] [Assigned] (SPARK-22062) BlockManager does not account for memory consumed by remote fetches
[ https://issues.apache.org/jira/browse/SPARK-22062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22062: --- Assignee: Saisai Shao > BlockManager does not account for memory consumed by remote fetches > --- > > Key: SPARK-22062 > URL: https://issues.apache.org/jira/browse/SPARK-22062 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.2.0 >Reporter: Sergei Lebedev >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.3.0 > > > We use Spark exclusively with {{StorageLevel.DiskOnly}} as our workloads are > very sensitive to memory usage. Recently, we've spotted that the jobs > sometimes OOM leaving lots of byte[] arrays on the heap. Upon further > investigation, we've found that the arrays come from > {{BlockManager.getRemoteBytes}}, which > [calls|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L638] > {{BlockTransferService.fetchBlockSync}}, which in its turn would > [allocate|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/BlockTransferService.scala#L99] > an on-heap {{ByteBuffer}} of the same size as the block (e.g. full > partition), if the block was successfully retrieved over the network. > This memory is not accounted towards Spark storage/execution memory and could > potentially lead to OOM if {{BlockManager}} fetches too many partitions in > parallel. I wonder if this is intentional behaviour, or in fact a bug? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22062) BlockManager does not account for memory consumed by remote fetches
[ https://issues.apache.org/jira/browse/SPARK-22062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22062. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19476 [https://github.com/apache/spark/pull/19476] > BlockManager does not account for memory consumed by remote fetches > --- > > Key: SPARK-22062 > URL: https://issues.apache.org/jira/browse/SPARK-22062 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 2.2.0 >Reporter: Sergei Lebedev >Priority: Minor > Fix For: 2.3.0 > > > We use Spark exclusively with {{StorageLevel.DiskOnly}} as our workloads are > very sensitive to memory usage. Recently, we've spotted that the jobs > sometimes OOM leaving lots of byte[] arrays on the heap. Upon further > investigation, we've found that the arrays come from > {{BlockManager.getRemoteBytes}}, which > [calls|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L638] > {{BlockTransferService.fetchBlockSync}}, which in its turn would > [allocate|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/BlockTransferService.scala#L99] > an on-heap {{ByteBuffer}} of the same size as the block (e.g. full > partition), if the block was successfully retrieved over the network. > This memory is not accounted towards Spark storage/execution memory and could > potentially lead to OOM if {{BlockManager}} fetches too many partitions in > parallel. I wonder if this is intentional behaviour, or in fact a bug? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19606) Support constraints in spark-dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207722#comment-16207722 ] Pascal GILLET edited comment on SPARK-19606 at 10/17/17 2:39 PM: - * _If "spark.mesos.constraints" is passed with the job then it will wind up overriding the value specified in the "driverDefault" property._: *False*. "spark.mesos.constraints" still applies for executors only, while the "driverDefault" will apply for the driver. * _If "spark.mesos.constraints" is not passed with the job, then the value specified in the "driverDefault" property will get passed to the executors - which we definitely don't want._: *True* OK then to add the "spark.mesos.constraints.driver" property. was (Author: pgillet): * _If "spark.mesos.constraints" is passed with the job then it will wind up overriding the value specified in the "driverDefault" property._: False. "spark.mesos.constraints" still applies for executors only, while the "driverDefault" will apply for the driver. * _If "spark.mesos.constraints" is not passed with the job, then the value specified in the "driverDefault" property will get passed to the executors - which we definitely don't want._: True OK then to add the "spark.mesos.constraints.driver" property. > Support constraints in spark-dispatcher > --- > > Key: SPARK-19606 > URL: https://issues.apache.org/jira/browse/SPARK-19606 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Philipp Hoffmann > > The `spark.mesos.constraints` configuration is ignored by the > spark-dispatcher. The constraints need to be passed in the Framework > information when registering with Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19606) Support constraints in spark-dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207722#comment-16207722 ] Pascal GILLET edited comment on SPARK-19606 at 10/17/17 2:38 PM: - * _If "spark.mesos.constraints" is passed with the job then it will wind up overriding the value specified in the "driverDefault" property._: False. "spark.mesos.constraints" still applies for executors only, while the "driverDefault" will apply for the driver. * _If "spark.mesos.constraints" is not passed with the job, then the value specified in the "driverDefault" property will get passed to the executors - which we definitely don't want._: True OK then to add the "spark.mesos.constraints.driver" property. was (Author: pgillet): * _If "spark.mesos.constraints" is passed with the job then it will wind up overriding the value specified in the "driverDefault" property. _: False. "spark.mesos.constraints" still applies for executors only, while the "driverDefault" will apply for the driver. * _If "spark.mesos.constraints" is not passed with the job, then the value specified in the "driverDefault" property will get passed to the executors - which we definitely don't want._: True OK then to add the "spark.mesos.constraints.driver" property. > Support constraints in spark-dispatcher > --- > > Key: SPARK-19606 > URL: https://issues.apache.org/jira/browse/SPARK-19606 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Philipp Hoffmann > > The `spark.mesos.constraints` configuration is ignored by the > spark-dispatcher. The constraints need to be passed in the Framework > information when registering with Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19606) Support constraints in spark-dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207722#comment-16207722 ] Pascal GILLET commented on SPARK-19606: --- * _If "spark.mesos.constraints" is passed with the job then it will wind up overriding the value specified in the "driverDefault" property. _: False. "spark.mesos.constraints" still applies for executors only, while the "driverDefault" will apply for the driver. * _If "spark.mesos.constraints" is not passed with the job, then the value specified in the "driverDefault" property will get passed to the executors - which we definitely don't want._: True OK then to add the "spark.mesos.constraints.driver" property. > Support constraints in spark-dispatcher > --- > > Key: SPARK-19606 > URL: https://issues.apache.org/jira/browse/SPARK-19606 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Philipp Hoffmann > > The `spark.mesos.constraints` configuration is ignored by the > spark-dispatcher. The constraints need to be passed in the Framework > information when registering with Mesos. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20396) groupBy().apply() with pandas udf in pyspark
[ https://issues.apache.org/jira/browse/SPARK-20396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207711#comment-16207711 ] Apache Spark commented on SPARK-20396: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/19517 > groupBy().apply() with pandas udf in pyspark > > > Key: SPARK-20396 > URL: https://issues.apache.org/jira/browse/SPARK-20396 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Li Jin >Assignee: Li Jin > Fix For: 2.3.0 > > > split-apply-merge is a common pattern when analyzing data. It is implemented > in many popular data analyzing libraries such as Spark, Pandas, R, etc. > Split and merge operations in these libraries are similar to each other, > mostly implemented by certain grouping operators. For instance, Spark > DataFrame has groupBy, Pandas DataFrame has groupby. Therefore, for users > familiar with either Spark DataFrame or pandas DataFrame, it is not difficult > for them to understand how grouping works in the other library. However, > apply is more native to different libraries and therefore, quite different > between libraries. A pandas user who knows how to use apply to do a certain > transformation in pandas might not know how to do the same using pyspark. > Also, the current implementation of passing data from the java executor to > python executor is not efficient, there is opportunity to speed it up using > Apache Arrow. This feature can enable use cases that use Spark's grouping > operators such as groupBy, rollUp, cube, window and Pandas's native apply > operator. > Related work: > SPARK-13534 > This enables faster data serialization between Pyspark and Pandas using > Apache Arrow. Our work will be on top of this and use the same serialization > for pandas udf. 
> SPARK-12919 and SPARK-12922 > These implemented two functions: dapply and gapply in Spark R which > implements the similar split-apply-merge pattern that we want to implement > with Pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22288) Tricky interaction between closure-serialization and inheritance results in confusing failure
[ https://issues.apache.org/jira/browse/SPARK-22288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207706#comment-16207706 ] Ryan Williams commented on SPARK-22288: --- Makes sense, fine with me to "Won't Fix". bq. You can always use a different serializer like kryo NB: this is closure-serialization, where [only Java has ever worked afaik|https://issues.apache.org/jira/browse/SPARK-12414] > Tricky interaction between closure-serialization and inheritance results in > confusing failure > - > > Key: SPARK-22288 > URL: https://issues.apache.org/jira/browse/SPARK-22288 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > Documenting this since I've run into it a few times; [full repro / discussion > here|https://github.com/ryan-williams/spark-bugs/tree/serde]. > Given 3 possible super-classes: > {code} > class Super1(n: Int) > class Super2(n: Int) extends Serializable > class Super3 > {code} > A subclass that passes a closure to an RDD operation (e.g. {{map}} or > {{filter}}), where the closure references one of the subclass's fields, will > throw an {{java.io.InvalidClassException: …; no valid constructor}} exception > when the subclass extends {{Super1}} but not {{Super2}} or {{Super3}}. > Referencing method-local variables (instead of fields) is fine in all cases: > {code} > class App extends Super1(4) with Serializable { > val s = "abc" > def run(): Unit = { > val sc = new SparkContext(new SparkConf().set("spark.master", > "local[4]").set("spark.app.name", "serde-test")) > try { > sc > .parallelize(1 to 10) > .filter(Main.fn(_, s)) // danger! 
closure references `s`, crash > ensues > .collect() // driver stack-trace points here > } finally { > sc.stop() > } > } > } > object App { > def main(args: Array[String]): Unit = { new App().run() } > def fn(i: Int, s: String): Boolean = i % 2 == 0 > } > {code} > The task-failure stack trace looks like: > {code} > java.io.InvalidClassException: com.MyClass; no valid constructor > at > java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150) > at > java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:790) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1782) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353) > {code} > and a driver stack-trace will point to the first line that initiates a Spark > job that exercises the closure/RDD-operation in question. > Not sure how much this should be considered a problem with Spark's > closure-serialization logic vs. Java serialization, but maybe if the former > gets looked at or improved (e.g. with 2.12 support), this kind of interaction > can be improved upon. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207618#comment-16207618 ] Apache Spark commented on SPARK-22277: -- User 'mpjlu' has created a pull request for this issue: https://github.com/apache/spark/pull/19516 > Chi Square selector garbling Vector content. > > > Key: SPARK-22277 > URL: https://issues.apache.org/jira/browse/SPARK-22277 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > There is a difference in behavior when the Chisquare selector is used vs > direct feature use in the decision tree classifier. > In the below code, I have used the chisquare selector as a pass-through, but the > decision tree classifier is unable to process its output. It is able to process > the features when they are used directly. > The example is pulled out directly from the Apache Spark Python documentation. > Kindly help. > {code:python} > from pyspark.ml.feature import ChiSqSelector > from pyspark.ml.linalg import Vectors > import sys > df = spark.createDataFrame([ > (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), > (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), > (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", > "clicked"]) > # ChiSq selector will just be a pass-through. All four features in the i/p > will be in output also. > selector = ChiSqSelector(numTopFeatures=4, featuresCol="features", > outputCol="selectedFeatures", labelCol="clicked") > result = selector.fit(df).transform(df) > print("ChiSqSelector output with top %d features selected" % > selector.getNumTopFeatures()) > from pyspark.ml.classification import DecisionTreeClassifier > try: > # Fails > dt = > DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures") > model = dt.fit(result) > except: > print(sys.exc_info()) > #Works > dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features") > model = dt.fit(df) > > # Make predictions. 
Using same dataset, not splitting!! > predictions = model.transform(result) > # Select example rows to display. > predictions.select("prediction", "clicked", "features").show(5) > # Select (prediction, true label) and compute test error > evaluator = MulticlassClassificationEvaluator( > labelCol="clicked", predictionCol="prediction", metricName="accuracy") > accuracy = evaluator.evaluate(predictions) > print("Test Error = %g " % (1.0 - accuracy)) > {code} > Output: > ChiSqSelector output with top 4 features selected > (, > IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but > it does not have the number of values specified.', > 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t > at > org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t > at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at > org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t > at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at > org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at > sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t > at java.lang.reflect.Method.invoke(Method.java:498)\n\t at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at > py4j.Gateway.invoke(Gateway.java:280)\n\t at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at > py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at > py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at > java.lang.Thread.run(Thread.java:745)'), ) > +--+---+--+ > |prediction|clicked| features| > +--+---+--+ > | 1.0|1.0|[0.0,0.0,18.0,1.0]| > | 0.0|0.0|[0.0,1.0,12.0,0.0]| > |
[jira] [Assigned] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22277: Assignee: (was: Apache Spark) > Chi Square selector garbling Vector content. > > > Key: SPARK-22277 > URL: https://issues.apache.org/jira/browse/SPARK-22277 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > There is a difference in behavior when the Chisquare selector is used vs > direct feature use in the decision tree classifier. > In the below code, I have used the chisquare selector as a pass-through, but the > decision tree classifier is unable to process its output. It is able to process > the features when they are used directly. > The example is pulled out directly from the Apache Spark Python documentation. > Kindly help. > {code:python} > from pyspark.ml.feature import ChiSqSelector > from pyspark.ml.linalg import Vectors > import sys > df = spark.createDataFrame([ > (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), > (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), > (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", > "clicked"]) > # ChiSq selector will just be a pass-through. All four features in the i/p > will be in output also. > selector = ChiSqSelector(numTopFeatures=4, featuresCol="features", > outputCol="selectedFeatures", labelCol="clicked") > result = selector.fit(df).transform(df) > print("ChiSqSelector output with top %d features selected" % > selector.getNumTopFeatures()) > from pyspark.ml.classification import DecisionTreeClassifier > try: > # Fails > dt = > DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures") > model = dt.fit(result) > except: > print(sys.exc_info()) > #Works > dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features") > model = dt.fit(df) > > # Make predictions. Using same dataset, not splitting!! > predictions = model.transform(result) > # Select example rows to display. 
> predictions.select("prediction", "clicked", "features").show(5)
>
> # Select (prediction, true label) and compute test error.
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (, IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but it does not have the number of values specified.', 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t at org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t at java.lang.reflect.Method.invoke(Method.java:498)\n\t at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at py4j.Gateway.invoke(Gateway.java:280)\n\t at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at java.lang.Thread.run(Thread.java:745)'), )
> +----------+-------+------------------+
> |prediction|clicked|          features|
> +----------+-------+------------------+
> |       1.0|    1.0|[0.0,0.0,18.0,1.0]|
> |       0.0|    0.0|[0.0,1.0,12.0,0.0]|
> |       0.0|    0.0|[1.0,0.0,15.0,0.1]|
> +----------+-------+------------------+
> Test Error = 0
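The {{IllegalArgumentException}} above comes from {{MetadataUtils.getCategoricalFeatures}}: the ChiSqSelector output column carries ML attributes that mark a feature as nominal without recording how many distinct values it has, and the decision tree rejects such metadata. A minimal pure-Python model of that validation; the attribute-dict layout here is hypothetical, not Spark's actual metadata format:

```python
def get_categorical_features(attributes):
    """Model of MetadataUtils.getCategoricalFeatures: map feature index to
    category count, rejecting nominal features with no declared cardinality."""
    categorical = {}
    for idx, attr in enumerate(attributes):
        kind = attr.get("type")
        if kind == "numeric" or kind is None:
            continue  # continuous feature: nothing to record
        if kind == "nominal":
            if "num_values" not in attr:
                raise ValueError(
                    "Feature %d is marked as Nominal (categorical), but it "
                    "does not have the number of values specified." % idx)
            categorical[idx] = attr["num_values"]
    return categorical

# A well-formed schema passes...
print(get_categorical_features(
    [{"type": "numeric"}, {"type": "nominal", "num_values": 3}]))  # {1: 3}

# ...while a nominal feature with no cardinality reproduces the failure mode.
try:
    get_categorical_features([{"type": "nominal"}])
except ValueError as e:
    print(e)
```

This is why the same features work when used directly: the raw "features" column carries no nominal attributes, so every feature is treated as continuous.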
[jira] [Assigned] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22277: Assignee: Apache Spark

> Chi Square selector garbling Vector content.
> --------------------------------------------
>
> Key: SPARK-22277
> URL: https://issues.apache.org/jira/browse/SPARK-22277
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.1.1
> Reporter: Cheburakshu
> Assignee: Apache Spark
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207573#comment-16207573 ] Steve Loughran commented on SPARK-2984: --- bq. multiple batches writing to same location simultaneously
Hadoop {{FileOutputCommitter}} cleans up $dest/_temporary while committing or aborting a job. If you are writing >1 job to the same directory tree simultaneously, expect the job cleanup in one task to break the others. You could try overloading {{ParquetOutputCommitter.cleanupJob()}} to stop this, but it's probably safer to work out why output to the same path is happening in parallel and stop it.

> FileNotFoundException on _temporary directory
> ---------------------------------------------
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Andrew Ash
> Assignee: Josh Rosen
> Priority: Critical
> Fix For: 1.3.0
>
> We've seen several stacktraces and threads on the user mailing list where people are having issues with a {{FileNotFoundException}} stemming from an HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}. I think the error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1, so it moves its data from {{_temporary}} to the final destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but those files no longer exist! Exception.
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0
> java.io.FileNotFoundException: File hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 does not exist.
> at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not exist.
> {noformat}
> and
> {noformat}
> or
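The race described above is easy to reproduce locally without HDFS. The sketch below (the helper names are hypothetical, not Spark or Hadoop APIs) imitates two speculative tasks sharing one {{_temporary}} directory: the first task commits, the job-level cleanup deletes {{_temporary}}, and the second task's commit then fails with the same class of "does not exist" error (Python's {{FileNotFoundError}}):

```python
import os
import shutil
import tempfile

def write_task_output(dest, task_id, filename):
    """Each task writes its output under dest/_temporary/<task_id>/."""
    tmp = os.path.join(dest, "_temporary", task_id)
    os.makedirs(tmp)
    with open(os.path.join(tmp, filename), "w") as f:
        f.write("data")

def commit_task(dest, task_id):
    """Move the task's files from _temporary to the final destination."""
    tmp = os.path.join(dest, "_temporary", task_id)
    for name in os.listdir(tmp):  # raises FileNotFoundError once cleaned up
        shutil.move(os.path.join(tmp, name), os.path.join(dest, name))

def cleanup_job(dest):
    """Job-level cleanup, as FileOutputCommitter does on commit/abort."""
    shutil.rmtree(os.path.join(dest, "_temporary"))

dest = tempfile.mkdtemp()
write_task_output(dest, "task_a", "part-00000")
write_task_output(dest, "task_b", "part-00001")

commit_task(dest, "task_a")      # task T finishes first...
cleanup_job(dest)                # ...and its job deletes _temporary
try:
    commit_task(dest, "task_b")  # speculative task T' commits too late
except FileNotFoundError as e:
    print("second commit failed:", e)
```

The same sequence plays out whether the second writer is a speculative task attempt or a whole second job committing into the same directory tree.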
[jira] [Assigned] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22287: Assignee: (was: Apache Spark) > SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher > - > > Key: SPARK-22287 > URL: https://issues.apache.org/jira/browse/SPARK-22287 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.1, 2.2.0, 2.3.0 >Reporter: paul mackles >Priority: Minor > > There does not appear to be a way to control the heap size used by > MesosClusterDispatcher as the SPARK_DAEMON_MEMORY environment variable is not > honored for that particular daemon. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
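For context, {{SPARK_DAEMON_MEMORY}} is the variable the daemon launch scripts use to size daemon heaps, with a documented default of 1g. A tiny pure-Python sketch of the lookup the dispatcher would be expected to honor; the helper name and flag list are illustrative, not Spark code:

```python
import os

def daemon_java_memory_opts(env=None, default="1g"):
    """Return JVM heap flags sized from SPARK_DAEMON_MEMORY, falling back
    to the documented 1g default when the variable is unset."""
    env = os.environ if env is None else env
    mem = env.get("SPARK_DAEMON_MEMORY", default)
    return ["-Xms" + mem, "-Xmx" + mem]

print(daemon_java_memory_opts({"SPARK_DAEMON_MEMORY": "2g"}))  # ['-Xms2g', '-Xmx2g']
print(daemon_java_memory_opts({}))                             # ['-Xms1g', '-Xmx1g']
```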
[jira] [Commented] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207541#comment-16207541 ] Apache Spark commented on SPARK-22287: -- User 'pmackles' has created a pull request for this issue: https://github.com/apache/spark/pull/19515

> SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
> ---------------------------------------------------------
>
> Key: SPARK-22287
> URL: https://issues.apache.org/jira/browse/SPARK-22287

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22287) SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-22287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22287: Assignee: Apache Spark

> SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher
> ---------------------------------------------------------
>
> Key: SPARK-22287
> URL: https://issues.apache.org/jira/browse/SPARK-22287
> Assignee: Apache Spark

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207540#comment-16207540 ] Apache Spark commented on SPARK-21551: -- User 'FRosner' has created a pull request for this issue: https://github.com/apache/spark/pull/19514

> pyspark's collect fails when getaddrinfo is too slow
> ----------------------------------------------------
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: peay
> Assignee: peay
> Priority: Critical
> Fix For: 2.3.0
>
> Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and {{DataFrame.collect}}, all work by starting an ephemeral server in the driver and having Python connect to it to download the data.
> All three are implemented along the lines of:
> {code}
> port = self._jdf.collectToPython()
> return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
> {code}
> The server has **a hardcoded timeout of 3 seconds** (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) -- i.e., the Python process has 3 seconds to connect to it from the very moment the driver server starts.
> In general, that seems fine, but I have been encountering frequent timeouts leading to `Exception: could not open socket`.
> After investigating a bit, it turns out that {{_load_from_socket}} makes a call to {{getaddrinfo}}:
> {code}
> def _load_from_socket(port, serializer):
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         .. connect ..
> {code} > I am not sure why, but while most such calls to {{getaddrinfo}} on my machine > only take a couple milliseconds, about 10% of them take between 2 and 10 > seconds, leading to about 10% of jobs failing. I don't think we can always > expect {{getaddrinfo}} to return instantly. More generally, Python may > sometimes pause for a couple seconds, which may not leave enough time for the > process to connect to the server. > Especially since the server timeout is hardcoded, I think it would be best to > set a rather generous value (15 seconds?) to avoid such situations. > A {{getaddrinfo}} specific fix could avoid doing it every single time, or do > it before starting up the driver server. > > cc SPARK-677 [~davies] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
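The {{getaddrinfo}}-specific mitigation the reporter suggests, avoiding a fresh lookup on every single collect, can be sketched in a few lines of pure Python (the function name is hypothetical, not PySpark's API):

```python
import socket

_addr_cache = {}

def resolve_localhost(port):
    """Resolve localhost once per port and reuse the result, so a slow
    getaddrinfo call cannot eat into the server's connect deadline on
    every subsequent collect."""
    if port not in _addr_cache:
        _addr_cache[port] = socket.getaddrinfo(
            "localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    return _addr_cache[port]

first = resolve_localhost(12345)
assert first is resolve_localhost(12345)  # second lookup is a cache hit
```

Note this only helps after the first call; the first lookup can still be slow, which is why the reporter also suggests resolving before the driver-side server starts, or simply raising the hardcoded 3-second timeout.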
[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207539#comment-16207539 ] Apache Spark commented on SPARK-21551: -- User 'FRosner' has created a pull request for this issue: https://github.com/apache/spark/pull/19513

> pyspark's collect fails when getaddrinfo is too slow
> ----------------------------------------------------
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207538#comment-16207538 ] Apache Spark commented on SPARK-21551: -- User 'FRosner' has created a pull request for this issue: https://github.com/apache/spark/pull/19512

> pyspark's collect fails when getaddrinfo is too slow
> ----------------------------------------------------
>
> Key: SPARK-21551
> URL: https://issues.apache.org/jira/browse/SPARK-21551

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207519#comment-16207519 ] Soumitra Sulav commented on SPARK-2984: --- I'm facing the same issue with Spark 2.0.2 on DC/OS with HDFS [Parquet files]. After a few hours of streaming, when the load increases and forces multiple batches to write to the same location simultaneously, this error is reproduced.

java.io.IOException: Failed to rename FileStatus{path=hdfs://namenode/xxx/xxx/parquet/snappy/customer_demographics/datetime=2017101710/_temporary/0/task_201710171034_0026_m_02/part-r-2-59deea14-b221-4a91-bcd3-78e2dbaf6b91.snappy.parquet; isDirectory=false; length=2614; replication=3; blocksize=134217728; modification_time=1508236440393; access_time=1508236440331; owner=root; group=hdfs; permission=rw-r--r--; isSymlink=false} to hdfs://namenode/xxx/xxx/parquet/snappy/customer_demographics//datetime=2017101710/part-r-2-59deea14-b221-4a91-bcd3-78e2dbaf6b91.snappy.parquet

Spark configuration used:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.sql.tungsten.enabled", "true")
.set("spark.speculation", "false")
.set("spark.sql.parquet.mergeSchema", "false")
.set("spark.mapreduce.fileoutputcommitter.algorithm.version", "2")

Config in the properties file at spark-submit time:
spark.streaming.concurrentJobs 4

> FileNotFoundException on _temporary directory
> ---------------------------------------------
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0
> Reporter: Andrew Ash
> Assignee: Josh Rosen
> Priority: Critical
> Fix For: 1.3.0
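Since the concurrent jobs here all land in the same hourly {{datetime=...}} partition, one way to keep them from ever sharing a {{_temporary}} directory is to give each batch its own sub-path, at the cost of more directories. A hypothetical helper, not part of any Spark API:

```python
from datetime import datetime

def batch_output_path(base, batch_time):
    """Partition by hour as before, but add a per-batch subdirectory so no
    two concurrent streaming jobs commit into the same directory tree."""
    return "%s/datetime=%s/batch=%s" % (
        base,
        batch_time.strftime("%Y%m%d%H"),
        batch_time.strftime("%Y%m%d%H%M%S"))

path = batch_output_path("hdfs://namenode/out", datetime(2017, 10, 17, 10, 34, 0))
print(path)  # hdfs://namenode/out/datetime=2017101710/batch=20171017103400
```

The trade-off is exactly the one discussed on this ticket: per-batch directories eliminate the shared {{_temporary}} tree, but readers then have to glob across many small directories (or a compaction job has to merge them).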
[jira] [Commented] (SPARK-22294) Reset spark.driver.bindAddress when starting a Checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207509#comment-16207509 ] Apache Spark commented on SPARK-22294: -- User 'ssaavedra' has created a pull request for this issue: https://github.com/apache/spark/pull/19427 > Reset spark.driver.bindAddress when starting a Checkpoint > - > > Key: SPARK-22294 > URL: https://issues.apache.org/jira/browse/SPARK-22294 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Santiago Saavedra > Labels: newbie > > On SPARK-4563 support for binding the driver to a different address than the > spark.driver.host was provided so that the driver could be running under a > differently-routed network. However, when the driver fails, the Checkpoint > restoring function expects that the {{spark.driver.bindAddress}} remains the > same even if the {{spark.driver.host}} variable may change. That limits the > capabilities of recovery under several cluster configurations, and we propose > that {{spark.driver.bindAddress}} should have the same replacement behaviour > as {{spark.driver.host}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22294) Reset spark.driver.bindAddress when starting a Checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22294: Assignee: (was: Apache Spark)

> Reset spark.driver.bindAddress when starting a Checkpoint
> ---------------------------------------------------------
>
> Key: SPARK-22294
> URL: https://issues.apache.org/jira/browse/SPARK-22294

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22294) Reset spark.driver.bindAddress when starting a Checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22294: Assignee: Apache Spark

> Reset spark.driver.bindAddress when starting a Checkpoint
> ---------------------------------------------------------
>
> Key: SPARK-22294
> URL: https://issues.apache.org/jira/browse/SPARK-22294
> Assignee: Apache Spark

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22294) Reset spark.driver.bindAddress when starting a Checkpoint
Santiago Saavedra created SPARK-22294: - Summary: Reset spark.driver.bindAddress when starting a Checkpoint Key: SPARK-22294 URL: https://issues.apache.org/jira/browse/SPARK-22294 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 2.2.0, 2.1.0 Reporter: Santiago Saavedra On SPARK-4563 support for binding the driver to a different address than the spark.driver.host was provided so that the driver could be running under a differently-routed network. However, when the driver fails, the Checkpoint restoring function expects that the {{spark.driver.bindAddress}} remains the same even if the {{spark.driver.host}} variable may change. That limits the capabilities of recovery under several cluster configurations, and we propose that {{spark.driver.bindAddress}} should have the same replacement behaviour as {{spark.driver.host}}.
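The proposed behaviour can be sketched as a small, purely illustrative helper. The two configuration keys are Spark's real ones, but the function and its restore logic are assumptions for illustration only, not Spark's actual Checkpoint code:

```python
def restore_driver_conf(checkpointed_conf, new_driver_env):
    """Sketch of the proposal: when restoring from a Checkpoint, replace
    spark.driver.bindAddress from the restarted driver's environment in
    the same way spark.driver.host is replaced, instead of reusing the
    stale checkpointed value."""
    conf = dict(checkpointed_conf)
    for key in ("spark.driver.host", "spark.driver.bindAddress"):
        if key in new_driver_env:
            conf[key] = new_driver_env[key]
    return conf
```

Under this scheme a driver restarted on a differently-routed network would pick up both its new advertised host and its new bind address, while all other checkpointed settings are preserved.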
[jira] [Commented] (SPARK-21459) Some aggregation functions change the case of nested field names
[ https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207270#comment-16207270 ] David Allsopp commented on SPARK-21459: --- Just trying to see when this problem was resolved: * It is present in 1.6.0, as originally reported * In 1.6.3 (currently the latest 1.6.x version), the {{collect_set}} aggregation operation fails with an {{org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectSet because: Only primitive type arguments are accepted but array<struct<Key:string,Value:string>> was passed as parameter 1.;}} * In 2.0.2 and 2.2.0 the problem has gone. > Some aggregation functions change the case of nested field names > > > Key: SPARK-21459 > URL: https://issues.apache.org/jira/browse/SPARK-21459 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: David Allsopp >Priority: Minor > > When working with DataFrames with nested schemas, the behavior of the > aggregation functions is inconsistent with respect to preserving the case of > the nested field names. > For example, {{first()}} preserves the case of the field names, but > {{collect_set()}} and {{collect_list()}} force the field names to lowercase. > Expected behavior: Field name case is preserved (or is at least consistent > and documented) > Spark-shell session to reproduce: > {code:java} > case class Inner(Key:String, Value:String) > case class Outer(ID:Long, Pairs:Array[Inner]) > val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar"))))) > val df = sqlContext.createDataFrame(rdd) > scala> df > ... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>] > scala> df.groupBy("ID").agg(first("Pairs")) > ... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>] > // Note that Key and Value preserve their original case > scala> df.groupBy("ID").agg(collect_set("Pairs")) > ...
= [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>] > // Note that key and value are now lowercased > {code} > Additionally, the column name (generated during aggregation) is inconsistent: > {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses > in the first name.
[jira] [Assigned] (SPARK-22224) Override toString of KeyValueGroupedDataset & RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-22224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22224: --- Assignee: Kent Yao > Override toString of KeyValueGroupedDataset & RelationalGroupedDataset > --- > > Key: SPARK-22224 > URL: https://issues.apache.org/jira/browse/SPARK-22224 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 2.3.0 > > > scala> val words = spark.read.textFile("README.md").flatMap(_.split(" ")) > words: org.apache.spark.sql.Dataset[String] = [value: string] > scala> val grouped = words.groupByKey(identity) > grouped: org.apache.spark.sql.KeyValueGroupedDataset[String,String] = > org.apache.spark.sql.KeyValueGroupedDataset@65214862
[jira] [Resolved] (SPARK-22224) Override toString of KeyValueGroupedDataset & RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-22224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22224. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19363 [https://github.com/apache/spark/pull/19363] > Override toString of KeyValueGroupedDataset & RelationalGroupedDataset > --- > > Key: SPARK-22224 > URL: https://issues.apache.org/jira/browse/SPARK-22224 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kent Yao >Priority: Minor > Fix For: 2.3.0 > > > scala> val words = spark.read.textFile("README.md").flatMap(_.split(" ")) > words: org.apache.spark.sql.Dataset[String] = [value: string] > scala> val grouped = words.groupByKey(identity) > grouped: org.apache.spark.sql.KeyValueGroupedDataset[String,String] = > org.apache.spark.sql.KeyValueGroupedDataset@65214862
[jira] [Commented] (SPARK-21551) pyspark's collect fails when getaddrinfo is too slow
[ https://issues.apache.org/jira/browse/SPARK-21551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207220#comment-16207220 ] Frank Rosner commented on SPARK-21551: -- Do you guys mind if I backport this also to 2.0.x, 2.1.x, and 2.2.x? We are having some jobs that we don't want to upgrade to 2.3.0 but that are failing regularly because of this problem. Which branches would that have to go to? branch-2.0, branch-2.1, and branch-2.2? > pyspark's collect fails when getaddrinfo is too slow > > > Key: SPARK-21551 > URL: https://issues.apache.org/jira/browse/SPARK-21551 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: peay >Assignee: peay >Priority: Critical > Fix For: 2.3.0 > > > Pyspark's {{RDD.collect}}, as well as {{DataFrame.toLocalIterator}} and > {{DataFrame.collect}} all work by starting an ephemeral server in the driver, > and having Python connect to it to download the data. > All three are implemented along the lines of: > {code} > port = self._jdf.collectToPython() > return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( > {code} > The server has **a hardcoded timeout of 3 seconds** > (https://github.com/apache/spark/blob/e26dac5feb02033f980b1e69c9b0ff50869b6f9e/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L695) > -- i.e., the Python process has 3 seconds to connect to it from the very > moment the driver server starts. > In general, that seems fine, but I have been encountering frequent timeouts > leading to `Exception: could not open socket`. > After investigating a bit, it turns out that {{_load_from_socket}} makes a > call to {{getaddrinfo}}: > {code} > def _load_from_socket(port, serializer): > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): >.. connect .. 
> {code} > I am not sure why, but while most such calls to {{getaddrinfo}} on my machine > only take a couple milliseconds, about 10% of them take between 2 and 10 > seconds, leading to about 10% of jobs failing. I don't think we can always > expect {{getaddrinfo}} to return instantly. More generally, Python may > sometimes pause for a couple seconds, which may not leave enough time for the > process to connect to the server. > Especially since the server timeout is hardcoded, I think it would be best to > set a rather generous value (15 seconds?) to avoid such situations. > A {{getaddrinfo}} specific fix could avoid doing it every single time, or do > it before starting up the driver server. > > cc SPARK-677 [~davies]
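The latency variability the reporter describes can be observed directly. The helper below is my own illustration, not pyspark code; it times the same {{socket.getaddrinfo}} call that {{_load_from_socket}} performs before connecting to the driver's ephemeral server, whose accept timeout bounds the total connect budget:

```python
import socket
import time

def time_getaddrinfo(host="localhost", port=12345, attempts=5):
    """Time repeated getaddrinfo lookups for (host, port).

    Each pyspark collect() pays one such lookup before connecting to the
    driver; if a single lookup exceeds the server's accept timeout, the
    connection attempt fails with "could not open socket".
    """
    timings = []
    for _ in range(attempts):
        start = time.monotonic()
        socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        timings.append(time.monotonic() - start)
    return timings
```

Running this in a loop on an affected machine should reproduce the bimodal behaviour reported above (mostly milliseconds, occasionally multiple seconds), which is what motivates a more generous server-side timeout.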
[jira] [Commented] (SPARK-18649) sc.textFile(my_file).collect() raises socket.timeout on large files
[ https://issues.apache.org/jira/browse/SPARK-18649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207217#comment-16207217 ] Frank Rosner commented on SPARK-18649: -- Looks like in SPARK-21551 they increased the hard coded limit to 15 seconds. > sc.textFile(my_file).collect() raises socket.timeout on large files > --- > > Key: SPARK-18649 > URL: https://issues.apache.org/jira/browse/SPARK-18649 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: PySpark version 1.6.2 >Reporter: Erik Cederstrand > > I'm trying to load a file into the driver with this code: > contents = sc.textFile('hdfs://path/to/big_file.csv').collect() > Loading into the driver instead of creating a distributed RDD is intentional > in this case. The file is ca. 6GB, and I have adjusted driver memory > accordingly to fit the local data. After some time, my spark/submitted job > crashes with the stack trace below. > I have traced this to pyspark/rdd.py where the _load_from_socket() method > creates a socket with a hard-coded timeout of 3 seconds (this code is also > present in HEAD although I'm on PySpark 1.6.2). Raising this hard-coded value > to e.g. 600 lets me read the entire file. > Is there any reason that this value does not use e.g. the > 'spark.network.timeout' setting instead? 
> Traceback (most recent call last): > File "my_textfile_test.py", line 119, in <module> > contents = sc.textFile('hdfs://path/to/file.csv').collect() > File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", > line 772, in collect > File "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/rdd.py", > line 142, in _load_from_socket > File > "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 517, in load_stream > File > "/usr/hdp/2.5.0.0-1245/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 511, in loads > File "/usr/lib/python2.7/socket.py", line 380, in read > data = self._sock.recv(left) > socket.timeout: timed out > 16/11/30 13:33:14 WARN Utils: Suppressing exception in finally: Broken pipe > java.net.SocketException: Broken pipe > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) > at java.net.SocketOutputStream.write(SocketOutputStream.java:153) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > at java.io.DataOutputStream.flush(DataOutputStream.java:123) > at java.io.FilterOutputStream.close(FilterOutputStream.java:158) > at > org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$2.apply$mcV$sp(PythonRDD.scala:650) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1248) > at > org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649) > Suppressed: java.net.SocketException: Broken pipe > at java.net.SocketOutputStream.socketWrite0(Native Method) > at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) > at > java.net.SocketOutputStream.write(SocketOutputStream.java:153) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at > java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > at
java.io.FilterOutputStream.close(FilterOutputStream.java:158) > at java.io.FilterOutputStream.close(FilterOutputStream.java:159) > ... 3 more > 16/11/30 13:33:14 ERROR PythonRDD: Error while sending iterator > java.net.SocketException: Connection reset > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) > at java.net.SocketOutputStream.write(SocketOutputStream.java:153) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) > at java.io.DataOutputStream.write(DataOutputStream.java:107) > at java.io.FilterOutputStream.write(FilterOutputStream.java:97) > at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:622) > at > org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:442) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452) > at > org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > or
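The reporter asks why the socket timeout is hard-coded rather than taken from {{spark.network.timeout}}. A minimal sketch of that suggestion follows; the setting name and its unit suffixes (e.g. "120s") are real Spark conventions, but the helper itself and its parsing are illustrative assumptions, not pyspark code:

```python
def effective_socket_timeout(conf, default=3.0):
    """Return a socket timeout in seconds from a Spark-style conf dict,
    falling back to the currently hard-coded default of 3 seconds.

    Handles the 'ms' and 's' unit suffixes that Spark duration settings
    commonly carry; a bare number is treated as seconds.
    """
    raw = conf.get("spark.network.timeout")
    if raw is None:
        return default
    raw = raw.strip().lower()
    if raw.endswith("ms"):
        return float(raw[:-2]) / 1000.0
    if raw.endswith("s"):
        return float(raw[:-1])
    return float(raw)
```

The value returned here would then be passed to `sock.settimeout(...)` in `_load_from_socket` instead of the literal 3, letting operators raise it for large collects without patching pyspark.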
[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
[ https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207204#comment-16207204 ] Pranav Singhania commented on SPARK-16599: -- [~srowen] I have observed this happening with my code, but one common observation was that it always occurred with the last job, which failed every single time, if that could help you understand the cause of the bug. > java.util.NoSuchElementException: None.get at at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343) > -- > > Key: SPARK-16599 > URL: https://issues.apache.org/jira/browse/SPARK-16599 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: centos 6.7 spark 2.0 >Reporter: binde > > run a spark job with spark 2.0, error message > Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (SPARK-22284) Code of class \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-22284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207183#comment-16207183 ] Liang-Chi Hsieh commented on SPARK-22284: - Btw, we have used {{UnsafeProjection}} in many places for expression codegen. Disabling wholestage codegen doesn't help it. Even non-wholestage code path uses expression codegen. > Code of class > \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" > grows beyond 64 KB > -- > > Key: SPARK-22284 > URL: https://issues.apache.org/jira/browse/SPARK-22284 > Project: Spark > Issue Type: Bug > Components: Optimizer, PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Ben > > I am using pySpark 2.1.0 in a production environment, and trying to join two > DataFrames, one of which is very large and has complex nested structures. > Basically, I load both DataFrames and cache them. > Then, in the large DataFrame, I extract 3 nested values and save them as > direct columns. > Finally, I join on these three columns with the smaller DataFrame. 
> This would be a short code for this: > {code} > dataFrame.read..cache() > dataFrameSmall.read...cache() > dataFrame = dataFrame.selectExpr(['*','nested.Value1 AS > Value1','nested.Value2 AS Value2','nested.Value3 AS Value3']) > dataFrame = dataFrame.dropDuplicates().join(dataFrameSmall, > ['Value1','Value2','Value3']) > dataFrame.count() > {code} > And this is the error I get when it gets to the count(): > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in > stage 7.0 failed 4 times, most recent failure: Lost task 11.3 in stage 7.0 > (TID 11234, somehost.com, executor 10): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > \"apply_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V\" > of class > \"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection\" > grows beyond 64 KB > {code} > I have seen many tickets with similar issues here, but no proper solution. > Most of the fixes are until Spark 2.1.0 so I don't know if running it on > Spark 2.2.0 would fix it. In any case I cannot change the version of Spark > since it is in production. > I have also tried setting > {code:java} > spark.sql.codegen.wholeStage=false > {code} > but still the same error. > The job worked well up to now, also with large datasets, but apparently this > batch got larger, and that is the only thing that changed. Is there any > workaround for this?
[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.
[ https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207146#comment-16207146 ] Peng Meng commented on SPARK-22277: --- This seems to be a bug. If no one is working on it, I can work on it. > Chi Square selector garbling Vector content. > > > Key: SPARK-22277 > URL: https://issues.apache.org/jira/browse/SPARK-22277 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.1 >Reporter: Cheburakshu > > There is a difference in behavior when Chisquare selector is used vs. direct > feature use in decision tree classifier. > In the below code, I have used chisquare selector as a thru' pass but the > decision tree classifier is unable to process it. But, it is able to process > when the features are used directly. > The example is pulled out directly from Apache spark python documentation. > Kindly help. > {code:python} > from pyspark.ml.feature import ChiSqSelector > from pyspark.ml.linalg import Vectors > import sys > df = spark.createDataFrame([ > (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,), > (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,), > (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)], ["id", "features", > "clicked"]) > # ChiSq selector will just be a pass-through. All four features in the i/p > will be in output also. > selector = ChiSqSelector(numTopFeatures=4, featuresCol="features", > outputCol="selectedFeatures", labelCol="clicked") > result = selector.fit(df).transform(df) > print("ChiSqSelector output with top %d features selected" % > selector.getNumTopFeatures()) > from pyspark.ml.classification import DecisionTreeClassifier > try: > # Fails > dt = > DecisionTreeClassifier(labelCol="clicked",featuresCol="selectedFeatures") > model = dt.fit(result) > except: > print(sys.exc_info()) > #Works > dt = DecisionTreeClassifier(labelCol="clicked",featuresCol="features") > model = dt.fit(df) > > # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result) > # Select example rows to display. > predictions.select("prediction", "clicked", "features").show(5) > # Select (prediction, true label) and compute test error > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > evaluator = MulticlassClassificationEvaluator( > labelCol="clicked", predictionCol="prediction", metricName="accuracy") > accuracy = evaluator.evaluate(predictions) > print("Test Error = %g " % (1.0 - accuracy)) > {code} > Output: > ChiSqSelector output with top 4 features selected > (, > IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but > it does not have the number of values specified.', > 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t > at > org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t > at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t at > org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t > at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t at > org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t at > sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t > at
java.lang.reflect.Method.invoke(Method.java:498)\n\t at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at > py4j.Gateway.invoke(Gateway.java:280)\n\t at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at > py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at > py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at > java.lang.Thread.run(Thread.java:745)'), ) > +----------+-------+------------------+ > |prediction|clicked|          features| > +----------+-------+------------------+ > |       1.0|    1.0|[0.0,0.0,18.0,1.0]| > |       0.0|    0.0|[0.0,1.0,12.0,0.0]| > |       0.0|    0.0|[1.0,0.0,15.0,0.1]| > +----------+-------+------------------+
[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207133#comment-16207133 ] Nick Pentreath commented on SPARK-22289: I think option (2) is the more general fix here. > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Nic Eggert > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter. > Model is set up like this: > {code} > val calibrator = new LogisticRegression() > .setFeaturesCol("uncalibrated_probability") > .setLabelCol("label") > .setWeightCol("weight") > .setStandardization(false) > .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0))) > .setFamily("binomial") > .setProbabilityCol("probability") > .setPredictionCol("logistic_prediction") > .setRawPredictionCol("logistic_raw_prediction") > {code} > {code} > 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. 
> at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277) > at > org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253) > at > org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > -snip- > {code}
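For context, the "more general fix" amounts to making the matrix-valued param JSON-serializable, the way string and vector params already are. The pure-Python sketch below shows what such an encoding could contain; the function name and wire format are assumptions for illustration, not the actual Spark change, though the fields mirror org.apache.spark.ml.linalg.DenseMatrix's shape-plus-values layout:

```python
import json

def dense_matrix_json_encode(num_rows, num_cols, values, is_transposed=False):
    """Illustrative JSON encoding for a dense-matrix parameter value:
    record the shape, the flat values array, and the storage order, so a
    matching decoder can reconstruct the matrix on load."""
    if len(values) != num_rows * num_cols:
        raise ValueError("values length must equal numRows * numCols")
    return json.dumps({
        "type": "dense",
        "numRows": num_rows,
        "numCols": num_cols,
        "values": list(values),
        "isTransposed": is_transposed,
    })
```

With an encoder (and matching decoder) of this shape registered for the bounds params, saving a model configured via {{setLowerBoundsOnCoefficients}} would no longer hit the {{NotImplementedError}} in the default {{jsonEncode}}.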
[jira] [Resolved] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
[ https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22249. --- Resolution: Fixed Resolved by https://github.com/apache/spark/pull/19494 > UnsupportedOperationException: empty.reduceLeft when caching a dataframe > > > Key: SPARK-22249 > URL: https://issues.apache.org/jira/browse/SPARK-22249 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.1, 2.2.0 > Environment: $ uname -a > Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 > 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.2.0 > /_/ > > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92 > Branch > Compiled by user jenkins on 2017-06-30T22:58:04Z > Revision > Url >Reporter: Andreas Maier >Assignee: Marco Gaido > Fix For: 2.2.1, 2.3.0 > > > It seems that the {{isin()}} method with an empty list as argument only > works if the dataframe is not cached. If it is cached, it results in an > exception.
To reproduce > {code:java} > $ pyspark > >>> df = spark.createDataFrame([pyspark.Row(KEY="value")]) > >>> df.where(df["KEY"].isin([])).show() > +---+ > |KEY| > +---+ > +---+ > >>> df.cache() > DataFrame[KEY: string] > >>> df.where(df["KEY"].isin([])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", > line 336, in show > print(self._jdf.showString(n, 20)) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString. > : java.lang.UnsupportedOperationException: empty.reduceLeft > at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) > at > scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) > at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) > at scala.collection.AbstractTraversable.reduce(Traversable.scala:104) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) > at > 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > at scala.collection.Iterator$$anon$12.
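The stack trace shows the root cause: an unguarded {{reduceLeft}} in {{InMemoryTableScanExec}} over the (empty) list of partition-filter predicates that an empty {{isin()}} produces. The shape of the fix, explicitly handling the empty case, can be sketched in Python; this helper is illustrative only, not the actual Scala change in the linked pull request:

```python
from functools import reduce

def combine_predicates(predicates, empty_result):
    """Combine a list of boolean filter predicates with OR, returning an
    explicit result for the empty list instead of letting reduce raise,
    which is the failure mode behind empty.reduceLeft here."""
    predicates = list(predicates)
    if not predicates:
        return empty_result
    return reduce(lambda a, b: a or b, predicates)
```

An `isin([])` filter matches no rows, so the sensible `empty_result` for the cached-scan case is the always-false predicate rather than an exception.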
[jira] [Assigned] (SPARK-22249) UnsupportedOperationException: empty.reduceLeft when caching a dataframe
[ https://issues.apache.org/jira/browse/SPARK-22249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-22249: - Assignee: Marco Gaido Fix Version/s: 2.3.0 2.2.1 > UnsupportedOperationException: empty.reduceLeft when caching a dataframe > > > Key: SPARK-22249 > URL: https://issues.apache.org/jira/browse/SPARK-22249 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.1, 2.2.0 > Environment: $ uname -a > Darwin MAC-UM-024.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 > 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64 > $ pyspark --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.2.0 > /_/ > > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_92 > Branch > Compiled by user jenkins on 2017-06-30T22:58:04Z > Revision > Url >Reporter: Andreas Maier >Assignee: Marco Gaido > Fix For: 2.2.1, 2.3.0 > > > It seems that the {{isin()}} method with an empty list as argument only > works if the dataframe is not cached. If it is cached, it results in an > exception.
To reproduce > {code:java} > $ pyspark > >>> df = spark.createDataFrame([pyspark.Row(KEY="value")]) > >>> df.where(df["KEY"].isin([])).show() > +---+ > |KEY| > +---+ > +---+ > >>> df.cache() > DataFrame[KEY: string] > >>> df.where(df["KEY"].isin([])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/dataframe.py", > line 336, in show > print(self._jdf.showString(n, 20)) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", > line 1133, in __call__ > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/anaconda3/envs//lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o302.showString. > : java.lang.UnsupportedOperationException: empty.reduceLeft > at > scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:180) > at > scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:74) > at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:48) > at > scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) > at scala.collection.AbstractTraversable.reduce(Traversable.scala:104) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:107) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:71) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) > at > 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:112) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:111) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:111) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$3.apply(SparkStrategies.scala:307) > at > org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:99) > at > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:303) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > at scala.collection.Iterator$$
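The root cause in SPARK-22249 (and its duplicate SPARK-19317 below) is that an empty {{isin()}} yields an empty list of partition-filter predicates, which {{InMemoryTableScanExec}} then collapses with {{reduceLeft}}; reducing an empty collection has no element to start from and throws. A minimal, hedged Python sketch of the failing pattern and the standard fix (function names are illustrative, not Spark's code):

```python
from functools import reduce
from operator import or_

def combine_predicates_unsafe(preds):
    # Mirrors the failing pattern: reducing an empty collection
    # has no starting element, so it raises.
    return reduce(or_, preds)

def combine_predicates_safe(preds):
    # The standard fix: seed the reduction with an identity, so
    # an empty IN-list collapses to "match nothing" (False).
    return reduce(or_, preds, False)

print(combine_predicates_safe([]))            # False
print(combine_predicates_safe([True, False])) # True
try:
    combine_predicates_unsafe([])
except TypeError as e:
    print("empty reduce:", e)
```

The same identity-element trick is how an `isin([])` filter can legitimately evaluate to "no rows" instead of crashing.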
[jira] [Resolved] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized
[ https://issues.apache.org/jira/browse/SPARK-19317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19317. --- Resolution: Duplicate > UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized > - > > Key: SPARK-19317 > URL: https://issues.apache.org/jira/browse/SPARK-19317 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Barry Becker >Priority: Minor > > I wish I had a simpler reproducible case to give, but I got the below > exception while selecting null values in one of the columns of a dataframe. > My client code that failed was > df.filter(filterExp).count() > where the filter expression was something like "someCall.isin() || > someCall.isNull". Based on the stack trace, I think the problem may be the > "isin()" in the expression, since there are no values to match in it. > There were 412 nulls out of 716,000 total rows for the column being filtered. > The exception seems to indicate that Spark is trying to do reduceLeft on an > empty list, but the dataset is not empty. 
> {code} > java.lang.UnsupportedOperationException: > empty.reduceLeftscala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:137) > scala.collection.immutable.List.reduceLeft(List.scala:84) > scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) > scala.collection.AbstractTraversable.reduce(Traversable.scala:104) > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:90) > > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:54) > > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:61) > > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:54) > scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) > scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:95) > > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:94) > > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > scala.collection.immutable.List.foreach(List.scala:381) > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > scala.collection.immutable.List.flatMap(List.scala:344) > org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:94) > > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$6.apply(SparkStrategies.scala:306) > > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$6.apply(SparkStrategies.scala:306) > > 
org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:96) > > org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:302) > > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62) > scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > scala.collection.Iterator$class.foreach(Iterator.scala:893) > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > org.apache.spa
[jira] [Assigned] (SPARK-20992) Link to Nomad scheduler backend in docs
[ https://issues.apache.org/jira/browse/SPARK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20992: - Assignee: Ben Barnard > Link to Nomad scheduler backend in docs > --- > > Key: SPARK-20992 > URL: https://issues.apache.org/jira/browse/SPARK-20992 > Project: Spark > Issue Type: Documentation > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Ben Barnard >Assignee: Ben Barnard >Priority: Trivial > Fix For: 2.3.0 > > > It is convenient to have scheduler backend support for running applications > on [Nomad|https://github.com/hashicorp/nomad], as with YARN and Mesos, so > that users can run Spark applications on a Nomad cluster without the need to > bring up a Spark Standalone cluster in the Nomad cluster. > Both client and cluster deploy modes should be supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20992) Link to Nomad scheduler backend in docs
[ https://issues.apache.org/jira/browse/SPARK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20992. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19354 [https://github.com/apache/spark/pull/19354] > Link to Nomad scheduler backend in docs > --- > > Key: SPARK-20992 > URL: https://issues.apache.org/jira/browse/SPARK-20992 > Project: Spark > Issue Type: Documentation > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Ben Barnard >Priority: Trivial > Fix For: 2.3.0 > > > It is convenient to have scheduler backend support for running applications > on [Nomad|https://github.com/hashicorp/nomad], as with YARN and Mesos, so > that users can run Spark applications on a Nomad cluster without the need to > bring up a Spark Standalone cluster in the Nomad cluster. > Both client and cluster deploy modes should be supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20992) Link to Nomad scheduler backend in docs
[ https://issues.apache.org/jira/browse/SPARK-20992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20992: -- Priority: Trivial (was: Major) Issue Type: Documentation (was: New Feature) Summary: Link to Nomad scheduler backend in docs (was: Support for Nomad as scheduler backend) Changing this to be about linking to Nomad for now > Link to Nomad scheduler backend in docs > --- > > Key: SPARK-20992 > URL: https://issues.apache.org/jira/browse/SPARK-20992 > Project: Spark > Issue Type: Documentation > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Ben Barnard >Priority: Trivial > > It is convenient to have scheduler backend support for running applications > on [Nomad|https://github.com/hashicorp/nomad], as with YARN and Mesos, so > that users can run Spark applications on a Nomad cluster without the need to > bring up a Spark Standalone cluster in the Nomad cluster. > Both client and cluster deploy modes should be supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22288) Tricky interaction between closure-serialization and inheritance results in confusing failure
[ https://issues.apache.org/jira/browse/SPARK-22288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207116#comment-16207116 ] Sean Owen commented on SPARK-22288: --- I think this is a Java serialization question, not Spark. Still, I'm curious why it fails here, since it looks like the child class has a no-arg constructor and is serializable, and it's odd that case 1 works but not case 2. I think this might somehow be the difference between scala.Serializable and java.io.Serializable as per https://stackoverflow.com/a/27566890/64174 I don't think this is something Spark can do anything about, though. You can always use a different serializer like Kryo. > Tricky interaction between closure-serialization and inheritance results in > confusing failure > - > > Key: SPARK-22288 > URL: https://issues.apache.org/jira/browse/SPARK-22288 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > Documenting this since I've run into it a few times; [full repro / discussion > here|https://github.com/ryan-williams/spark-bugs/tree/serde]. > Given 3 possible super-classes: > {code} > class Super1(n: Int) > class Super2(n: Int) extends Serializable > class Super3 > {code} > A subclass that passes a closure to an RDD operation (e.g. {{map}} or > {{filter}}), where the closure references one of the subclass's fields, will > throw a {{java.io.InvalidClassException: …; no valid constructor}} exception > when the subclass extends {{Super1}} but not {{Super2}} or {{Super3}}. > Referencing method-local variables (instead of fields) is fine in all cases: > {code} > class App extends Super1(4) with Serializable { > val s = "abc" > def run(): Unit = { > val sc = new SparkContext(new SparkConf().set("spark.master", > "local[4]").set("spark.app.name", "serde-test")) > try { > sc > .parallelize(1 to 10) > .filter(App.fn(_, s)) // danger! 
closure references `s`, crash > ensues > .collect() // driver stack-trace points here > } finally { > sc.stop() > } > } > } > object App { > def main(args: Array[String]): Unit = { new App().run() } > def fn(i: Int, s: String): Boolean = i % 2 == 0 > } > {code} > The task-failure stack trace looks like: > {code} > java.io.InvalidClassException: com.MyClass; no valid constructor > at > java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:150) > at > java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:790) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1782) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353) > {code} > and a driver stack-trace will point to the first line that initiates a Spark > job that exercises the closure/RDD-operation in question. > Not sure how much this should be considered a problem with Spark's > closure-serialization logic vs. Java serialization, but maybe if the former > gets looked at or improved (e.g. with 2.12 support), this kind of interaction > can be improved upon. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
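The failure mode is easy to reproduce outside the JVM: a closure over a field captures the whole enclosing object, while a closure over a local captures only the value. Below is a hedged Python/pickle analogue of the Scala repro above; the class and function names are invented for illustration, and since Java's no-arg-constructor rule has no exact Python equivalent, an object that refuses to serialize stands in for {{Super1}}:

```python
import pickle
from functools import partial

class NoArglessSuper:
    # Stand-in for Super1: a superclass the serializer cannot
    # reconstruct (hypothetical analogue of Java's missing
    # no-arg-constructor failure).
    def __reduce__(self):
        raise TypeError("no valid constructor")

def fn(i, s):
    return i % 2 == 0

class App(NoArglessSuper):
    def __init__(self):
        self.s = "abc"

    def run_fn(self, i):
        return fn(i, self.s)

    def field_filter(self):
        # A bound method captures `self`, so serializing it drags
        # the whole App (and its unserializable superclass state)
        # along -- the "danger!" case above.
        return self.run_fn

    def local_filter(self):
        # Copy the field into a local and close over the copy:
        # only the string travels, so serialization succeeds.
        s = self.s
        return partial(fn, s=s)

app = App()
ok = pickle.loads(pickle.dumps(app.local_filter()))
print(ok(4))   # True
try:
    pickle.dumps(app.field_filter())
except TypeError as e:
    print("serialization failed:", e)
```

The structural lesson matches the JIRA text: reference method-local copies, not instance fields, inside closures shipped to executors.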
[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207115#comment-16207115 ] yuhao yang commented on SPARK-22289: cc [~yanboliang] [~dbtsai] > Cannot save LogisticRegressionClassificationModel with bounds on coefficients > - > > Key: SPARK-22289 > URL: https://issues.apache.org/jira/browse/SPARK-22289 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Nic Eggert > > I think this was introduced in SPARK-20047. > Trying to call save on a logistic regression model trained with bounds on its > parameters throws an error. This seems to be because Spark doesn't know how > to serialize the Matrix parameter. > Model is set up like this: > {code} > val calibrator = new LogisticRegression() > .setFeaturesCol("uncalibrated_probability") > .setLabelCol("label") > .setWeightCol("weight") > .setStandardization(false) > .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0))) > .setFamily("binomial") > .setProbabilityCol("probability") > .setPredictionCol("logistic_prediction") > .setRawPredictionCol("logistic_raw_prediction") > {code} > {code} > 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. > scala.NotImplementedError: The default jsonEncode only supports string and > vector. org.apache.spark.ml.param.Param must override jsonEncode for > org.apache.spark.ml.linalg.DenseMatrix. 
> at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295) > at > org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277) > at > org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253) > at > org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337) > at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114) > -snip- > {code} -- This message was sent by 
Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
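The error above says the default {{jsonEncode}} has no encoding for {{DenseMatrix}}, so the bounds param cannot be written out. The shape of the fix is a type-aware encoder; here is a hedged, pure-Python analogue, where the {{DenseMatrix}} stand-in and the JSON field names are invented for illustration and are not Spark's actual persistence format:

```python
import json

class DenseMatrix:
    # Minimal stand-in for ml.linalg.DenseMatrix: row/column
    # counts plus the values.
    def __init__(self, num_rows, num_cols, values):
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.values = values

bounds = DenseMatrix(1, 1, [0.0])

try:
    # The default encoder fails on the unknown type, just as the
    # default jsonEncode does in the stack trace above.
    json.dumps(bounds)
except TypeError as e:
    print("cannot encode:", e)

def encode_param(value):
    # A type-aware encoder that knows how to flatten the matrix
    # (field names here are illustrative).
    if isinstance(value, DenseMatrix):
        return {"type": "matrix", "numRows": value.num_rows,
                "numCols": value.num_cols, "values": value.values}
    raise TypeError(f"unsupported param type: {type(value).__name__}")

print(json.dumps(bounds, default=encode_param))
```

Until such an override exists for the bounds params, the model simply cannot be saved with them set.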
[jira] [Resolved] (SPARK-21459) Some aggregation functions change the case of nested field names
[ https://issues.apache.org/jira/browse/SPARK-21459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21459. --- Resolution: Cannot Reproduce > Some aggregation functions change the case of nested field names > > > Key: SPARK-21459 > URL: https://issues.apache.org/jira/browse/SPARK-21459 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: David Allsopp >Priority: Minor > > When working with DataFrames with nested schemas, the behavior of the > aggregation functions is inconsistent with respect to preserving the case of > the nested field names. > For example, {{first()}} preserves the case of the field names, but > {{collect_set()}} and {{collect_list()}} force the field names to lowercase. > Expected behavior: Field name case is preserved (or is at least consistent > and documented) > Spark-shell session to reproduce: > {code:java} > case class Inner(Key:String, Value:String) > case class Outer(ID:Long, Pairs:Array[Inner]) > val rdd = sc.parallelize(Seq(Outer(1L, Array(Inner("foo", "bar"))))) > val df = sqlContext.createDataFrame(rdd) > scala> df > ... = [ID: bigint, Pairs: array<struct<Key:string,Value:string>>] > scala> df.groupBy("ID").agg(first("Pairs")) > ... = [ID: bigint, first(Pairs)(): array<struct<Key:string,Value:string>>] > // Note that Key and Value preserve their original case > scala> df.groupBy("ID").agg(collect_set("Pairs")) > ... = [ID: bigint, collect_set(Pairs): array<struct<key:string,value:string>>] > // Note that key and value are now lowercased > {code} > Additionally, the column name (generated during aggregation) is inconsistent: > {{first(Pairs)()}} versus {{collect_set(Pairs)}} - note the extra parentheses > in the first name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
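The behavior the reporter expects — resolution that ignores case while output preserves the declared spelling — can be sketched generically. A small, hedged Python illustration (this is not Spark's analyzer, just the contract it should satisfy):

```python
# Case-insensitive field resolution that still answers with the
# field's original spelling: the behavior first() exhibits and
# collect_set()/collect_list() lose in the report above.
fields = ["Key", "Value"]
by_lower = {name.lower(): name for name in fields}

def resolve(name):
    # Match regardless of case, but return the declared spelling.
    return by_lower[name.lower()]

print(resolve("key"))    # Key
print(resolve("VALUE"))  # Value
```

The point is that lowercasing is fine as a lookup key, but the original names should survive into the result schema.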
[jira] [Commented] (SPARK-21697) NPE & ExceptionInInitializerError trying to load UTF from HDFS
[ https://issues.apache.org/jira/browse/SPARK-21697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207098#comment-16207098 ] Bang Xiao commented on SPARK-21697: --- In the case described above, I added "spark.jars file:///xxx.jar" in conf/spark-defaults.conf; then, when I used "add jar" through the SparkSQL CLI in yarn-client mode, the error occurred. If I added "spark.jars file:///xxx.jar,hdfs://xxx.jar" in conf/spark-defaults.conf, I could run "add jar hdfs:///.jar" successfully through the SparkSQL CLI in yarn-client mode. > NPE & ExceptionInInitializerError trying to load UTF from HDFS > -- > > Key: SPARK-21697 > URL: https://issues.apache.org/jira/browse/SPARK-21697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Spark Client mode, Hadoop 2.6.0 >Reporter: Steve Loughran >Priority: Minor > > Reported on [the > PR|https://github.com/apache/spark/pull/17342#issuecomment-321438157] for > SPARK-12868: trying to load a UDF off HDFS is triggering an > {{ExceptionInInitializerError}}, caused by an NPE which should only happen if > the commons-logging {{LOG}} log is null. > Hypothesis: the commons-logging scan for {{commons-logging.properties}} is > happening in the classpath with the HDFS JAR; this is triggering a D/L of the > JAR, which needs to force in commons-logging, and, as that's not inited yet, > NPEs -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22264) History server will be unavailable if there is an event log file with large size
[ https://issues.apache.org/jira/browse/SPARK-22264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang resolved SPARK-22264. -- Resolution: Duplicate > History server will be unavailable if there is an event log file with large > size > > > Key: SPARK-22264 > URL: https://issues.apache.org/jira/browse/SPARK-22264 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.0 >Reporter: zhoukang > Attachments: not-found.png > > > The history server will be unavailable if there is an event log file with a large > size. "Large" here means that replaying the file takes too long. > *We can fix this by adding a timeout for event log replaying.* > *Here is an example:* > Every application submitted after a restart cannot open the history UI. > !not-found.png! > *From the event log directory we can find an event log file bigger than > 130 GB:* > {code:java} > hadoop *144149840801* 2017-08-29 14:03 > /spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress > {code} > *And from jstack and the server log we can see the replaying task blocked on this > event log.* > *Server log:* > {code:java} > 2017-10-12,16:00:12,151 INFO > org.apache.spark.deploy.history.FsHistoryProvider: Replaying log path: > hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress > 2017-10-12,16:00:12,167 INFO org.apache.spark.scheduler.ReplayListenerBus: > Begin to replay > hdfs://xxx/spark/xxx/log/history/application_1501588539284_1118255_1.lz4.inprogress! 
> {code} > *jstack* > {code:java} > "log-replay-executor-0" daemon prio=10 tid=0x7f0f48014800 nid=0x6160 > runnable [0x7f0f4f6f5000] >java.lang.Thread.State: RUNNABLE > at net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast(Native Method) > at > net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:37) > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:205) > at > org.apache.spark.io.LZ4BlockInputStream.read(LZ4BlockInputStream.java:125) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177) > - locked <0x0005f0096948> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:154) > at java.io.BufferedReader.readLine(BufferedReader.java:317) > - locked <0x0005f0096948> (a java.io.InputStreamReader) > at java.io.BufferedReader.readLine(BufferedReader.java:382) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72) > at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:79) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:776) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:584) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$3$$anon$4.run(FsHistoryProvider.scala:464) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > 
at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
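The fix the reporter proposes — bound how long a single log replay may block the history server's executor — can be sketched with a worker pool and a result timeout. A hedged Python sketch (file names and delays are made up; a real fix would live inside {{FsHistoryProvider}}'s log-scanning loop, not application code):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def replay_event_log(path, delay):
    # Stand-in for ReplayListenerBus.replay: pretend parsing the
    # log takes `delay` seconds.
    time.sleep(delay)
    return f"replayed {path}"

def replay_with_timeout(path, delay, timeout):
    # Bound how long one replay may block before the server moves
    # on to the next log instead of hanging on a 130 GB file.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(replay_event_log, path, delay)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            return f"skipped {path}: replay exceeded {timeout}s"

print(replay_with_timeout("app_small.lz4", delay=0.01, timeout=1.0))
print(replay_with_timeout("app_huge.lz4", delay=0.5, timeout=0.1))
```

Note that with a plain thread pool the oversized replay still runs to completion in the background; truly cancelling it needs cooperative checks inside the replay loop.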