[jira] [Updated] (SPARK-10657) Remove legacy SCP-based Jenkins log archiving code
[ https://issues.apache.org/jira/browse/SPARK-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10657: --- Summary: Remove legacy SCP-based Jenkins log archiving code (was: Remove legacy SSH-based Jenkins log archiving code) > Remove legacy SCP-based Jenkins log archiving code > -- > > Key: SPARK-10657 > URL: https://issues.apache.org/jira/browse/SPARK-10657 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > > As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to > use our custom SSH-based mechanism for archiving Jenkins logs on the master > machine; this has been superseded by the use of a Jenkins plugin which > archives the logs and provides public viewing of them. > We should remove the legacy log syncing code, since this is a blocker to > disabling Worker -> Master SSH on Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10657) Remove legacy SCP-based Jenkins log archiving code
[ https://issues.apache.org/jira/browse/SPARK-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-10657: --- Description: As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SCP-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public viewing of them. We should remove the legacy log syncing code, since this is a blocker to disabling Worker -> Master SSH on Jenkins. was: As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SSH-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public viewing of them. We should remove the legacy log syncing code, since this is a blocker to disabling Worker -> Master SSH on Jenkins. > Remove legacy SCP-based Jenkins log archiving code > -- > > Key: SPARK-10657 > URL: https://issues.apache.org/jira/browse/SPARK-10657 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > > As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to > use our custom SCP-based mechanism for archiving Jenkins logs on the master > machine; this has been superseded by the use of a Jenkins plugin which > archives the logs and provides public viewing of them. > We should remove the legacy log syncing code, since this is a blocker to > disabling Worker -> Master SSH on Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10657) Remove legacy SCP-based Jenkins log archiving code
[ https://issues.apache.org/jira/browse/SPARK-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10657: Assignee: Josh Rosen (was: Apache Spark) > Remove legacy SCP-based Jenkins log archiving code > -- > > Key: SPARK-10657 > URL: https://issues.apache.org/jira/browse/SPARK-10657 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > > As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to > use our custom SCP-based mechanism for archiving Jenkins logs on the master > machine; this has been superseded by the use of a Jenkins plugin which > archives the logs and provides public viewing of them. > We should remove the legacy log syncing code, since this is a blocker to > disabling Worker -> Master SSH on Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10658) Could pyspark provide addJars() as scala spark API?
ryanchou created SPARK-10658: Summary: Could pyspark provide addJars() as scala spark API? Key: SPARK-10658 URL: https://issues.apache.org/jira/browse/SPARK-10658 Project: Spark Issue Type: Wish Components: PySpark Affects Versions: 1.3.1 Environment: Linux ubuntu 14.01 LTS Reporter: ryanchou

My Spark program is written with the PySpark API, and it uses the spark-csv jar library. I can submit the task with spark-submit and pass a `--jars` argument to pick up the spark-csv jar, as in:

```
/bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar xxx.py
```

However, I need to run my unit tests like:

```
py.test -vvs test_xxx.py
```

which gives me no way to pass a `--jars` argument. Therefore I tried the SparkContext.addPyFile() API to add jars inside my test_xxx.py, because addPyFile()'s doc mentions PACKAGES_EXTENSIONS = (.zip, .py, .jar). Does that mean I can add *.jar files (jar libraries) with addPyFile()? The code which uses addPyFile() to add jars is:

```
self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
sqlContext = SQLContext(self.sc)
self.dataframe = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="xxx.csv"
)
```

But it doesn't work: sqlContext cannot load the source (com.databricks.spark.csv). Eventually I found another way, setting the environment variable SPARK_CLASSPATH to load the jar libraries:

```
SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
```

With this, the jar libraries are loaded and sqlContext can load the source successfully, just as when passing `--jars xxx1.jar` arguments. So, for the situation of using third-party jars in PySpark-written scripts (.py and .egg files work well via addPyFile()) where `--jars` cannot be used (py.test -vvs test_xxx.py): have you ever planned to provide an API such as Scala's addJars() for adding jars to a Spark program, or is there a better way to add jars that I still haven't found?
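The SPARK_CLASSPATH workaround the reporter landed on can be wrapped in a small helper so the test runner always starts the JVM with the right jars on the classpath. A minimal plain-Python sketch (the jar paths and helper names are hypothetical, and this only assembles the environment and command line; it does not touch Spark itself):

```python
import os
import subprocess

def build_test_env(jar_paths, base_env=None):
    """Return an environment dict with SPARK_CLASSPATH set so that
    PySpark picks up the given jars when py.test starts the JVM."""
    env = dict(base_env if base_env is not None else os.environ)
    env["SPARK_CLASSPATH"] = os.pathsep.join(jar_paths)
    return env

def run_tests(test_file, jar_paths):
    """Invoke py.test in a subprocess with the jars on the classpath."""
    env = build_test_env(jar_paths)
    return subprocess.call(["py.test", "-vvs", test_file], env=env)

# Usage (hypothetical paths):
# run_tests("test_xxx.py", ["/path/spark-csv_2.10-1.1.0.jar"])
```

This sidesteps the missing addJars() API by configuring the classpath before the JVM exists, which is why it works where addPyFile() (a post-startup call) does not.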
[jira] [Updated] (SPARK-10658) Could pyspark provide addJars() as scala spark API?
[ https://issues.apache.org/jira/browse/SPARK-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ryanchou updated SPARK-10658: - Description: My Spark program is written with the PySpark API, and it uses the spark-csv jar library. I can submit the task with spark-submit and pass a `--jars` argument to pick up the spark-csv jar, as in:

```
/bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar xxx.py
```

However, I need to run my unit tests like:

```
py.test -vvs test_xxx.py
```

which gives me no way to pass a `--jars` argument. Therefore I tried the SparkContext.addPyFile() API to add jars inside my test_xxx.py, because addPyFile()'s doc mentions PACKAGES_EXTENSIONS = (.zip, .py, .jar). Does that mean I can add *.jar files (jar libraries) with addPyFile()? The code which uses addPyFile() to add jars is:

```
self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
sqlContext = SQLContext(self.sc)
self.dataframe = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="xxx.csv"
)
```

But it doesn't work: sqlContext cannot load the source (com.databricks.spark.csv). Eventually I found another way, setting the environment variable SPARK_CLASSPATH to load the jar libraries:

```
SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
```

With this, the jar libraries are loaded and sqlContext can load the source successfully, just as when passing `--jars xxx1.jar` arguments. So, for the situation of using third-party jars in PySpark-written scripts (.py and .egg files work well via addPyFile()) where `--jars` cannot be used (py.test -vvs test_xxx.py): have you ever planned to provide an API such as Scala's addJars() for adding jars to a Spark program, or is there a better way to add jars that I still haven't found? If someone wants to addJars() in PySpark-written scripts without using '--jars', could you give us some suggestions?
was: My spark program was written by pyspark API , and it has used the spark-csv jar library. I could submit the task by spark-submit, and add `--jars` arguments for using spark-csv jar library as following commands: ``` /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar xxx.py ``` However I need to run my unittests like: ``` py.test -vvs test_xxx.py ``` It could't add jars by adding '--jars' arugment. Therefore I tried to use the SparkContext.addPyFile() API to add jars in my test_xxx.py. Because I saw the addPyFile()'s doc mention me PACKAGES_EXTENSIONS = (.zip, .py, .jar). Does it mean that I could add *.jar (jar libraries) by using the addPyFile()? The codes which using addPyFile() to add jars as below: ``` self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar")) sqlContext = SQLContext(self.sc) self.dataframe = sqlContext.load( source="com.databricks.spark.csv", header="true", path="xxx.csv" ) ``` While it doesn't work. sqlContext cannot load the source(com.databricks.spark.csv) Eventually I have found another way to set the enviroment variable SPARK_CLASSPATH for loading jars libraries ``` SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py ``` It could load the jars libraries and sqlContext could load source succeed as well as adding `--jar xxx1.jar` arguments For the situation on using third party jars (.py & .egg could work well by using addPyFile()) in pyspark-written scripts. and it cannot use `--jars` on the situation (py.test -vvs test_xxx.py). Have you ever planed to provide an API such as addJars() in scala for adding jars to spark program, or was there a better way to add jars I still havent found it yet? > Could pyspark provide addJars() as scala spark API? 
> > > Key: SPARK-10658 > URL: https://issues.apache.org/jira/browse/SPARK-10658 > Project: Spark > Issue Type: Wish > Components: PySpark >Affects Versions: 1.3.1 > Environment: Linux ubuntu 14.01 LTS >Reporter: ryanchou > Labels: features > Original Estimate: 48h > Remaining Estimate: 48h > > My spark program was written by pyspark API , and it has used the spark-csv > jar library. > I could submit the task by spark-submit, and add `--jars` arguments for using > spark-csv jar library as following commands: > ``` > /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar xxx.py > ``` > However I need to run my unittests like: > ``` > py.test -vvs test_xxx.py > ``` > It could't add jars by adding '--jars' arugment. > Therefore I tried to use the SparkContext.addPyFile() API
[jira] [Commented] (SPARK-10606) Cube/Rollup/GrpSet doesn't create the correct plan when group by is on something other than an AttributeReference
[ https://issues.apache.org/jira/browse/SPARK-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791499#comment-14791499 ] Cheng Hao commented on SPARK-10606: --- [~rhbutani] Which version are you using? I've already fixed this bug in SPARK-8972, and the fix should be included in 1.5. Can you try with 1.5? > Cube/Rollup/GrpSet doesn't create the correct plan when group by is on > something other than an AttributeReference > - > > Key: SPARK-10606 > URL: https://issues.apache.org/jira/browse/SPARK-10606 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Harish Butani >Priority: Critical > > Consider the following table: t(a : String, b : String) and the query > {code} > select a, concat(b, '1'), count(*) > from t > group by a, concat(b, '1') with cube > {code} > The projections in the Expand operator are not set up correctly. The expand > logic in Analyzer:expand is comparing grouping expressions against > child.output. So {{concat(b, '1')}} is never mapped to a null Literal. > A simple fix is to add a Rule to introduce a Projection below the > Cube/Rollup/GrpSet operator that additionally projects the > groupingExpressions that are missing in the child. > Marking this as Critical, because you get wrong results.
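The substitution the report says is missing can be illustrated with a Spark-free sketch: each grouping set of a CUBE replaces the omitted grouping expressions, including computed ones like concat(b, '1'), with a null. All names below are illustrative plain Python, not Spark internals:

```python
from collections import Counter

def cube_counts(rows, keyfuncs):
    """Emulate GROUP BY <exprs> WITH CUBE with COUNT(*).
    Each grouping set keeps a subset of the grouping expressions and
    replaces the rest with None -- the null Literal that Spark's Expand
    operator should substitute, even for non-attribute expressions."""
    n = len(keyfuncs)
    counts = Counter()
    for mask in range(2 ** n):        # every subset = one grouping set
        for row in rows:
            key = tuple(
                keyfuncs[i](row) if (mask >> i) & 1 else None
                for i in range(n)
            )
            counts[key] += 1
    return counts

rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"}, {"a": "y", "b": "p"}]
# group by a, concat(b, '1') with cube
counts = cube_counts(rows, [lambda r: r["a"], lambda r: r["b"] + "1"])
# e.g. counts[("x", None)] == 2 and counts[(None, "p1")] == 2
```

The bug described above means the expression-valued key (here `r["b"] + "1"`) was never replaced by None in the expanded rows, which is why the cube results were wrong.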
[jira] [Created] (SPARK-10659) SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema
Vladimir Picka created SPARK-10659: -- Summary: SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema Key: SPARK-10659 URL: https://issues.apache.org/jira/browse/SPARK-10659 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0 Reporter: Vladimir Picka

DataFrames currently automatically promote all Parquet schema fields to optional when they are written to an empty directory. The problem remains in v1.5.0. The culprit is this code:

```
val relation = if (doInsertion) {
  // This is a hack. We always set nullable/containsNull/valueContainsNull to true
  // for the schema of a parquet data.
  val df = sqlContext.createDataFrame(
    data.queryExecution.toRdd,
    data.schema.asNullable)
  val createdRelation =
    createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
  createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
  createdRelation
}
```

which was implemented as part of this PR: https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b

This is very unexpected behaviour for use cases in which files are read from one place and written to another, such as small-file packing: it ends up producing incompatible files, because "required" normally can't be promoted to "optional". It is the essence of a schema that it enforces a "required" invariant on the data, so it should be assumed that the invariant is intended. I believe a better approach is to keep the schema as-is by default and provide, e.g., a builder method or option to force fields to optional. Right now we have to override a private API so that our files are rewritten as-is, with all its perils. Vladimir
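To see why every REQUIRED field comes back optional, the effect of `data.schema.asNullable` can be sketched in plain Python. The `(name, datatype, nullable)` tuple encoding is invented for illustration; Spark's real StructType API differs:

```python
def as_nullable(field):
    """Recursively force nullable=True on a schema field, mimicking the
    blanket asNullable promotion quoted above. A field is a tuple
    (name, datatype, nullable); a struct datatype is a list of fields."""
    name, dtype, _ = field          # the original nullable flag is discarded
    if isinstance(dtype, list):     # recurse into nested struct fields
        dtype = [as_nullable(f) for f in dtype]
    return (name, dtype, True)

# A schema with REQUIRED (nullable=False) fields, including a nested one:
schema = [("id", "long", False),
          ("payload", [("ts", "long", False)], True)]
promoted = [as_nullable(f) for f in schema]
# every REQUIRED field is now optional, so the written Parquet file
# no longer matches the original file's schema
```

This is exactly the information loss the issue describes: the promotion is unconditional, so there is no way to round-trip a file whose schema marks fields REQUIRED.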
[jira] [Updated] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema
[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Picka updated SPARK-10659: --- Summary: DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema (was: SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema) > DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not > nullable) flag in schema > -- > > Key: SPARK-10659 > URL: https://issues.apache.org/jira/browse/SPARK-10659 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0 >Reporter: Vladimir Picka > > DataFrames currently automatically promotes all Parquet schema fields to > optional when they are written to an empty directory. The problem remains in > v1.5.0. > The culprit is this code: > val relation = if (doInsertion) { > // This is a hack. We always set > nullable/containsNull/valueContainsNull to true > // for the schema of a parquet data. > val df = > sqlContext.createDataFrame( > data.queryExecution.toRdd, > data.schema.asNullable) > val createdRelation = > createRelation(sqlContext, parameters, > df.schema).asInstanceOf[ParquetRelation2] > createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite) > createdRelation > } > which was implemented as part of this PR: > https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b > This very unexpected behaviour for some use cases when files are read from > one place and written to another like small file packing - it ends up with > incompatible files because required can't be promoted to optional normally. > It is essence of a schema that it enforces "required" invariant on data. It > should be supposed that it is intended. > I believe that a better approach is to have default behaviour to keep schema > as is and provide f.e. a builder method or option to allow forcing to > optional. 
> Right now we have to overwrite private API so that our files are rewritten as > is with all its perils. > Vladimir -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema
[ https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791613#comment-14791613 ] Vladimir Picka commented on SPARK-10659: Here is unanswered attempt for discussion on a mailing list: https://mail.google.com/mail/#search/label%3Aspark-user+petr/14f64c75c15f5ccd > DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not > nullable) flag in schema > -- > > Key: SPARK-10659 > URL: https://issues.apache.org/jira/browse/SPARK-10659 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0 >Reporter: Vladimir Picka > > DataFrames currently automatically promotes all Parquet schema fields to > optional when they are written to an empty directory. The problem remains in > v1.5.0. > The culprit is this code: > val relation = if (doInsertion) { > // This is a hack. We always set > nullable/containsNull/valueContainsNull to true > // for the schema of a parquet data. > val df = > sqlContext.createDataFrame( > data.queryExecution.toRdd, > data.schema.asNullable) > val createdRelation = > createRelation(sqlContext, parameters, > df.schema).asInstanceOf[ParquetRelation2] > createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite) > createdRelation > } > which was implemented as part of this PR: > https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b > This very unexpected behaviour for some use cases when files are read from > one place and written to another like small file packing - it ends up with > incompatible files because required can't be promoted to optional normally. > It is essence of a schema that it enforces "required" invariant on data. It > should be supposed that it is intended. > I believe that a better approach is to have default behaviour to keep schema > as is and provide f.e. a builder method or option to allow forcing to > optional. 
> Right now we have to overwrite private API so that our files are rewritten as > is with all its perils. > Vladimir -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8547) xgboost exploration
[ https://issues.apache.org/jira/browse/SPARK-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791472#comment-14791472 ] Tian Jian Wang commented on SPARK-8547: --- Venkata, I have already started on this as a pet project. If you haven't started yet, can I try? > xgboost exploration > --- > > Key: SPARK-8547 > URL: https://issues.apache.org/jira/browse/SPARK-8547 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Joseph K. Bradley > > There has been quite a bit of excitement around xgboost: > [https://github.com/dmlc/xgboost] > It improves the parallelism of boosting by mixing boosting and bagging (where > bagging makes the algorithm more parallel). > It would be worth exploring implementing this within MLlib (probably as a new > algorithm).
[jira] [Created] (SPARK-10657) Remove legacy SSH-based Jenkins log archiving code
Josh Rosen created SPARK-10657: -- Summary: Remove legacy SSH-based Jenkins log archiving code Key: SPARK-10657 URL: https://issues.apache.org/jira/browse/SPARK-10657 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Josh Rosen Assignee: Josh Rosen As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SSH-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public viewing of them. We should remove the legacy log syncing code, since this is a blocker to disabling Worker -> Master SSH on Jenkins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10658) Could pyspark provide addJars() as scala spark API?
[ https://issues.apache.org/jira/browse/SPARK-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ryanchou updated SPARK-10658: - Description: My spark program was written by pyspark API , and it has used the spark-csv jar library. I could submit the task by spark-submit, and add `--jars` arguments for using spark-csv jar library as following commands: ``` /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar xxx.py ``` However I need to run my unittests like: ``` py.test -vvs test_xxx.py ``` It could't add jars by adding '--jars' arugment. Therefore I tried to use the SparkContext.addPyFile() API to add jars in my test_xxx.py. Because I saw the addPyFile()'s doc mention me PACKAGES_EXTENSIONS = (.zip, .py, .jar). Does it mean that I could add *.jar (jar libraries) by using the addPyFile()? The codes which using addPyFile() to add jars as below: ``` self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar")) sqlContext = SQLContext(self.sc) self.dataframe = sqlContext.load( source="com.databricks.spark.csv", header="true", path="xxx.csv" ) ``` While it doesn't work. sqlContext cannot load the source(com.databricks.spark.csv) Eventually I have found another way to set the enviroment variable SPARK_CLASSPATH for loading jars libraries ``` SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py ``` It could load the jars libraries and sqlContext could load source succeed as well as adding `--jar xxx1.jar` arguments For the situation on using third party jars (.py & .egg could work well by using addPyFile()) in pyspark-written scripts. and it cannot use `--jars` on the situation (py.test -vvs test_xxx.py). Have you ever planed to provide an API such as addJars() in scala for adding jars to spark program, or was there a better way to add jars I still havent found it yet? was: My spark program was written by pyspark API , and it has used the spark-csv jar library. 
I could submit the task by spark-submit, and add `--jars` arguments for using spark-csv jar library as following commands: ``` /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar xxx.py ``` However I need to run my unittests like: ``` py.test -vvs test_xxx.py ``` It could't add jars by adding '--jars' arugment. Therefore I tried to use the SparkContext.addPyFile() API to add jars in my test_xxx.py. Because I saw the addPyFile()'s doc mention me PACKAGES_EXTENSIONS = (.zip, .py, .jar). Does it mean that I could add *.jar (jar libraries) by using the addPyFile()? The codes which use addPyFile() to add jars as below: ``` self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar")) sqlContext = SQLContext(self.sc) self.dataframe = sqlContext.load( source="com.databricks.spark.csv", header="true", path="xxx.csv" ) ``` While it doesn't work. sqlContext cannot load the source(com.databricks.spark.csv) Eventually I have found another way to set the enviroment variable SPARK_CLASSPATH for loading jars libraries ``` SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py ``` It could load the jars libraries and sqlContext could load source succeed as well as adding `--jar xxx1.jar` arguments For the siuation on using third party jars (.py & .egg could work well by using addPyFile()) in pyspark-written scripts. and it cannot use `--jars` on the situation (py.test -vvs test_xxx.py). have you ever planed to provide an API such as addJars() in scala for adding jars to spark program, or was there a better way to add jars I still havent found it yet? > Could pyspark provide addJars() as scala spark API? 
> > > Key: SPARK-10658 > URL: https://issues.apache.org/jira/browse/SPARK-10658 > Project: Spark > Issue Type: Wish > Components: PySpark >Affects Versions: 1.3.1 > Environment: Linux ubuntu 14.01 LTS >Reporter: ryanchou > Labels: features > Original Estimate: 48h > Remaining Estimate: 48h > > My spark program was written by pyspark API , and it has used the spark-csv > jar library. > I could submit the task by spark-submit, and add `--jars` arguments for using > spark-csv jar library as following commands: > ``` > /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar xxx.py > ``` > However I need to run my unittests like: > ``` > py.test -vvs test_xxx.py > ``` > It could't add jars by adding '--jars' arugment. > Therefore I tried to use the SparkContext.addPyFile() API to add jars in my > test_xxx.py. > Because I saw the addPyFile()'s doc mention me PACKAGES_EXTENSIONS = (.zip, > .py,
[jira] [Resolved] (SPARK-10504) aggregate where NULL is defined as the value expression aborts when SUM used
[ https://issues.apache.org/jira/browse/SPARK-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10504. -- Resolution: Fixed Assignee: Yin Huai Fix Version/s: 1.5.0 > aggregate where NULL is defined as the value expression aborts when SUM used > > > Key: SPARK-10504 > URL: https://issues.apache.org/jira/browse/SPARK-10504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.1 >Reporter: N Campbell >Assignee: Yin Huai >Priority: Minor > Fix For: 1.5.0 > > > In ISO-SQL the context would determine an implicit type for NULL or one might > find that a vendor requires an explicit type via CAST ( NULL as INTEGER). It > appears that SPARK presumes a long type i.e. select min(NULL), max(NULL) but > a query such the following aborts. > > {{select sum ( null ) from tversion}} > {code} > Operation: execute > Errors: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 5232.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 5232.0 (TID 18531, sandbox.hortonworks.com): scala.MatchError: NullType (of > class org.apache.spark.sql.types.NullType$) > at > org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:403) > at > org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:422) > at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:422) > at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51) > at > org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:119) > at > org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51) > at > org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:82) > at > org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:581) > at > 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:133) > at > org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126) > at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) > at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:64) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
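For reference, the ISO-SQL aggregate semantics the report appeals to can be sketched in plain Python. This illustrates the expected behaviour once NULL has been given a concrete type (e.g. via CAST(NULL AS INTEGER)); it is not Spark's implementation:

```python
def sql_sum(values):
    """SUM with ISO-SQL null semantics: NULL (None) inputs are skipped,
    and the sum over an empty or all-NULL input is NULL, never 0."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None

# select sum(cast(null as integer)) from tversion  ->  NULL
assert sql_sum([None, None]) is None
# NULLs are ignored when non-NULL values exist
assert sql_sum([1, None, 2]) == 3
```

So the query should return NULL rather than abort; the scala.MatchError above shows the NullType literal reaching the Cast machinery before any such implicit typing happened.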
[jira] [Resolved] (SPARK-2496) Compression streams should write its codec info to the stream
[ https://issues.apache.org/jira/browse/SPARK-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2496. --- Resolution: Incomplete Resolving as "Incomplete"; if we still want to do this then we should wait until we have a specific concrete use-case / list of things that need to be changed. > Compression streams should write its codec info to the stream > - > > Key: SPARK-2496 > URL: https://issues.apache.org/jira/browse/SPARK-2496 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Critical > > Spark sometime store compressed data outside of Spark (e.g. event logs, > blocks in tachyon), and those data are read back directly using the codec > configured by the user. When the codec differs between runs, Spark wouldn't > be able to read the codec back. > I'm not sure what the best strategy here is yet. If we write the codec > identifier for all streams, then we will be writing a lot of identifiers for > shuffle blocks. One possibility is to only write it for blocks that will be > shared across different Spark instances (i.e. managed outside of Spark), > which includes tachyon blocks and event log blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
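The proposal amounts to making each block self-describing: write a codec identifier ahead of the compressed payload so the reader never depends on its own configured codec. A minimal sketch with invented one-byte identifiers (Spark defines no such on-disk format; zlib stands in for Spark's codecs):

```python
import zlib

# Hypothetical codec registry: id -> (name, compress, decompress).
CODECS = {1: ("zlib", zlib.compress, zlib.decompress)}
CODEC_IDS = {name: cid for cid, (name, _, _) in CODECS.items()}

def write_block(payload, codec="zlib"):
    """Prefix the compressed payload with its codec id so a reader
    configured with a different default codec can still decode it."""
    cid = CODEC_IDS[codec]
    _, compress, _ = CODECS[cid]
    return bytes([cid]) + compress(payload)

def read_block(data):
    """Decode using the codec named in the stream header, not config."""
    _, _, decompress = CODECS[data[0]]
    return decompress(data[1:])
```

The one-byte-per-block overhead is what the issue weighs against: negligible for event logs and externally-managed blocks, but multiplied across millions of shuffle blocks.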
[jira] [Resolved] (SPARK-2302) master should discard exceeded completedDrivers
[ https://issues.apache.org/jira/browse/SPARK-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2302. --- Resolution: Fixed Assignee: Lianhui Wang Fix Version/s: 1.1.0 > master should discard exceeded completedDrivers > > > Key: SPARK-2302 > URL: https://issues.apache.org/jira/browse/SPARK-2302 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Lianhui Wang >Assignee: Lianhui Wang > Fix For: 1.1.0 > > > When completedDrivers number exceeds the threshold, the first > Max(spark.deploy.retainedDrivers, 1) will be discarded. > see PR: > https://github.com/apache/spark/pull/1114 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790924#comment-14790924 ] Josh Rosen commented on SPARK-4216: --- We now have a bot which deletes the duplicated posts after they're posted. This de-clutters the thread for folks who read it on GitHub. > Eliminate duplicate Jenkins GitHub posts from AMPLab > > > Key: SPARK-4216 > URL: https://issues.apache.org/jira/browse/SPARK-4216 > Project: Spark > Issue Type: Bug > Components: Build, Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > * [Real Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873361] > * [Imposter Jenkins | > https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9715) Store numFeatures in all ML PredictionModel types
[ https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9715: - Shepherd: Joseph K. Bradley Assignee: Seth Hendrickson > Store numFeatures in all ML PredictionModel types > - > > Key: SPARK-9715 > URL: https://issues.apache.org/jira/browse/SPARK-9715 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > > The PredictionModel abstraction should store numFeatures. Currently, only > RandomForest* types do this.
[jira] [Created] (SPARK-10638) spark streaming stop gracefully keeps the spark context
Mamdouh Alramadan created SPARK-10638: - Summary: spark streaming stop gracefully keeps the spark context Key: SPARK-10638 URL: https://issues.apache.org/jira/browse/SPARK-10638 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Mamdouh Alramadan With Spark 1.4 on a Mesos cluster, I am trying to stop the context with a graceful shutdown. I have seen the mailing list thread that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented. However, even after including it, the streaming job still stops correctly but the process doesn't die afterwards, i.e. the SparkContext is still running. My Mesos UI still sees the framework, which is still allocating all the cores needed. The code used for the shutdown hook is: `sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } ` The logs for this process are: ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 
INFO JobGenerator: Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` And in my spark-defaults.conf I included `spark.streaming.stopGracefullyOnShutdown true`
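The symptom here (streaming and the JobScheduler stop cleanly, but the JVM never exits) is typically caused by non-daemon threads that outlive the shutdown hook. A hedged sketch of one workaround pattern, assuming `streamingContext` is the running StreamingContext; this is not a confirmed fix for this ticket, just the shape of a defensive hook:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Sketch (assumption: `streamingContext` is the app's StreamingContext).
sys.ShutdownHookThread {
  // Graceful stop: drain received blocks, finish in-flight batches,
  // then tear down the SparkContext as well.
  streamingContext.stop(stopSparkContext = true, stopGracefully = true)

  // Watchdog: if some non-daemon thread still pins the JVM after the
  // graceful stop completes, force the process down after a grace period.
  // Runtime.halt (not System.exit) is used because calling exit from
  // inside a shutdown hook can deadlock.
  val watchdog = Executors.newSingleThreadScheduledExecutor()
  watchdog.schedule(new Runnable {
    def run(): Unit = Runtime.getRuntime.halt(0)
  }, 30, TimeUnit.SECONDS)
}
```

The 30-second grace period is arbitrary; in practice it should exceed the batch interval plus checkpointing time so a genuinely graceful stop is never cut short.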
[jira] [Created] (SPARK-10639) Need to convert UDAF's result from scala to sql type
Yin Huai created SPARK-10639: Summary: Need to convert UDAF's result from scala to sql type Key: SPARK-10639 URL: https://issues.apache.org/jira/browse/SPARK-10639 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Blocker We are missing a conversion at https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala#L427.
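The missing conversion amounts to running the UDAF's Scala-level return value through a Catalyst converter for the declared result type before handing it to the rest of the query. A sketch of the idea; `udaf` and `buffer` are stand-ins for whatever is in scope at the linked line in udaf.scala, so treat the exact call site as illustrative rather than the actual patch:

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters

// Before: the raw Scala value (e.g. a Seq or a case class) leaks into
// the internal row. After: convert it to Spark SQL's internal format
// using a converter derived from the UDAF's declared result dataType.
val toCatalyst = CatalystTypeConverters.createToCatalystConverter(udaf.dataType)
val internalResult = toCatalyst(udaf.evaluate(buffer))
```

Building the converter once per aggregate (rather than per row) keeps the fix cheap on the hot path.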
[jira] [Updated] (SPARK-10638) spark streaming stop gracefully keeps the spark context
[ https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mamdouh Alramadan updated SPARK-10638: -- Description: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: 
Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` {code} And in my spark-defaults.conf I included {code:title=spark-defaults.conf|borderStyle=solid} spark.streaming.stopGracefullyOnShutdowntrue {code} was: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. 
My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming
[jira] [Updated] (SPARK-10638) spark streaming stop gracefully keeps the spark context
[ https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mamdouh Alramadan updated SPARK-10638: -- Description: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: {code:title=Start.scala|borderStyle=solid} sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } {code} The logs are for this process are: {code:title=SparkLogs|borderStyle=solid} ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer 15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms 15/09/16 16:38:00 INFO JobGenerator: 
Checkpointing graph for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 144242148 ms 15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at StreamDigest.scala:21 15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at StreamDigest.scala:21) with 1 output partitions (allowLocal=true) 15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at StreamDigest.scala:21) 15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List() 15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 144242148 ms to file 'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148' 15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List() . . . . 15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written 15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? true, waited for 1 ms. 15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator 15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler ``` {code} And in my spark-defaults.conf I included {code:title=spark-defaults.conf|borderStyle=solid} `spark.streaming.stopGracefullyOnShutdowntrue` {code} was: With spark 1.4 on Mesos cluster, I am trying to stop the context with graceful shutdown, I have seen this mailing list that [~tdas] addressed http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E which introduces a new config that was not documented, however, even with including it, the streaming job still stops correctly but the process doesn't die after all e.g. the Spark Context still running. 
My Mesos UI still sees the framework which is still allocating all the cores needed the code used for the shutdown hook is: `sys.ShutdownHookThread { logInfo("Received SIGTERM, calling streaming stop") streamingContext.stop(stopSparkContext = true, stopGracefully = true) logInfo("Application Stopped") } ` The logs are for this process are: ``` 5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop 15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully 15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be consumed for job generation 15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after time 144242148 15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 ms.0 from job set of time 144242148 ms 15/09/16 16:38:00 INFO
[jira] [Created] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
Thomas Graves created SPARK-10640: - Summary: Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied Key: SPARK-10640 URL: https://issues.apache.org/jira/browse/SPARK-10640 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: Thomas Graves I'm seeing an exception from the spark history server trying to read a history file: scala.MatchError: TaskCommitDenied (of class java.lang.String) at org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) at org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
[ https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790777#comment-14790777 ] Thomas Graves commented on SPARK-10640: --- Looks like JsonProtocol.taskEndReasonFromJson is just missing the TaskCommitDenied TaskEndReason. Seems like we should have a better way of handling this in the future too. Right now it doesn't load the history file at all, which is pretty annoying. Perhaps have a catch-all that skips the entry or prints "unknown" or something so the file at least loads. > Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied > -- > > Key: SPARK-10640 > URL: https://issues.apache.org/jira/browse/SPARK-10640 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves > > I'm seeing an exception from the spark history server trying to read a > history file: > scala.MatchError: TaskCommitDenied (of class java.lang.String) > at > org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775) > at > org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531) > at > org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745)
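The catch-all suggested in the comment could take this shape. The sketch below uses plain strings instead of Spark's real `TaskEndReason` hierarchy, purely to illustrate the pattern: an explicit case for the tag missing in 1.5.0, plus a default that keeps history-file replay from aborting on any future unknown tag.

```scala
// Illustrative only; the real JsonProtocol matches on the "Reason" field
// of the task-end JSON and currently throws scala.MatchError when the
// tag is unrecognized, which kills the whole replay.
def taskEndReasonFromJson(reason: String): String = reason match {
  case "Success"          => "Success"
  case "TaskKilled"       => "TaskKilled"
  case "TaskCommitDenied" => "TaskCommitDenied" // the case missing in 1.5.0
  case other              => s"UnknownReason($other)" // catch-all: degrade, don't crash
}

assert(taskEndReasonFromJson("TaskCommitDenied") == "TaskCommitDenied")
assert(taskEndReasonFromJson("SomethingNew") == "UnknownReason(SomethingNew)")
```

With a default case like this, an unparseable task-end reason costs one degraded entry instead of an unloadable history file.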