[jira] [Updated] (SPARK-10657) Remove legacy SCP-based Jenkins log archiving code

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10657:
---
Summary: Remove legacy SCP-based Jenkins log archiving code  (was: Remove 
legacy SSH-based Jenkins log archiving code)

> Remove legacy SCP-based Jenkins log archiving code
> --
>
> Key: SPARK-10657
> URL: https://issues.apache.org/jira/browse/SPARK-10657
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to 
> use our custom SSH-based mechanism for archiving Jenkins logs on the master 
> machine; this has been superseded by the use of a Jenkins plugin which 
> archives the logs and provides public viewing of them.
> We should remove the legacy log syncing code, since this is a blocker to 
> disabling Worker -> Master SSH on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10657) Remove legacy SCP-based Jenkins log archiving code

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10657:
---
Description: 
As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to 
use our custom SCP-based mechanism for archiving Jenkins logs on the master 
machine; this has been superseded by the use of a Jenkins plugin which archives 
the logs and provides public viewing of them.

We should remove the legacy log syncing code, since this is a blocker to 
disabling Worker -> Master SSH on Jenkins.

  was:
As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to 
use our custom SSH-based mechanism for archiving Jenkins logs on the master 
machine; this has been superseded by the use of a Jenkins plugin which archives 
the logs and provides public viewing of them.

We should remove the legacy log syncing code, since this is a blocker to 
disabling Worker -> Master SSH on Jenkins.


> Remove legacy SCP-based Jenkins log archiving code
> --
>
> Key: SPARK-10657
> URL: https://issues.apache.org/jira/browse/SPARK-10657
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to 
> use our custom SCP-based mechanism for archiving Jenkins logs on the master 
> machine; this has been superseded by the use of a Jenkins plugin which 
> archives the logs and provides public viewing of them.
> We should remove the legacy log syncing code, since this is a blocker to 
> disabling Worker -> Master SSH on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10657) Remove legacy SCP-based Jenkins log archiving code

2015-09-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10657:


Assignee: Josh Rosen  (was: Apache Spark)

> Remove legacy SCP-based Jenkins log archiving code
> --
>
> Key: SPARK-10657
> URL: https://issues.apache.org/jira/browse/SPARK-10657
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to 
> use our custom SCP-based mechanism for archiving Jenkins logs on the master 
> machine; this has been superseded by the use of a Jenkins plugin which 
> archives the logs and provides public viewing of them.
> We should remove the legacy log syncing code, since this is a blocker to 
> disabling Worker -> Master SSH on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10658) Could pyspark provide addJars() as scala spark API?

2015-09-16 Thread ryanchou (JIRA)
ryanchou created SPARK-10658:


 Summary: Could pyspark provide addJars() as scala spark API? 
 Key: SPARK-10658
 URL: https://issues.apache.org/jira/browse/SPARK-10658
 Project: Spark
  Issue Type: Wish
  Components: PySpark
Affects Versions: 1.3.1
 Environment: Linux ubuntu 14.01 LTS
Reporter: ryanchou


My Spark program is written with the PySpark API, and it uses the spark-csv jar 
library. 

I can submit the job with spark-submit and add the `--jars` argument to use the 
spark-csv jar library, as in the following command:

```
/bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
```

However, I need to run my unit tests like:

```
py.test -vvs test_xxx.py
```

which cannot add jars via a '--jars' argument.

Therefore I tried to use the SparkContext.addPyFile() API to add jars in my 
test_xxx.py, because the addPyFile() docs mention PACKAGES_EXTENSIONS = (.zip, 
.py, .jar). 

Does that mean I can add *.jar files (jar libraries) using addPyFile()?

The code that uses addPyFile() to add jars is below: 

```
self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
sqlContext = SQLContext(self.sc)
self.dataframe = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="xxx.csv"
)
```

But it doesn't work: sqlContext cannot load the source 
(com.databricks.spark.csv).

Eventually I found another way: setting the environment variable 
SPARK_CLASSPATH to load the jar libraries:

```
SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
```

With this, the jar libraries are loaded and sqlContext can load the source 
successfully, just as with the `--jars xxx1.jar` argument.


So the situation is: third-party jars are needed in PySpark-written scripts 
(.py and .egg files work well via addPyFile()), but `--jars` cannot be used in 
this situation (py.test -vvs test_xxx.py).

Have you ever planned to provide an API such as addJars(), as in the Scala API, 
for adding jars to a Spark program? Or is there a better way to add jars that I 
haven't found yet?
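
For reference, a minimal sketch of the Scala-side call this request points at 
(app name and jar path are placeholders): SparkContext.addJar (note: singular) 
makes a jar available to tasks at runtime, roughly what `--jars` does for 
spark-submit.

```
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch, assuming a local test context; the jar path is a placeholder.
val sc = new SparkContext(
  new SparkConf().setAppName("addJar-demo").setMaster("local[*]"))
sc.addJar("/path/spark-csv_2.10-1.1.0.jar")  // analogous to `spark-submit --jars`
```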





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10658) Could pyspark provide addJars() as scala spark API?

2015-09-16 Thread ryanchou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ryanchou updated SPARK-10658:
-
Description: 
My Spark program is written with the PySpark API, and it uses the spark-csv jar 
library. 

I can submit the job with spark-submit and add the `--jars` argument to use the 
spark-csv jar library, as in the following command:

```
/bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
```

However, I need to run my unit tests like:

```
py.test -vvs test_xxx.py
```

which cannot add jars via a '--jars' argument.

Therefore I tried to use the SparkContext.addPyFile() API to add jars in my 
test_xxx.py, because the addPyFile() docs mention PACKAGES_EXTENSIONS = (.zip, 
.py, .jar). 

Does that mean I can add *.jar files (jar libraries) using addPyFile()?

The code that uses addPyFile() to add jars is below: 

```
self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
sqlContext = SQLContext(self.sc)
self.dataframe = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="xxx.csv"
)
```

But it doesn't work: sqlContext cannot load the source 
(com.databricks.spark.csv).

Eventually I found another way: setting the environment variable 
SPARK_CLASSPATH to load the jar libraries:

```
SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
```

With this, the jar libraries are loaded and sqlContext can load the source 
successfully, just as with the `--jars xxx1.jar` argument.

So the situation is: third-party jars are needed in PySpark-written scripts 
(.py and .egg files work well via addPyFile()), but `--jars` cannot be used in 
this situation (py.test -vvs test_xxx.py).

Have you ever planned to provide an API such as addJars(), as in the Scala API, 
for adding jars to a Spark program? Or is there a better way to add jars that I 
haven't found yet?

If someone wants to add jars in PySpark-written scripts without using '--jars', 
could you give us some suggestions on how to do it?


  was:
My Spark program is written with the PySpark API, and it uses the spark-csv jar 
library. 

I can submit the job with spark-submit and add the `--jars` argument to use the 
spark-csv jar library, as in the following command:

```
/bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
```

However, I need to run my unit tests like:

```
py.test -vvs test_xxx.py
```

which cannot add jars via a '--jars' argument.

Therefore I tried to use the SparkContext.addPyFile() API to add jars in my 
test_xxx.py, because the addPyFile() docs mention PACKAGES_EXTENSIONS = (.zip, 
.py, .jar). 

Does that mean I can add *.jar files (jar libraries) using addPyFile()?

The code that uses addPyFile() to add jars is below: 

```
self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
sqlContext = SQLContext(self.sc)
self.dataframe = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="xxx.csv"
)
```

But it doesn't work: sqlContext cannot load the source 
(com.databricks.spark.csv).

Eventually I found another way: setting the environment variable 
SPARK_CLASSPATH to load the jar libraries:

```
SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
```

With this, the jar libraries are loaded and sqlContext can load the source 
successfully, just as with the `--jars xxx1.jar` argument.

So the situation is: third-party jars are needed in PySpark-written scripts 
(.py and .egg files work well via addPyFile()), but `--jars` cannot be used in 
this situation (py.test -vvs test_xxx.py).

Have you ever planned to provide an API such as addJars(), as in the Scala API, 
for adding jars to a Spark program? Or is there a better way to add jars that I 
haven't found yet?




> Could pyspark provide addJars() as scala spark API? 
> 
>
> Key: SPARK-10658
> URL: https://issues.apache.org/jira/browse/SPARK-10658
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.3.1
> Environment: Linux ubuntu 14.01 LTS
>Reporter: ryanchou
>  Labels: features
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> My Spark program is written with the PySpark API, and it uses the spark-csv 
> jar library. 
> I can submit the job with spark-submit and add the `--jars` argument to use 
> the spark-csv jar library, as in the following command:
> ```
> /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
> ```
> However, I need to run my unit tests like:
> ```
> py.test -vvs test_xxx.py
> ```
> which cannot add jars via a '--jars' argument.
> Therefore I tried to use the SparkContext.addPyFile() API 

[jira] [Commented] (SPARK-10606) Cube/Rollup/GrpSet doesn't create the correct plan when group by is on something other than an AttributeReference

2015-09-16 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791499#comment-14791499
 ] 

Cheng Hao commented on SPARK-10606:
---

[~rhbutani] Which version are you using? I actually fixed this bug in 
SPARK-8972, and the fix should be included in 1.5. Can you try with 1.5?

> Cube/Rollup/GrpSet doesn't create the correct plan when group by is on 
> something other than an AttributeReference
> -
>
> Key: SPARK-10606
> URL: https://issues.apache.org/jira/browse/SPARK-10606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Harish Butani
>Priority: Critical
>
> Consider the following table: t(a : String, b : String) and the query
> {code}
> select a, concat(b, '1'), count(*)
> from t
> group by a, concat(b, '1') with cube
> {code}
> The projections in the Expand operator are not set up correctly. The expand 
> logic in Analyzer:expand compares grouping expressions against child.output, 
> so {{concat(b, '1')}} is never mapped to a null Literal.
> A simple fix is to add a Rule that introduces a Projection below the 
> Cube/Rollup/GrpSet operator which additionally projects the 
> groupingExpressions that are missing in the child.
> Marking this as Critical, because you get wrong results.
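
For versions without the fix, a hedged workaround sketch (not from this report; 
assumes a HiveContext bound to {{sqlContext}} and the same table {{t}}): 
pre-project the expression under an alias so the cube groups on a plain 
attribute reference.

{code}
// Alias concat(b, '1') in a subquery so the grouping expression reaching the
// cube is a simple attribute reference rather than a complex expression.
val cubed = sqlContext.sql("""
  select a, b1, count(*)
  from (select a, concat(b, '1') as b1 from t) tmp
  group by a, b1 with cube
""")
cubed.show()
{code}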



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10659) SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-16 Thread Vladimir Picka (JIRA)
Vladimir Picka created SPARK-10659:
--

 Summary: SparkSQL saveAsParquetFile does not preserve REQUIRED 
(not nullable) flag in schema
 Key: SPARK-10659
 URL: https://issues.apache.org/jira/browse/SPARK-10659
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0
Reporter: Vladimir Picka


DataFrames currently automatically promote all Parquet schema fields to 
optional when they are written to an empty directory. The problem remains in 
v1.5.0.

The culprit is this code:

val relation = if (doInsertion) {
  // This is a hack. We always set nullable/containsNull/valueContainsNull to true
  // for the schema of a parquet data.
  val df =
    sqlContext.createDataFrame(
      data.queryExecution.toRdd,
      data.schema.asNullable)
  val createdRelation =
    createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
  createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
  createdRelation
}

which was implemented as part of this PR:
https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b

This is very unexpected behaviour for some use cases in which files are read 
from one place and written to another, such as small-file packing: it ends up 
with incompatible files, because "required" can't normally be promoted to 
"optional". It is the essence of a schema that it enforces a "required" 
invariant on the data; it should be assumed that it is intended.

I believe a better approach is to keep the schema as is by default and to 
provide, e.g., a builder method or option that allows forcing fields to 
optional.

Right now we have to override private API so that our files are rewritten as 
is, with all the perils that entails.

Vladimir
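
A minimal sketch of the round trip in question (paths are placeholders; assumes 
an existing SparkContext {{sc}}, a SQLContext {{sqlContext}}, and the 1.4+ 
DataFrame reader/writer API):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Declare a REQUIRED (non-nullable) field, write it out, and read it back.
val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row(1), Row(2))), schema)

df.write.parquet("/tmp/required-demo")
val reread = sqlContext.read.parquet("/tmp/required-demo")
reread.schema.fields.foreach(f => println(s"${f.name} nullable=${f.nullable}"))
// Per this report: prints "id nullable=true", i.e. the REQUIRED flag is lost.
{code}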




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-16 Thread Vladimir Picka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Picka updated SPARK-10659:
---
Summary: DataFrames and SparkSQL saveAsParquetFile does not preserve 
REQUIRED (not nullable) flag in schema  (was: SparkSQL saveAsParquetFile does 
not preserve REQUIRED (not nullable) flag in schema)

> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --
>
> Key: SPARK-10659
> URL: https://issues.apache.org/jira/browse/SPARK-10659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>Reporter: Vladimir Picka
>
> DataFrames currently automatically promote all Parquet schema fields to 
> optional when they are written to an empty directory. The problem remains in 
> v1.5.0.
> The culprit is this code:
> val relation = if (doInsertion) {
>   // This is a hack. We always set nullable/containsNull/valueContainsNull
>   // to true for the schema of a parquet data.
>   val df =
>     sqlContext.createDataFrame(
>       data.queryExecution.toRdd,
>       data.schema.asNullable)
>   val createdRelation =
>     createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases in which files are read 
> from one place and written to another, such as small-file packing: it ends up 
> with incompatible files, because "required" can't normally be promoted to 
> "optional". It is the essence of a schema that it enforces a "required" 
> invariant on the data; it should be assumed that it is intended.
> I believe a better approach is to keep the schema as is by default and to 
> provide, e.g., a builder method or option that allows forcing fields to 
> optional.
> Right now we have to override private API so that our files are rewritten as 
> is, with all the perils that entails.
> Vladimir



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10659) DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not nullable) flag in schema

2015-09-16 Thread Vladimir Picka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791613#comment-14791613
 ] 

Vladimir Picka commented on SPARK-10659:


Here is an unanswered attempt at discussion on the mailing list:
https://mail.google.com/mail/#search/label%3Aspark-user+petr/14f64c75c15f5ccd

> DataFrames and SparkSQL saveAsParquetFile does not preserve REQUIRED (not 
> nullable) flag in schema
> --
>
> Key: SPARK-10659
> URL: https://issues.apache.org/jira/browse/SPARK-10659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0
>Reporter: Vladimir Picka
>
> DataFrames currently automatically promote all Parquet schema fields to 
> optional when they are written to an empty directory. The problem remains in 
> v1.5.0.
> The culprit is this code:
> val relation = if (doInsertion) {
>   // This is a hack. We always set nullable/containsNull/valueContainsNull
>   // to true for the schema of a parquet data.
>   val df =
>     sqlContext.createDataFrame(
>       data.queryExecution.toRdd,
>       data.schema.asNullable)
>   val createdRelation =
>     createRelation(sqlContext, parameters, df.schema).asInstanceOf[ParquetRelation2]
>   createdRelation.insert(df, overwrite = mode == SaveMode.Overwrite)
>   createdRelation
> }
> which was implemented as part of this PR:
> https://github.com/apache/spark/commit/1b490e91fd6b5d06d9caeb50e597639ccfc0bc3b
> This is very unexpected behaviour for some use cases in which files are read 
> from one place and written to another, such as small-file packing: it ends up 
> with incompatible files, because "required" can't normally be promoted to 
> "optional". It is the essence of a schema that it enforces a "required" 
> invariant on the data; it should be assumed that it is intended.
> I believe a better approach is to keep the schema as is by default and to 
> provide, e.g., a builder method or option that allows forcing fields to 
> optional.
> Right now we have to override private API so that our files are rewritten as 
> is, with all the perils that entails.
> Vladimir



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8547) xgboost exploration

2015-09-16 Thread Tian Jian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791472#comment-14791472
 ] 

Tian Jian Wang commented on SPARK-8547:
---

Venkata, I started on this as a pet project a while back. If you haven't 
started yet, can I try?

> xgboost exploration
> ---
>
> Key: SPARK-8547
> URL: https://issues.apache.org/jira/browse/SPARK-8547
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>
> There has been quite a bit of excitement around xgboost: 
> [https://github.com/dmlc/xgboost]
> It improves the parallelism of boosting by mixing boosting and bagging (where 
> bagging makes the algorithm more parallel).
> It would be worth exploring an implementation of this within MLlib (probably 
> as a new algorithm).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10657) Remove legacy SSH-based Jenkins log archiving code

2015-09-16 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-10657:
--

 Summary: Remove legacy SSH-based Jenkins log archiving code
 Key: SPARK-10657
 URL: https://issues.apache.org/jira/browse/SPARK-10657
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Josh Rosen
Assignee: Josh Rosen


As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to 
use our custom SSH-based mechanism for archiving Jenkins logs on the master 
machine; this has been superseded by the use of a Jenkins plugin which archives 
the logs and provides public viewing of them.

We should remove the legacy log syncing code, since this is a blocker to 
disabling Worker -> Master SSH on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10658) Could pyspark provide addJars() as scala spark API?

2015-09-16 Thread ryanchou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ryanchou updated SPARK-10658:
-
Description: 
My Spark program is written with the PySpark API, and it uses the spark-csv jar 
library. 

I can submit the job with spark-submit and add the `--jars` argument to use the 
spark-csv jar library, as in the following command:

```
/bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
```

However, I need to run my unit tests like:

```
py.test -vvs test_xxx.py
```

which cannot add jars via a '--jars' argument.

Therefore I tried to use the SparkContext.addPyFile() API to add jars in my 
test_xxx.py, because the addPyFile() docs mention PACKAGES_EXTENSIONS = (.zip, 
.py, .jar). 

Does that mean I can add *.jar files (jar libraries) using addPyFile()?

The code that uses addPyFile() to add jars is below: 

```
self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
sqlContext = SQLContext(self.sc)
self.dataframe = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="xxx.csv"
)
```

But it doesn't work: sqlContext cannot load the source 
(com.databricks.spark.csv).

Eventually I found another way: setting the environment variable 
SPARK_CLASSPATH to load the jar libraries:

```
SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
```

With this, the jar libraries are loaded and sqlContext can load the source 
successfully, just as with the `--jars xxx1.jar` argument.

So the situation is: third-party jars are needed in PySpark-written scripts 
(.py and .egg files work well via addPyFile()), but `--jars` cannot be used in 
this situation (py.test -vvs test_xxx.py).

Have you ever planned to provide an API such as addJars(), as in the Scala API, 
for adding jars to a Spark program? Or is there a better way to add jars that I 
haven't found yet?



  was:
My Spark program is written with the PySpark API, and it uses the spark-csv jar 
library. 

I can submit the job with spark-submit and add the `--jars` argument to use the 
spark-csv jar library, as in the following command:

```
/bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
```

However, I need to run my unit tests like:

```
py.test -vvs test_xxx.py
```

which cannot add jars via a '--jars' argument.

Therefore I tried to use the SparkContext.addPyFile() API to add jars in my 
test_xxx.py, because the addPyFile() docs mention PACKAGES_EXTENSIONS = (.zip, 
.py, .jar). 

Does that mean I can add *.jar files (jar libraries) using addPyFile()?

The code that uses addPyFile() to add jars is below: 

```
self.sc.addPyFile(join(lib_path, "spark-csv_2.10-1.1.0.jar"))
sqlContext = SQLContext(self.sc)
self.dataframe = sqlContext.load(
    source="com.databricks.spark.csv",
    header="true",
    path="xxx.csv"
)
```

But it doesn't work: sqlContext cannot load the source 
(com.databricks.spark.csv).

Eventually I found another way: setting the environment variable 
SPARK_CLASSPATH to load the jar libraries:

```
SPARK_CLASSPATH="/path/xxx.jar:/path/xxx2.jar" py.test -vvs test_xxx.py
```

With this, the jar libraries are loaded and sqlContext can load the source 
successfully, just as with the `--jars xxx1.jar` argument.


So the situation is: third-party jars are needed in PySpark-written scripts 
(.py and .egg files work well via addPyFile()), but `--jars` cannot be used in 
this situation (py.test -vvs test_xxx.py).

Have you ever planned to provide an API such as addJars(), as in the Scala API, 
for adding jars to a Spark program? Or is there a better way to add jars that I 
haven't found yet?




> Could pyspark provide addJars() as scala spark API? 
> 
>
> Key: SPARK-10658
> URL: https://issues.apache.org/jira/browse/SPARK-10658
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.3.1
> Environment: Linux ubuntu 14.01 LTS
>Reporter: ryanchou
>  Labels: features
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> My Spark program is written with the PySpark API, and it uses the spark-csv 
> jar library. 
> I can submit the job with spark-submit and add the `--jars` argument to use 
> the spark-csv jar library, as in the following command:
> ```
> /bin/spark-submit --jars /path/spark-csv_2.10-1.1.0.jar  xxx.py
> ```
> However, I need to run my unit tests like:
> ```
> py.test -vvs test_xxx.py
> ```
> which cannot add jars via a '--jars' argument.
> Therefore I tried to use the SparkContext.addPyFile() API to add jars in my 
> test_xxx.py, because the addPyFile() docs mention PACKAGES_EXTENSIONS = (.zip, 
> .py, 

[jira] [Resolved] (SPARK-10504) aggregate where NULL is defined as the value expression aborts when SUM used

2015-09-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10504.
--
   Resolution: Fixed
 Assignee: Yin Huai
Fix Version/s: 1.5.0

> aggregate where NULL is defined as the value expression aborts when SUM used
> 
>
> Key: SPARK-10504
> URL: https://issues.apache.org/jira/browse/SPARK-10504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1
>Reporter: N Campbell
>Assignee: Yin Huai
>Priority: Minor
> Fix For: 1.5.0
>
>
> In ISO-SQL the context would determine an implicit type for NULL, or one might 
> find that a vendor requires an explicit type via CAST(NULL AS INTEGER). It 
> appears that Spark presumes a long type, i.e. select min(NULL), max(NULL), but 
> a query such as the following aborts.
>  
> {{select sum ( null  )  from tversion}}
> {code}
> Operation: execute
> Errors:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 5232.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 5232.0 (TID 18531, sandbox.hortonworks.com): scala.MatchError: NullType (of 
> class org.apache.spark.sql.types.NullType$)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:403)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:422)
>   at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:422)
>   at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:426)
>   at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:119)
>   at 
> org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:51)
>   at 
> org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:82)
>   at 
> org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:581)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:133)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
>   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
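
A hedged workaround sketch, consistent with the ISO-SQL note above (the table 
name is taken from the report): give the NULL literal an explicit type so the 
aggregate never sees NullType.

{code}
// An explicitly typed NULL avoids the NullType cast path described above.
val summed = sqlContext.sql("select sum(cast(null as int)) from tversion")
summed.show()   // expected: a single NULL result rather than a MatchError
{code}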



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2496) Compression streams should write its codec info to the stream

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2496.
---
Resolution: Incomplete

Resolving as "Incomplete"; if we still want to do this then we should wait 
until we have a specific concrete use-case / list of things that need to be 
changed.

> Compression streams should write its codec info to the stream
> -
>
> Key: SPARK-2496
> URL: https://issues.apache.org/jira/browse/SPARK-2496
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark sometimes stores compressed data outside of Spark (e.g. event logs, 
> blocks in Tachyon), and that data is read back directly using the codec 
> configured by the user. When the codec differs between runs, Spark wouldn't 
> be able to read the data back. 
> I'm not sure what the best strategy here is yet. If we write the codec 
> identifier for all streams, then we will be writing a lot of identifiers for 
> shuffle blocks. One possibility is to only write it for blocks that will be 
> shared across different Spark instances (i.e. managed outside of Spark), 
> which includes tachyon blocks and event log blocks.
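
A minimal sketch of the idea (illustrative names, not Spark code): prefix the 
stream with a codec identifier so a later reader can pick the matching 
decompressor regardless of the codec currently configured.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Write a short codec name before the compressed payload...
def writeWithCodecHeader(out: DataOutputStream, codecName: String, payload: Array[Byte]): Unit = {
  out.writeUTF(codecName)   // e.g. "snappy", "lz4"
  out.write(payload)
  out.flush()
}

// ...and read it back first, so the right codec can be chosen for the payload.
def readCodecHeader(in: DataInputStream): String = in.readUTF()

val buffer = new ByteArrayOutputStream()
writeWithCodecHeader(new DataOutputStream(buffer), "lz4", Array[Byte](1, 2, 3))
val codec = readCodecHeader(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray)))
// codec == "lz4": dispatch to the matching decompression codec before reading the rest.
{code}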



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2302) master should discard exceeded completedDrivers

2015-09-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2302.
---
   Resolution: Fixed
 Assignee: Lianhui Wang
Fix Version/s: 1.1.0

> master should discard exceeded completedDrivers 
> 
>
> Key: SPARK-2302
> URL: https://issues.apache.org/jira/browse/SPARK-2302
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Lianhui Wang
>Assignee: Lianhui Wang
> Fix For: 1.1.0
>
>
> When the number of completedDrivers exceeds the threshold, the first 
> Max(spark.deploy.retainedDrivers, 1) entries will be discarded.
> see PR:
> https://github.com/apache/spark/pull/1114



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab

2015-09-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790924#comment-14790924
 ] 

Josh Rosen commented on SPARK-4216:
---

We now have a bot which deletes the duplicated posts after they're posted. This 
de-clutters the thread for folks who read it on GitHub.

> Eliminate duplicate Jenkins GitHub posts from AMPLab
> 
>
> Key: SPARK-4216
> URL: https://issues.apache.org/jira/browse/SPARK-4216
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> * [Real Jenkins | 
> https://github.com/apache/spark/pull/2988#issuecomment-60873361]
> * [Imposter Jenkins | 
> https://github.com/apache/spark/pull/2988#issuecomment-60873366]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9715) Store numFeatures in all ML PredictionModel types

2015-09-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9715:
-
Shepherd: Joseph K. Bradley
Assignee: Seth Hendrickson

> Store numFeatures in all ML PredictionModel types
> -
>
> Key: SPARK-9715
> URL: https://issues.apache.org/jira/browse/SPARK-9715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
>
> The PredictionModel abstraction should store numFeatures.  Currently, only 
> RandomForest* types do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10638) spark streaming stop gracefully keeps the spark context

2015-09-16 Thread Mamdouh Alramadan (JIRA)
Mamdouh Alramadan created SPARK-10638:
-

 Summary: spark streaming stop gracefully keeps the spark context
 Key: SPARK-10638
 URL: https://issues.apache.org/jira/browse/SPARK-10638
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Mamdouh Alramadan


With Spark 1.4 on a Mesos cluster, I am trying to stop the context with a 
graceful shutdown. I have seen this mailing list thread that [~tdas] addressed:

http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E

It introduces a new config that was not documented. However, even with it 
included, the streaming job still stops correctly but the process doesn't die 
afterwards, i.e. the SparkContext is still running. My Mesos UI still sees the 
framework, which is still allocating all the cores it needs.

The code used for the shutdown hook is:

`sys.ShutdownHookThread {
  logInfo("Received SIGTERM, calling streaming stop")
  streamingContext.stop(stopSparkContext = true, stopGracefully = true)
  logInfo("Application Stopped")
}`

The logs for this process are:
```
5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop
15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully
15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) 
from shutdown hook
15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after 
time 144242148
15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 
ms.0 from job set of time 144242148 ms
15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer
15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and 
checkpoints to be written
15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms
15/09/16 16:38:00 INFO JobGenerator: Checkpointing graph for time 144242148 
ms
15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 
144242148 ms
15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 
144242148 ms
15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at 
StreamDigest.scala:21
15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at 
StreamDigest.scala:21) with 1 output partitions (allowLocal=true)
15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at 
StreamDigest.scala:21)
15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List()
15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 
144242148 ms to file 
'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148'
15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List()
.
.
.
.
15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and 
checkpoints to be written
15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? 
true, waited for 1 ms.
15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator
15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler

```

And in my spark-defaults.conf I included:

`spark.streaming.stopGracefullyOnShutdown true`
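
For reference, a minimal sketch (assuming Spark 1.4+; the app name and batch 
interval are placeholders) of setting the same flag programmatically on the 
SparkConf that builds the StreamingContext:

```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Same flag as in spark-defaults.conf, set in code; values are placeholders.
val conf = new SparkConf()
  .setAppName("graceful-shutdown-demo")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(10))
```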



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10639) Need to convert UDAF's result from scala to sql type

2015-09-16 Thread Yin Huai (JIRA)
Yin Huai created SPARK-10639:


 Summary: Need to convert UDAF's result from scala to sql type
 Key: SPARK-10639
 URL: https://issues.apache.org/jira/browse/SPARK-10639
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Blocker


We are missing a conversion at 
https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala#L427.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10638) spark streaming stop gracefully keeps the spark context

2015-09-16 Thread Mamdouh Alramadan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mamdouh Alramadan updated SPARK-10638:
--
Description: 
With Spark 1.4 on a Mesos cluster, I am trying to stop the context with a 
graceful shutdown. I have seen this mailing list thread that [~tdas] addressed:

http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E

It introduces a new config that was not documented. However, even with it 
included, the streaming job still stops correctly but the process doesn't die 
afterwards, i.e. the SparkContext is still running. My Mesos UI still sees the 
framework, which is still allocating all the cores it needs.

The code used for the shutdown hook is:

{code:title=Start.scala|borderStyle=solid}
sys.ShutdownHookThread {
logInfo("Received SIGTERM, calling streaming stop")
streamingContext.stop(stopSparkContext = true, stopGracefully = true)
logInfo("Application Stopped")
  }

{code}

The logs for this process are:
{code:title=SparkLogs|borderStyle=solid}
```
5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop
15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully
15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) 
from shutdown hook
15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after 
time 144242148
15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 
ms.0 from job set of time 144242148 ms
15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer
15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and 
checkpoints to be written
15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms
15/09/16 16:38:00 INFO JobGenerator: Checkpointing graph for time 144242148 
ms
15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 
144242148 ms
15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 
144242148 ms
15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at 
StreamDigest.scala:21
15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at 
StreamDigest.scala:21) with 1 output partitions (allowLocal=true)
15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at 
StreamDigest.scala:21)
15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List()
15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 
144242148 ms to file 
'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148'
15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List()
.
.
.
.
15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and 
checkpoints to be written
15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? 
true, waited for 1 ms.
15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator
15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler

```
{code}
And in my spark-defaults.conf I included:
{code:title=spark-defaults.conf|borderStyle=solid}
spark.streaming.stopGracefullyOnShutdown true
{code}

  was:
With Spark 1.4 on a Mesos cluster, I am trying to stop the context with a 
graceful shutdown. I have seen this mailing list thread that [~tdas] addressed:

http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E

It introduces a new config that was not documented. However, even with it 
included, the streaming job still stops correctly but the process doesn't die 
afterwards, i.e. the SparkContext is still running. My Mesos UI still sees the 
framework, which is still allocating all the cores it needs.

The code used for the shutdown hook is:

{code:title=Start.scala|borderStyle=solid}
sys.ShutdownHookThread {
logInfo("Received SIGTERM, calling streaming stop")
streamingContext.stop(stopSparkContext = true, stopGracefully = true)
logInfo("Application Stopped")
  }

{code}

The logs for this process are:
{code:title=SparkLogs|borderStyle=solid}
```
5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop
15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully
15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) 
from shutdown hook
15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after 
time 144242148
15/09/16 16:38:00 INFO JobScheduler: Starting job streaming 

[jira] [Updated] (SPARK-10638) spark streaming stop gracefully keeps the spark context

2015-09-16 Thread Mamdouh Alramadan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mamdouh Alramadan updated SPARK-10638:
--
Description: 
With Spark 1.4 on a Mesos cluster, I am trying to stop the context with a 
graceful shutdown. I have seen this mailing list thread that [~tdas] addressed:

http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E

It introduces a new config that was not documented. However, even with it 
included, the streaming job still stops correctly but the process doesn't die 
afterwards, i.e. the SparkContext is still running. My Mesos UI still sees the 
framework, which is still allocating all the cores it needs.

The code used for the shutdown hook is:

{code:title=Start.scala|borderStyle=solid}
sys.ShutdownHookThread {
logInfo("Received SIGTERM, calling streaming stop")
streamingContext.stop(stopSparkContext = true, stopGracefully = true)
logInfo("Application Stopped")
  }

{code}

The logs for this process are:
{code:title=SparkLogs|borderStyle=solid}
```
5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop
15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully
15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) 
from shutdown hook
15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after 
time 144242148
15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 
ms.0 from job set of time 144242148 ms
15/09/16 16:38:00 INFO JobGenerator: Stopped generation timer
15/09/16 16:38:00 INFO JobGenerator: Waiting for jobs to be processed and 
checkpoints to be written
15/09/16 16:38:00 INFO JobScheduler: Added jobs for time 144242148 ms
15/09/16 16:38:00 INFO JobGenerator: Checkpointing graph for time 144242148 
ms
15/09/16 16:38:00 INFO DStreamGraph: Updating checkpoint data for time 
144242148 ms
15/09/16 16:38:00 INFO DStreamGraph: Updated checkpoint data for time 
144242148 ms
15/09/16 16:38:00 INFO SparkContext: Starting job: foreachRDD at 
StreamDigest.scala:21
15/09/16 16:38:00 INFO DAGScheduler: Got job 12 (foreachRDD at 
StreamDigest.scala:21) with 1 output partitions (allowLocal=true)
15/09/16 16:38:00 INFO DAGScheduler: Final stage: ResultStage 12(foreachRDD at 
StreamDigest.scala:21)
15/09/16 16:38:00 INFO DAGScheduler: Parents of final stage: List()
15/09/16 16:38:00 INFO CheckpointWriter: Saving checkpoint for time 
144242148 ms to file 
'hdfs://EMRURL/sparkStreaming/checkpoint/checkpoint-144242148'
15/09/16 16:38:00 INFO DAGScheduler: Missing parents: List()
.
.
.
.
15/09/16 16:38:00 INFO JobGenerator: Waited for jobs to be processed and 
checkpoints to be written
15/09/16 16:38:00 INFO CheckpointWriter: CheckpointWriter executor terminated ? 
true, waited for 1 ms.
15/09/16 16:38:00 INFO JobGenerator: Stopped JobGenerator
15/09/16 16:38:00 INFO JobScheduler: Stopped JobScheduler

```
{code}
And in my spark-defaults.conf I included:
{code:title=spark-defaults.conf|borderStyle=solid}
spark.streaming.stopGracefullyOnShutdown true
{code}

  was:
With Spark 1.4 on a Mesos cluster, I am trying to stop the context with a 
graceful shutdown. I have seen this mailing list thread that [~tdas] addressed:

http://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201505.mbox/%3c176cb228a2704ab996839fb97fa90...@git.apache.org%3E

It introduces a new config that was not documented. However, even with it 
included, the streaming job still stops correctly but the process doesn't die 
afterwards, i.e. the SparkContext is still running. My Mesos UI still sees the 
framework, which is still allocating all the cores it needs.

The code used for the shutdown hook is:

`sys.ShutdownHookThread {
  logInfo("Received SIGTERM, calling streaming stop")
  streamingContext.stop(stopSparkContext = true, stopGracefully = true)
  logInfo("Application Stopped")
}`

The logs for this process are:
```
5/09/16 16:37:51 INFO Start: Received SIGTERM, calling streaming stop
15/09/16 16:37:51 INFO JobGenerator: Stopping JobGenerator gracefully
15/09/16 16:37:51 INFO JobGenerator: Waiting for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO JobGenerator: Waited for all received blocks to be 
consumed for job generation
15/09/16 16:37:51 INFO StreamingContext: Invoking stop(stopGracefully=true) 
from shutdown hook
15/09/16 16:38:00 INFO RecurringTimer: Stopped timer for JobGenerator after 
time 144242148
15/09/16 16:38:00 INFO JobScheduler: Starting job streaming job 144242148 
ms.0 from job set of time 144242148 ms
15/09/16 16:38:00 INFO 

[jira] [Created] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied

2015-09-16 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-10640:
-

 Summary: Spark history server fails to parse taskEndReasonFromJson 
TaskCommitDenied
 Key: SPARK-10640
 URL: https://issues.apache.org/jira/browse/SPARK-10640
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Thomas Graves


I'm seeing an exception from the spark history server trying to read a history 
file:

scala.MatchError: TaskCommitDenied (of class java.lang.String)
at 
org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775)
at 
org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531)
at 
org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied

2015-09-16 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790777#comment-14790777
 ] 

Thomas Graves commented on SPARK-10640:
---

It looks like JsonProtocol.taskEndReasonFromJson is just missing the 
TaskCommitDenied TaskEndReason.

It seems like we should have a better way of handling this in the future too. 
Right now it doesn't load the history file at all, which is pretty annoying. 
Perhaps have a catch-all that skips it or prints "unknown" or something so it 
at least loads. 
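
A hedged illustration of that catch-all idea (the names below are made up for 
the sketch and are not the actual JsonProtocol code): map unrecognized reason 
strings to a placeholder instead of letting a MatchError abort replay of the 
whole file.

{code}
// Illustrative only: a fallback case instead of an exhaustive string match.
sealed trait Reason
case object Succeeded extends Reason
case object CommitDenied extends Reason
case object UnknownReason extends Reason

def reasonFromString(s: String): Reason = s match {
  case "Success"          => Succeeded
  case "TaskCommitDenied" => CommitDenied
  case _                  => UnknownReason   // catch-all: no scala.MatchError
}

// reasonFromString("SomeFutureReason") == UnknownReason, so the file still loads.
{code}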

> Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
> --
>
> Key: SPARK-10640
> URL: https://issues.apache.org/jira/browse/SPARK-10640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>
> I'm seeing an exception from the spark history server trying to read a 
> history file:
> scala.MatchError: TaskCommitDenied (of class java.lang.String)
> at 
> org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775)
> at 
> org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531)
> at 
> org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


